WEB SCRAPING
WITH PYTHON
10/2019
Applied Analytics Club
Set Up
• Google Chrome is needed to follow along with this tutorial.
• *Install the Selector Gadget extension for Chrome as well.*
• If you haven't already, download and install the Anaconda Python 3 version at:
• https://www.anaconda.com/distribution
• Next, use Terminal or Command Prompt to enter the following, one by one:
• pip install bs4
• pip install selenium
• pip install requests
• Download all workshop materials @ bit.ly/2Mmi6vH
• In case of errors, raise your hand and we will come around. If you have successfully completed the install, please assist others.
Contents
■ Define Scraping
■ Python Basic Components (Data Types, Functions, Containers, For Loops)
■ Applications:
– Beautiful Soup
■ Demonstration with Follow Up
■ Practice Exercise
– Selenium
■ Demonstration with Follow Up
■ Practice Exercise
■ Things to keep in mind when scraping (robots.txt)
■ Challenge Introduction
■ Q & A
Web Scraping
■ Used for extracting data from websites
■ Automates the process of gathering data
which is typically only accessible via a web
browser
■ Each website is structured differently, so each requires a slightly modified approach while scraping
■ Not everything can be scraped
Python Basics: Data Types
■ Int e.g. 2, 3, 4
■ Float e.g. 2.0, 3.4, 4.3
■ String e.g. "scraping ftw!", "John Doe"
■ Boolean e.g. True, False
■ Others (Complex, Unicode, etc.)
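A minimal sketch (added here, not on the original slide) showing how Python's built-in type() reports each of these data types:

    # Checking the type of each literal with the built-in type()
    print(type(4))               # <class 'int'>
    print(type(3.4))             # <class 'float'>
    print(type("scraping ftw!")) # <class 'str'>
    print(type(True))            # <class 'bool'>
    print(type(2 + 3j))          # <class 'complex'>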
Python Basics: Functions
■ Functions start with "def" and follow this format:
– def function1(parameter1, parameter2):
      answer = parameter1 + parameter2
      return answer
■ There are two ways to call functions:
1. Directly, e.g. function1()
   – E.g. type(5) # int
2. As a method on an object, e.g. object.function1()
   – E.g. "python".upper() # "PYTHON"
– The two styles are used under different circumstances (examples to come later)
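A minimal sketch (the function name is illustrative, not from the slides) showing a function definition and both calling styles:

    # Defining a simple function, then calling it directly
    def add_numbers(parameter1, parameter2):
        answer = parameter1 + parameter2
        return answer

    print(add_numbers(2, 3))   # 5  -- direct call, function1() style
    print(type(5))             # <class 'int'> -- a built-in called directly
    print("python".upper())    # PYTHON -- a method called on a string object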
Python Basics: Lists
■ A type of data container used to store multiple items at the same time
■ Mutable (can be changed)
■ Comparable to R's vector
– E.g. list1 = [0,1,2,3,4]
■ Can contain items of varying data types
– E.g. list2 = [6, 'harry', True, 1.0]
■ Indexing starts at 0
– E.g. list2[0] # 6
■ A list can be nested in another list
– E.g. [1, [98,109], 6, 7]
■ Call the "append" method to add an item to a list
– E.g. list1.append(5)
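A short illustrative sketch (variable names follow the slide) of building and modifying a list:

    # Lists hold multiple items, including mixed data types
    list1 = [0, 1, 2, 3, 4]
    list2 = [6, 'harry', True, 1.0]

    print(list2[0])        # 6 -- indexing starts at 0
    list1.append(5)        # list1 is now [0, 1, 2, 3, 4, 5]

    nested = [1, [98, 109], 6, 7]
    print(nested[1][0])    # 98 -- indexing into the inner list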
Python Basics: Dictionaries
■ Collection of key-value pairs
■ Very similar to JSON objects
■ Mutable
■ E.g. dict1 = {'r': 4, 'w': 9, 't': 5}
■ Indexed with keys
– E.g. dict1['r'] # 4
■ Keys are unique
■ Values can be lists or other nested dictionaries
■ A dictionary can also be nested in a list, e.g. [{3:4, 5:6}, 6, 7]
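A short illustrative sketch (following the slide's dict1 example) of creating and using a dictionary:

    # A dictionary maps unique keys to values
    dict1 = {'r': 4, 'w': 9, 't': 5}

    print(dict1['r'])              # 4 -- look up a value by its key
    dict1['x'] = [1, 2, 3]         # values can be lists or nested dictionaries

    nested = [{3: 4, 5: 6}, 6, 7]  # a dictionary nested inside a list
    print(nested[0][3])            # 4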
Python Basics: For Loops
■ Used for iterating over a sequence (a list, a tuple, a dictionary, a set, or a string)
■ E.g.
– cities_list = ['hong kong', 'new york', 'miami']
– for item in cities_list:
      print(item)
# hong kong
# new york
# miami
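A small extra sketch (not on the original slide) showing that the same for syntax also iterates over a dictionary:

    dict1 = {'r': 4, 'w': 9, 't': 5}

    # Looping over a dictionary yields its keys
    for key in dict1:
        print(key, dict1[key])     # r 4, then w 9, then t 5

    # .items() yields (key, value) pairs directly
    for key, value in dict1.items():
        print(key, value)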
Beautiful Soup
■ Switch to Jupyter Notebook
– Open Anaconda
– Launch Jupyter Notebook
– Go to IMDB's Top 250 drama movies:
■ https://www.imdb.com/search/title?genres=drama&groups=top_250&sort=user_rating,desc
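A minimal requests + BeautifulSoup sketch of the kind walked through in the notebook; the "lister-item-header" class reflects IMDB's search-result markup at the time of the workshop and is an assumption, so verify the selector in your browser if the page has changed:

    import requests
    from bs4 import BeautifulSoup

    url = ("https://www.imdb.com/search/title"
           "?genres=drama&groups=top_250&sort=user_rating,desc")
    response = requests.get(url)

    # Parse the HTML and print each movie title found on the page
    soup = BeautifulSoup(response.text, "html.parser")
    for header in soup.find_all("h3", class_="lister-item-header"):
        print(header.a.get_text(strip=True))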
Selenium
■ Download the Chrome web driver from
– http://chromedriver.chromium.org/downloads
■ Place the driver in your working directory
■ Continue with Jupyter Notebook
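A minimal Selenium sketch, assuming the Selenium 3.x API that was current at the time and a chromedriver binary sitting in the working directory; the URL and class name are illustrative assumptions:

    import time
    from selenium import webdriver

    # Assumes ./chromedriver is the driver downloaded above (Selenium 3.x style)
    driver = webdriver.Chrome(executable_path="./chromedriver")

    driver.get("https://www.imdb.com/search/title"
               "?genres=drama&groups=top_250&sort=user_rating,desc")
    time.sleep(2)  # give the page a moment to render

    # Selenium 3.x element lookup; the class name may have changed since
    for item in driver.find_elements_by_class_name("lister-item-header"):
        print(item.text)

    driver.quit()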
Scraping Ethics
■ Be respectful of websites' permissions
■ View the website's robots.txt file to learn which areas of the site are allowed or disallowed from scraping
– You can access this file by replacing sitename.com in the following: www.[sitename.com]/robots.txt
– E.g. IMDB's robots.txt can be found at https://www.imdb.com/robots.txt
– You can also use https://canicrawl.com/ to check if a website allows scraping
■ Don't overload website servers by sending too many requests. Use the "time.sleep(xx)" function to delay requests (see the sketch below).
– This will also help prevent your IP address from being banned
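A short sketch (not from the slides) using the standard-library urllib.robotparser to check permissions programmatically, plus time.sleep to space out requests; the URLs checked are illustrative:

    import time
    import urllib.robotparser

    # Read and parse the site's robots.txt
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://www.imdb.com/robots.txt")
    rp.read()

    # can_fetch() reports whether a given user agent may crawl a path
    print(rp.can_fetch("*", "https://www.imdb.com/search/title"))

    # When fetching several pages, pause between requests
    for page_number in range(1, 4):
        # ... request page_number here ...
        time.sleep(2)  # be polite: wait a couple of seconds between requests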
Interpreting the robots.txt file
■ All pages of the website can be scraped if you see the following:
– User-agent: *
– Disallow:
■ None of the pages of the website can be scraped if you see the following:
– User-agent: *
– Disallow: /
■ Example from IMDB's robots.txt (shown on the slide)
– The sub-directories listed there are disallowed from being scraped
Take-home Challenge
■ Scrape a fictional book store: http://books.toscrape.com/
■ Use what you have learned to efficiently scrape the following data for Travel, Poetry, Art, Humor and Academic books:
– Book Title
– Product Description
– Price (excl. tax)
– Number of Reviews
■ Store all of the data in a single Pandas DataFrame (a starter sketch follows below)
■ The most efficient scraper will be awarded a prize
■ The deadline for submissions is one week from today, 4/18/2019, 11:59pm
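A possible starting point, not a full solution: a requests + BeautifulSoup sketch that collects titles and prices from the home page into a Pandas DataFrame. The "product_pod" and "price_color" classes reflect books.toscrape.com's usual markup but should be verified in the browser; extending this to the listed categories and the remaining fields is the challenge itself:

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    response = requests.get("http://books.toscrape.com/")
    soup = BeautifulSoup(response.text, "html.parser")

    rows = []
    for book in soup.find_all("article", class_="product_pod"):
        rows.append({
            "title": book.h3.a["title"],  # full title sits in the link's title attribute
            "price": book.find("p", class_="price_color").get_text(),
        })

    df = pd.DataFrame(rows)
    print(df.head())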
Resources
■ https://github.com/devkosal/scraping_tutorial
– All code provided in this lecture can be found here
■ http://toscrape.com/
– Great sample websites to perform beginner to intermediate scraping on
■ https://www.edx.org/course/introduction-to-computer-science-and-programming-using-python-0
– Introduction to Computer Science using Python
– Highly recommended course for learning Python and CS from scratch
■ https://www.promptcloud.com/blog/how-to-read-and-respect-robots-file/
– Further reading on interpreting robots.txt
■ https://canicrawl.com/
– Check scraping permissions for any website