How to Scrape Data as an Economics Student
Nikolay Tretyakov
04.12.2018
Department of Economics
University of Ioannina
Agenda
1. Introduction
2. Theoretical Background
2.1. Language behind HTML code
2.2. Navigating HTML using XPath
3. No Coding Tricks
3.1. Chrome/Firefox extensions
3.2. Google Spreadsheets functions
4. Scraping static pages using Python
4.1. urllib, Requests
4.2. Beautiful Soup
5. Scraping dynamic pages using Python
5.1. Selenium WebDriver
6. Conclusion
7. References
Introduction
- Erasmus trainee from Otto-von-Guericke Universität, Magdeburg, Germany
- Researching the tourism industry in Epirus, Greece
- Scraping is 70 percent of the work
- Scraped data supports a wide variety of analyses:
  - Descriptive statistics
  - Sentiment analysis
  - Seasonality
Theoretical Background
- Web scraping, or web harvesting: methods of extracting data from across the internet, mostly with software that simulates user behavior
- Web crawler, also called a spider or web robot: a program that browses the World Wide Web in a methodical manner
- Among the most advanced web crawlers: the one behind the Google search engine
Language behind HTML code
- Access the HTML code: view the page source in the browser
- XML stands for eXtensible Markup Language
- (W3C) DOM: an API that treats an XML or HTML document as a tree in which each node is an object representing part of the document
Example of XML structure:
<?xml version="1.0" encoding="utf-8"?>
<destinationslist>
<dest>
<dest_en> Ioannina </dest_en>
<dest_ru> Янина </dest_ru>
<dest_gr> Ιωάννινα </dest_gr>
</dest>
</destinationslist>
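To make the DOM idea concrete, here is a minimal sketch that loads the example XML with Python's standard xml.dom.minidom module and walks its nodes. The variable names are illustrative and not from the original slides.

```python
from xml.dom import minidom

# The example XML from the slide, kept as a string for a self-contained demo.
XML_TEXT = """<?xml version="1.0" encoding="utf-8"?>
<destinationslist>
  <dest>
    <dest_en> Ioannina </dest_en>
    <dest_ru> Янина </dest_ru>
    <dest_gr> Ιωάννινα </dest_gr>
  </dest>
</destinationslist>"""

# parseString builds the DOM tree; passing bytes lets the parser
# apply the encoding named in the XML declaration.
document = minidom.parseString(XML_TEXT.encode("utf-8"))
root = document.documentElement

# Every part of the document is a node object: elements, text, and so on.
names = {}
for dest in root.getElementsByTagName("dest"):
    for child in dest.childNodes:
        # Skip the whitespace-only text nodes between the elements.
        if child.nodeType == child.ELEMENT_NODE:
            names[child.tagName] = child.firstChild.data.strip()

print(root.tagName)
print(names)
```

Each language-specific tag ends up as a dictionary key, which shows how the tree structure maps onto ordinary Python data.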
Navigating HTML using XPath
XPath query:
/html[@class='js bootstrap-anchors-processed']/body[@class='html not-front not-logged-in one-sidebar sidebar-first page-traineeships navbar-is-fixed-top']/div[@class='main-container container']/div[@class='row']/div[@class='region region-content col-sm-9']/section[@id='block-system-main']/div[@class='view view-erasmusintern-traineeships-search view-id-erasmusintern_traineeships_search view-display-id-page media-list-container view-dom-id-bfcb6580db560c43fd748355dce05662']/div[@class='view-content']/div[1]/div[@class='node node-traineeship view-mode-media_list clearfix']/div[@class='row media-list-items']/div[@class='col-md-12']/div[@class='ds-header inline-header-content']/div[@class='field field-name-title field-type-ds field-label-hidden pull-left']/div[@class='field-items']/div[@class='field-item even']/h3[@class='dot-title']/a
Key characters:
/ : starts at the root; between steps, moves to a child node
// : matches nodes at any depth below the current point
@ : selects an attribute
[] : predicate, answers the question "which one?"
* : wildcard, matches any element
Example XPath: //div[@class='field-item even']/h3[@class='dot-title']/a[1]
Types of nodes:
- element
- attribute
- text
- namespace
- processing instruction
- comment
- document (root) node
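The short XPath expression can be tried out with Python's standard xml.etree.ElementTree module, as in the sketch below. The markup is a hypothetical, simplified fragment modeled on the class names in the query, not the real page. ElementTree supports only a subset of XPath and needs well-formed markup, so real-world HTML usually calls for a dedicated parser instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment that mimics the structure targeted by the query.
HTML_FRAGMENT = """
<html>
  <body>
    <div class="field-items">
      <div class="field-item even">
        <h3 class="dot-title"><a href="/traineeship/1">Marketing trainee</a></h3>
      </div>
    </div>
    <div class="field-items">
      <div class="field-item even">
        <h3 class="dot-title"><a href="/traineeship/2">Data analyst trainee</a></h3>
      </div>
    </div>
  </body>
</html>
"""

root = ET.fromstring(HTML_FRAGMENT)

# ".//" searches at any depth, [@class='...'] filters on an attribute value,
# and [1] keeps the first <a> child of each matching <h3>.
links = root.findall(".//div[@class='field-item even']/h3[@class='dot-title']/a[1]")

titles = [link.text for link in links]
hrefs = [link.get("href") for link in links]
print(titles)
print(hrefs)
```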
No Coding Tricks
- Using browser extensions (demo1):
  - XPath Helper (Google Chrome)
  - SelectorGadget
  - ChroPath (Firefox, Chrome)
  - Firefox Quantum Developer Edition (successor to Firebug)
- IMPORTXML() in Google Spreadsheets:
  - arguments: url and xpath_query
  - related functions: IMPORTFEED() and IMPORTHTML()
- Third-party services with a free pricing tier
Scraping static pages using Python
- Requests and urllib libraries
- Beautiful Soup library
Workflow:
1. Get to the desired URL
2. Scrape the existing content on the page
3. Save the JSON or CSV file
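The three workflow steps can be sketched with the standard library alone (urllib, html.parser, csv). This is an illustration under stated assumptions rather than the code from the talk: the function names, the TitleParser class, and the sample markup are all hypothetical, and the fetch step is defined but not called so the example runs offline.

```python
import csv
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collects the text of every <a> found inside an <h3 class="dot-title">."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.in_link = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3" and dict(attrs).get("class") == "dot-title":
            self.in_title = True
        elif tag == "a" and self.in_title:
            self.in_link = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False
        elif tag == "h3":
            self.in_title = False

    def handle_data(self, data):
        if self.in_link:
            self.titles[-1] += data


def fetch(url):
    """Step 1: get to the desired URL and return the page as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape(html):
    """Step 2: scrape the existing content on the page."""
    parser = TitleParser()
    parser.feed(html)
    return [title.strip() for title in parser.titles]


def save(titles, path):
    """Step 3: save the result, here as a one-column CSV file."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["title"])
        writer.writerows([title] for title in titles)


# Offline demonstration on a hypothetical snippet of markup.
SAMPLE_HTML = '<div><h3 class="dot-title"><a href="#">Manager</a></h3></div>'
titles = scrape(SAMPLE_HTML)
print(titles)
```

In a real run the three calls would be chained: fetch the page, pass the text to scrape, then hand the list to save.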
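Scraping static pages using BS4 + Requests
Python source code:
import requests
from bs4 import BeautifulSoup as BS

def save_the_JSON():
    ...  # body elided on the original slide (shown there as ***)

# erasmus_intern_url is not defined on the slide; it must hold the
# address of the traineeship search page before this code runs.
response = requests.get(erasmus_intern_url)
# Parse the response body, not the response object itself.
soup = BS(response.text, "lxml")
scrape_results = soup.find_all("div", class_="field-item even")
for element in scrape_results:
    heading = element.find("h3", class_="dot-title")
    if heading is None:
        continue  # not every matching div contains a title
    title = heading.find_next("a").get_text()
save_the_JSON()
Example of JSON file:
{
  "results": [
    {
      "id": 1,
      "city": "Berlin",
      "company": "UIZ",
      "title": "Manager"
    }
  ]
}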
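The body of the saving step is not shown on the slide. One possible way to produce a file in the shape of the JSON example is sketched below with the standard json module. The names build_results and save_json are hypothetical stand-ins, and numbering the rows from 1 is an assumption based on the single sample record.

```python
import json


def build_results(rows):
    """Number the scraped rows from 1 and wrap them in a "results" object."""
    return {"results": [{"id": index, **row} for index, row in enumerate(rows, start=1)]}


def save_json(rows, path):
    """Hypothetical stand-in for the elided saving step: write the rows to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps non-Latin text (Greek, Cyrillic) readable in the file.
        json.dump(build_results(rows), handle, ensure_ascii=False, indent=2)


# Sample row copied from the JSON example on the slide.
rows = [{"city": "Berlin", "company": "UIZ", "title": "Manager"}]
document = build_results(rows)
print(json.dumps(document, ensure_ascii=False, indent=2))
```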
Crawling dynamic web pages using Python
- Selenium WebDriver (source, output, demo2):
  - Initially built for automated tests
  - Drives a real browser, so the page's AJAX requests are executed
  - Submits forms, clicks buttons, closes pop-ups
  - Waits are important (pausing between requests is recommended when sending more than 20 HTTP requests per minute)
- Requests-HTML package:
  - handles simple JavaScript calls
  - introduced in early 2018
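The advice on waits can be turned into a small throttle that spaces requests out. The sketch below reads the 20-requests-per-minute figure as an upper limit, which is an interpretation of the slide rather than a documented rule. The Throttle class is hypothetical, and its clock and sleep parameters exist only so the behavior can be checked without real delays.

```python
import time


class Throttle:
    """Spaces out calls so that at most max_per_minute happen per minute."""

    def __init__(self, max_per_minute=20, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        """Sleep just long enough to keep the minimum interval; return the delay used."""
        now = self.clock()
        delay = 0.0
        if self.last_call is not None:
            delay = max(0.0, self.min_interval - (now - self.last_call))
        if delay > 0:
            self.sleep(delay)
        # Record when this call is considered to have happened.
        self.last_call = now + delay
        return delay


throttle = Throttle(max_per_minute=20)
print(throttle.min_interval)  # seconds to leave between consecutive requests
```

Calling throttle.wait() before every page load or HTTP request keeps the scraper under the chosen limit, whichever library performs the request.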
Web etiquette and the cat-and-mouse game
- Websites protect themselves!
  - crawling brings extra load to servers
  - copyright issues
  - loss of income
- Sanctions range from a temporary IP ban to legal action
- Respect robots.txt and the Terms of Service (ToS)
- Try to use an official API if one is available
- Do not publish copyrighted information you have scraped
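Respecting robots.txt can be automated with Python's standard urllib.robotparser module. The following sketch parses a hypothetical robots.txt given as a string, so it runs offline. The bot name and the example.org URLs are placeholders, not values from the talk.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# can_fetch answers whether the given user agent may request the given URL.
allowed = parser.can_fetch("my-research-bot", "https://example.org/traineeships")
blocked = parser.can_fetch("my-research-bot", "https://example.org/private/data")

# crawl_delay returns the requested pause in seconds, or None if none is set.
delay = parser.crawl_delay("my-research-bot")

print(allowed, blocked, delay)
```

Against a live site, set_url() followed by read() downloads the real robots.txt instead of parsing a string.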
Conclusion
- Automate the boring stuff: https://automatetheboringstuff.com/
- Apply scraping only if it is worth it
- Try Erasmus!
References
1. https://benbernardblog.com/web-scraping-and-crawling-are-perfectly-legal-right/
2. https://erasmusintern.org/traineeships?search_api_views_fulltext=&field_traineeship_full_location_field_traineeship_location_count=242
3. https://www.slideshare.net/anniecushing/web-scraping-for-codeophobes
4. https://www.pythonforbeginners.com/requests/using-requests-in-python
5. https://www.techopedia.com/definition/5212/web-scraping
6. https://www.crummy.com/software/BeautifulSoup/bs4/doc/
7. https://selenium-python.readthedocs.io/
