Web Scraping
Gathering Data from Websites
M.MOHAMED MUSTHAFA
Web Scraping - What We’ll Cover
1. Build a data corpus of congressional press releases
2. Use APIs to gather latitude and longitude -- using JSON formatted data
3. A brief hands-on introduction to HTML parsing
4. APIs and Documentation (FTP) -- OpenSecrets.org
5. Discussion of APIs and Social Media data gathering
6. A brief discussion on the ethics of scraping
This is not a programming workshop, but...
1. We will discuss Python and BeautifulSoup
2. We will not learn or use Python in the workshop
3. However, some automation tools are used in this workshop
4. Web Scraping is about deconstructing websites. Effective scraping requires
learning about technical infrastructure as well as subject content
5. Not a workshop on Text Analysis (tools that calculate or correlate your data)
6. Not a workshop on data cleaning
Technical Definitions
Deconstruction v Construction
Definitions
● Scraping
Using tools to gather data you can see on
a webpage
A wide range of web scraping techniques
and tools exist. These can be as simple
as copy/paste and increase in complexity
to automation tools, HTML parsing, APIs
and programming
Scraping propolis from the sides of the bee box
Image by Abalg~commonswiki
Definitions
● HTTP
HyperText Transfer Protocol
The protocol machines use to exchange
information over the Internet; it carries the
multimedia data exchange known as the
World Wide Web (WWW).
The protocol defines authentication,
requests, status codes, persistent
connections, the client/server
request/response cycle, etc.
By default a client reaches a server on port 80
(port 443 for HTTPS); the response declares
its document type (HTML, XML, JSON, etc.)
Images from commons.wikimedia.org
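The pieces of HTTP named above (a request, a default port, status codes) can be sketched with Python's standard library. This is a minimal illustration, not part of the workshop tooling: the URL and the User-Agent string are made up, and nothing is sent over the network.

```python
from http import HTTPStatus
from urllib.parse import urlsplit
from urllib.request import Request

# Hypothetical URL, used only for illustration; no request is actually sent.
url = "http://example.com/press-releases?page=2"

# A client request: method, target, and headers.
request = Request(url, headers={"User-Agent": "workshop-demo"}, method="GET")

parts = urlsplit(url)
# urlsplit reports a port only when the URL names one, so fall back to the
# scheme defaults: 80 for http, 443 for https.
port = parts.port or {"http": 80, "https": 443}[parts.scheme]

print(request.get_method(), request.host, port)
print(request.get_header("User-agent"))  # urllib stores header names capitalized

# Status codes are how the server reports the outcome of a request.
for status in (HTTPStatus.OK, HTTPStatus.NOT_FOUND):
    print(status.value, status.phrase)
```

A scraper checks the status code before parsing a response: a 200 means there is a document to parse, a 404 means the page is missing.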
Definitions
● HTML
HyperText Markup Language
The standard markup language on the
Web
As the web evolves so does the
proliferation of technical wrappers
surrounding the visible content of websites
(text and data)
Definitions
● Parsing
The act of analyzing strings and
symbols to reveal only the data you need
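Parsing, as defined above, can be sketched with Python's built-in html.parser module. The markup is a hypothetical page fragment invented for illustration, and the LinkParser class is an assumption of this sketch, not a tool used in the workshop.

```python
from html.parser import HTMLParser

# Hypothetical page fragment, invented for illustration.
SAMPLE_HTML = """
<div class="nav"><a href="/about">About</a></div>
<ul class="releases">
  <li><a href="/press/1">Statement on the budget</a></li>
  <li><a href="/press/2">Statement on energy</a></li>
</ul>
"""

class LinkParser(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

parser = LinkParser()
parser.feed(SAMPLE_HTML)

# Keep only the data we need: links that point at press releases.
press_links = [(href, text) for href, text in parser.links
               if href.startswith("/press/")]
print(press_links)
```

The parser sees all three links on the page; the final filter is where "only the data you need" happens.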
Definitions
● Crawling
Moving across or through a website in an
attempt to gather data from more than one
URL or page
Image by Dave Gingrich
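Crawling can be sketched as a breadth-first walk over links. To keep the example runnable without network access, the "site" here is a hypothetical in-memory dictionary of paginated press-release pages; the URLs and the crawl function are assumptions made for illustration.

```python
from collections import deque

# Hypothetical site held in memory; each URL maps to the URLs that page links to.
SITE = {
    "/press?page=1": ["/press/1", "/press/2", "/press?page=2"],
    "/press?page=2": ["/press/3", "/press?page=1"],
    "/press/1": [],
    "/press/2": [],
    "/press/3": ["/press?page=1"],
}

def crawl(start):
    """Visit every page reachable from start exactly once, breadth-first."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in SITE.get(url, []):
            if link not in seen:      # skip pages already queued or visited
                seen.add(link)
                queue.append(link)
    return order

visited = crawl("/press?page=1")
print(visited)
```

The seen set is what keeps the crawler from looping when pages link back to each other, as pagination links usually do. A real crawler would fetch each URL instead of reading a dictionary, and would pause between requests.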
Definitions
● JSON
JavaScript Object Notation
Readable text used to transmit data
objects consisting of attribute-value pairs
{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "office",
      "number": "646 555-4567"
    }
  ],
  "children": [],
  "spouse": null
}
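A short sketch of how a program reads the record above, using Python's standard json module. The variable names are illustrative; the record text is the same data, compacted.

```python
import json

record_text = """
{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 25,
  "address": {"streetAddress": "21 2nd Street", "city": "New York",
              "state": "NY", "postalCode": "10021-3100"},
  "phoneNumbers": [
    {"type": "home", "number": "212 555-1234"},
    {"type": "office", "number": "646 555-4567"}
  ],
  "children": [],
  "spouse": null
}
"""

record = json.loads(record_text)

# Objects become dicts, arrays become lists, true/null become True/None.
city = record["address"]["city"]
phones = {p["type"]: p["number"] for p in record["phoneNumbers"]}

print(city)
print(phones["office"])
print(record["isAlive"], record["spouse"])
```

Reaching a nested value is a matter of following keys and list positions, which is the same idea the later geocoding demonstration relies on.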
Definitions
● API
Application Programming Interface
A set of rules and protocols that lets one
piece of software request services or data from
another. In the context of web scraping, an API
is a way to gather clean, structured data from a
website (i.e. data that is not wrapped in HTML
or JavaScript), typically returned as JSON or XML.
Image by Tsahi Levent-Levi
Webscraper.io
Demonstration & Hands-on
Demo: Scraping Congressional Press Releases
● Representative Nancy Pelosi’s Press Releases
○ CONTENT
■ Structure of the Press Release subsection of the site
● Pagination
● Links to each release
● Common elements of release
○ TOOLS
■ Webscraper.io tool works inside of Chrome
● Tutorials
● Documentation
● Community
● Free or, alternatively, Fee for Service
Now You Try It
1. http://v.gd/webscraping1111
2. Follow the download & installation instructions: step 9a
3. Find some congressional press release sites: step 9b
4. Follow the instructions: steps 1-8
Google Sheets
Demonstration
ImportHTML
ImportXML
Google Sheets -- ImportHTML
● Example Site: http://www.boxofficemojo.com/
● IMPORTHTML(url, query, index) -- query is "table" or "list"; index starts at 1
● In Practice:
=IMPORTHTML("http://www.boxofficemojo.com/","table",3)
With this URL stored in cell A6:
http://www.boxofficemojo.com/movies/?page=intl&id=annie2014.htm
=IMPORTHTML(A6,"table",6)
Help Docs: https://support.google.com/docs/answer/3093339?hl=en
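What "table" and index mean in IMPORTHTML can be sketched in Python. This is an assumption-laden illustration of the idea (pick the n-th table on a page and return its rows), not how Google Sheets implements the function. The markup, TableParser, and import_table are all hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical markup with two tables, invented for illustration.
SAMPLE_HTML = """
<table><tr><td>menu</td></tr></table>
<table>
  <tr><th>Title</th><th>Gross</th></tr>
  <tr><td>Film A</td><td>100</td></tr>
  <tr><td>Film B</td><td>80</td></tr>
</table>
"""

class TableParser(HTMLParser):
    """Collect every table as a list of rows, each row a list of cell strings.

    Nested tables are not handled; this is a sketch for flat tables only.
    """

    def __init__(self):
        super().__init__()
        self.tables = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self.tables[-1][-1].append("")
            self._in_cell = True

    def handle_data(self, data):
        if self._in_cell:
            self.tables[-1][-1][-1] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

def import_table(html, index):
    """Return the index-th table on the page (1-based, like the spreadsheet function)."""
    parser = TableParser()
    parser.feed(html)
    return parser.tables[index - 1]

rows = import_table(SAMPLE_HTML, 2)
print(rows)
```

Finding the right index on a real page usually takes trial and error, because layout tables and navigation menus count too.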
Google Sheets -- ImportXML
● Example Site: http://www.nytimes.com/
● IMPORTXML(url, xpath_query)
● In Practice:
=IMPORTXML("http://nytimes.com/", "//*/p[contains(@class,'summary')]")
Help Docs: https://support.google.com/docs/answer/3093342?hl=en
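The XPath query above selects every p element whose class attribute contains "summary". A minimal sketch of the same selection in Python follows, using the standard xml.etree.ElementTree module on hypothetical, well-formed markup. ElementTree's XPath subset has no contains(), so the class test is written in Python; that substitution is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup invented for illustration.
SAMPLE = """
<html><body>
  <p class="summary lead">First story summary</p>
  <p class="byline">By a reporter</p>
  <div><p class="summary">Second story summary</p></div>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# Walk every <p> at any depth and keep those whose class attribute
# mentions "summary", mirroring //*/p[contains(@class,'summary')].
summaries = [p.text for p in root.iter("p") if "summary" in p.get("class", "")]
print(summaries)
```

Real web pages are often not well-formed XML, so ElementTree may reject them; a dedicated HTML parser is the usual choice outside a spreadsheet.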
Resources
YouTube Videos
● Web Scraping with Google Sheets
● Importing Data 4 ImportXML
● Web scraping using Google Docs - Xpath
Other Resources
● CSS Tutorial
● XPath
● XPath Language defined by W3C
Your Turn:
● https://github.com/data-and-visualization/Rfun
APIs, Parsing, & JSON
OpenRefine & JSON file format
● Demonstration (http://v.gd/parsing3333)
○ A step-by-step guide to using OpenRefine to gather JSON data via the Google Maps API, then
parse the JSON for latitude & longitude
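The two halves of that demonstration (ask an API for JSON, then pull latitude and longitude out of the reply) can be sketched in Python. The endpoint and response shape follow the Google Geocoding API as publicly documented; the address, the YOUR_API_KEY placeholder, and the sample response are assumptions, and the coordinates are approximate values included only so the sketch runs without a network call. The workshop guide itself may use a different endpoint.

```python
import json
from urllib.parse import urlencode

# Build the request URL. The address and YOUR_API_KEY are placeholders.
base = "https://maps.googleapis.com/maps/api/geocode/json"
query = urlencode({"address": "Washington, DC", "key": "YOUR_API_KEY"})
request_url = f"{base}?{query}"
print(request_url)

# Abbreviated sample shaped like a geocoding response (illustrative values).
sample_response = """
{
  "results": [
    {"formatted_address": "Washington, DC, USA",
     "geometry": {"location": {"lat": 38.9072, "lng": -77.0369}}}
  ],
  "status": "OK"
}
"""

data = json.loads(sample_response)
if data["status"] == "OK":
    # Follow the path results -> first match -> geometry -> location.
    location = data["results"][0]["geometry"]["location"]
    lat, lng = location["lat"], location["lng"]
    print(lat, lng)
```

In OpenRefine the same path is typically written as a GREL expression along the lines of value.parseJson().results[0].geometry.location.lat.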
Parsing HTML
OpenRefine & ParseHtml
● BeautifulSoup Libraries
○ Refine supports Jython expressions and bundles jsoup
○ jsoup is a Java library for HTML extraction, comparable in purpose to Python's BeautifulSoup
● Resources (OpenRefine)
○ Step-by-step example documented in the demonstration above
● Documentation
○ Refine’s documentation on HTML Parsing
○ jsoup Documentation
● Now You Try it -- http://v.gd/parsing2222
Case Study: OpenSecrets
and documentation
OpenSecrets API and FTP
● OpenSecrets tracks the effects of money and lobbying in elections and
politics
● OpenSecrets has an API
● OpenSecrets API Documentation
● OpenSecrets Bulk Data downloader
○ Login
○ Lobby.zip
Social Media
Social Media
1. Many ways to gather social media data
a. IFTTT where you compose rules to connect sites and can deposit data in a spreadsheet
b. APIs - often require registered keys
c. Buy your data from a service such as GNIP
2. After you download it you may want to perform analysis
a. Sentiment Analysis, Word Frequency, Correlation, etc.
b. Text Analysis tools (from Digital Humanities LibGuide)
c. Digital Studio’s program on working with texts: comparing and choosing text analysis tools
TAGS: a tool for collecting Twitter streams
● TAGS (“New Sheets”; Version 6.0ns) - https://tags.hawksey.info/
○ Form driven (not command line)
○ Minimal setup
○ Data are collected in Google Sheets
○ Gather Twitter stream data by type
■ screen-name stream data
■ screen-name status updates
■ Twitter user favorited tweets
■ Search term for last 7 days: hashtag stream, username, boolean logic
■ Limit by date
■ Schedule to run hourly - set your interval, or run once.
■ 3-minute setup video; easy to use - https://youtu.be/Vm0kjAvH5HM
■ Outputs: raw CSV structured data, plus default social graph visualizations
Thank You!
Please complete feedback forms

Editor's Notes

  • #26 (OpenSecrets API and FTP): The main point of this slide is that making data available via [public] APIs is an evolving trend. There are older standards, like FTP, which still facilitate the release of information and clean data. Sometimes the data are right under your nose if you simply read the available documentation. In this example, the APIs and webpages do not support a basic data need, but the Bulk Downloader -- once you have registered and logged in -- does make data available in the needed format/configuration. https://www.opensecrets.org/resources/create/data.php