Web Scraping With Python

Robert Dempsey

Important Disclaimer
• There is a lot of data provided freely on the Internet.
• Not all data is free, and not all site owners allow you to scrape data from their sites.
• ALWAYS check the terms of service for a website BEFORE scraping it.
• Be responsible, and stay within legal limits at all times.

Data Wranglers LinkedIn Group
Where the discussions happen.

Group Rules
• If you have a question – ask it.
• Be polite and courteous to others.
• Set your cell phone to vibrate when you come to the meeting.
• You know more than you think. At some point, I’d like you to share something you’ve learned with us so we can all benefit from it.

Connecting to the Internet
• Wireless Network: Logik_guest
• Password: logik1234

XPath
XPath Helper – Adam Sadovsky
XPath finder
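
To make concrete what an XPath finder hands back, here is a minimal sketch of evaluating an XPath expression in Python. It assumes the standard library's xml.etree.ElementTree, which supports only a subset of XPath and needs well-formed markup; real-world HTML usually calls for a more forgiving parser. The table, its id, and the class names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for part of a page.
SNIPPET = """
<html>
  <body>
    <table id="companies">
      <tr><td class="rank">1</td><td class="name">Acme</td></tr>
      <tr><td class="rank">2</td><td class="name">Globex</td></tr>
    </table>
  </body>
</html>
"""

root = ET.fromstring(SNIPPET)

# XPath: every <td class="name"> anywhere inside the table with id="companies".
NAME_XPATH = ".//table[@id='companies']//td[@class='name']"
names = [cell.text for cell in root.findall(NAME_XPATH)]
print(names)
```

The expression is the part a browser extension helps you discover; the surrounding code only applies it.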

DIY Scraper - Python
• Our method: BeautifulSoup4 + Python libraries
• Scrapy
  • Application framework (you still have to code)
  • http://scrapy.org

DIY Scraper - Ruby
• Bare Metal = Nokogiri + Mechanize
• Frameworks
  • Upton: https://github.com/propublica/upton
  • Wombat: https://github.com/felipecsl/wombat

Browser Extensions For Scraping
Scraper
https://chrome.google.com/webstore/detail/scraper/mbigbapnjcgaffohmbkdlecaccepngjd

Grabbing The Full Monty
SiteSucker: sitesucker.us
Wget: http://www.gnu.org/s/wget/

The Ways Websites Try To Block Us
• CSS sprites
• Honeypots
• IP blocking
• CAPTCHA
• Login
• Ad popups

NetShade
http://raynersoftware.com/netshade/
WinGate
http://www.wingate.com/

Installs
• Continuum.io: Anaconda
  • http://continuum.io/downloads
• BeautifulSoup
  • http://www.crummy.com/software/BeautifulSoup/
  • pip install beautifulsoup4
  • easy_install beautifulsoup4
• Unicodecsv
  • pip install unicodecsv

General Steps
• Find the webpage(s) you want
• Get the path to the data using XPath or CSS selectors
• Write the code
• Test
• Scrape
• Export to CSV
• Enjoy your data!
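
The steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the presenter's code: the HTML is an inline stand-in for a fetched page, the CSS selectors and column names are invented, and the standard csv module is swapped in for unicodecsv because Python 3 strings are already Unicode.

```python
import csv
import io

from bs4 import BeautifulSoup

# Stand-in for a downloaded page; a real run would fetch this over HTTP.
PAGE = """
<table id="list">
  <tr class="company"><td class="rank">1</td><td class="name">Acme</td></tr>
  <tr class="company"><td class="rank">2</td><td class="name">Globex</td></tr>
</table>
"""


def scrape_rows(html):
    """Return one dict per table row matched by the CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select("tr.company"):  # hypothetical selector
        rows.append({
            "rank": tr.select_one("td.rank").get_text(strip=True),
            "name": tr.select_one("td.name").get_text(strip=True),
        })
    return rows


def to_csv(rows, out):
    """Write the scraped rows to any file-like object as CSV."""
    writer = csv.DictWriter(out, fieldnames=["rank", "name"])
    writer.writeheader()
    writer.writerows(rows)


rows = scrape_rows(PAGE)
buffer = io.StringIO()  # use open("companies.csv", "w", newline="") for a file
to_csv(rows, buffer)
print(buffer.getvalue())
```

Keeping the parsing and the export in separate functions makes the "Test" step easier: the selector logic can be checked against a saved page before any live scraping happens.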

#1: Scraping the Inc. 5000 with Scraper
1. Ensure you’ve installed the extension
2. Log in to Google Docs (this is where the data goes)
3. Open the URL: http://www.inc.com/inc5000/list
4. Highlight the first line
5. Right-click and select “Scrape Similar”
6. Verify the data in the window that pops up
7. Click the “Export to Google Docs…” button
8. Voilà!

Notes On Scraper
• Only works with data in a tabular format
• Only exports to Google Docs
• Works on one page at a time
  • Suggestion: Keep the scraping window open, go to the next page, and click “Scrape” again.

#2: Using Python to Scrape Pages
• BeautifulSoup
  • A toolkit for dissecting a document and extracting what you need.
  • Automatically converts incoming documents to Unicode and outgoing documents to UTF-8.
  • Sits on top of popular Python parsers like lxml and html5lib
• Examples
  • http://www.crummy.com/software/BeautifulSoup/bs4/doc/
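
The encoding behaviour described on this slide can be seen in a few lines. The sketch below assumes the built-in "html.parser" backend and a made-up page served in ISO-8859-1; "lxml" or "html5lib" could be named instead if those packages are installed.

```python
from bs4 import BeautifulSoup

# Bytes in a legacy encoding, as they might arrive from a web server.
raw = (
    '<html><head><meta charset="iso-8859-1"></head>'
    "<body><p>Café Zoë</p></body></html>"
).encode("iso-8859-1")

# The second argument names the underlying parser BeautifulSoup sits on.
soup = BeautifulSoup(raw, "html.parser")

text = soup.p.get_text()      # incoming bytes become a Python str (Unicode)
utf8_bytes = soup.p.encode()  # outgoing markup is UTF-8 bytes by default

print(text)
print(utf8_bytes)
```

The input bytes are not valid UTF-8, yet no decoding step appears in the code; the library detects the encoding and converts it.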

Scraping LinkedIn Company Pages - Pseudocode
1. Import your libraries
2. Take a LinkedIn URL as input
3. Build an opener
4. Create the soup using BS4
5. Extract the company description and specialties
6. Clean up the rest of the data
7. Extract the website, type, founded, industry, and company size if they exist, otherwise set them to “N/A”
8. Output to CSV
9. Sleep some random number of seconds & milliseconds
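
One way the pseudocode might translate to Python 3 is sketched below. It is an illustration, not the code from the presenter's repository: the CSS selectors, the field names, the User-Agent string, and the sample HTML are all hypothetical, and the standard csv module stands in for unicodecsv. The fetch function is defined but never called, so the sketch runs offline.

```python
import csv
import io
import random
import time
import urllib.request

from bs4 import BeautifulSoup

FIELDS = ["description", "specialties", "website", "type",
          "founded", "industry", "company_size"]


def fetch(url):
    """Steps 2-3: build an opener and download the page (not called below)."""
    opener = urllib.request.build_opener()
    opener.addheaders = [("User-Agent", "Mozilla/5.0 (example scraper)")]
    with opener.open(url, timeout=30) as response:
        return response.read()


def text_or_na(soup, selector):
    """Return the cleaned text of the first match, or "N/A" if absent."""
    node = soup.select_one(selector)
    return node.get_text(" ", strip=True) if node else "N/A"


def extract_company(html):
    """Steps 4-7: make the soup and pull out each field."""
    soup = BeautifulSoup(html, "html.parser")
    # Every selector is hypothetical; inspect the real page to find yours.
    return {field: text_or_na(soup, "." + field.replace("_", "-"))
            for field in FIELDS}


def polite_sleep(low=1.0, high=5.0):
    """Step 9: pause a random, fractional number of seconds."""
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay


# Inline stand-in for a downloaded page.
SAMPLE = """
<div class="description">We make widgets.</div>
<div class="specialties">Widgets, Gadgets</div>
<span class="website">http://example.com</span>
<span class="founded">1999</span>
"""

record = extract_company(SAMPLE)

out = io.StringIO()  # Step 8: replace with an open file to write to disk
writer = csv.DictWriter(out, fieldnames=FIELDS)
writer.writeheader()
writer.writerow(record)
print(out.getvalue())
```

In a loop over several URLs, polite_sleep() would be called between requests so they do not arrive at a fixed, machine-like rhythm.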

Get The Code
• https://github.com/rdempsey/dwdc

Contacting Rob
• robertonrails@gmail.com
• Twitter: rdempsey
• LinkedIn: robertwdempsey
