scraping, scripting and hacking your way to API-less data
[AKA: if you don’t have data feeds, we’ll get it anyway]

http://www.flickr.com/photos/juan23/82888194/
overview

•   “getting data out”
•   non-exhaustive (and rapid!)
•   slightly random
•   live examples (hopefully)
•   mainly non-technical(ish)
•   mainly non-illegal. I think.
anything goes

•   have no fear!
•   feel no remorse!
•   be shameless!
•   long live the open data revolution!
you

• half newbie, half “done some”
me

• not really a developer
• ...but I code enough ASP (stop giggling)
to do what I want to do
• slides will be at slideshare.net/dmje
• www.electronicmuseum.org.uk
• mike.ellis@eduserv.org.uk
we <3 data

• we want programmatic access...
• ...but sites are often lacking
• ...and APIs are usually a pipe dream

 http://www.ucas.com/instit/i/h60.html
 http://unicorn.lib.ic.ac.uk/uhtbin/opac/webcentral
scraping

 • copy & paste, without having to copy &
 paste...
 • an inexact but really rather beautiful
 science




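' Classic ASP (VBScript): fetch the raw source of a page server-side
Dim xmlhttp, url, ReturnedXML
url = "http://www.ucas.com/instit/i/h60.html" ' the page to scrape
Set xmlhttp = Server.CreateObject("MSXML2.ServerXMLHTTP.4.0")

Call xmlhttp.Open("GET", url, False) ' False = wait for the response
Call xmlhttp.send

ReturnedXML = xmlhttp.responseText ' page source as one string
Set xmlhttp = Nothing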
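
For anyone not on ASP, the same "fetch the page source" step can be sketched in Python using only the standard library. This is a minimal sketch, not the code behind the talk: `fetch_source` is a made-up helper name, and the fallback to UTF-8 when the server does not declare a charset is an assumption.

```python
import urllib.request


def fetch_source(url, timeout=10):
    """Return the body at `url` as text (hypothetical helper, not from the slides)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Use the charset the server declares; assume UTF-8 if it declares none.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch_source("http://www.ucas.com/instit/i/h60.html")` would play the role of `ReturnedXML` in the ASP version, assuming the page is still reachable.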
scraping (cont)

• frowned on by purists...
• but really rather powerful
• http://hoard.it
extraction #1: Y!Pipes

•   find your data on page
•   view source
•   determine the delimiters
•   put it into Pipes
•   extract the output




                               originating page | output
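
The "determine the delimiters" step is the heart of this approach: once you know what text sits immediately before and after each item, extraction is just string slicing. A minimal sketch of that idea, outside Pipes, follows; the `between` helper and the sample markup are invented for illustration.

```python
def between(source, start, end):
    """Return every chunk of `source` found between a `start` and an `end` marker."""
    chunks = []
    position = 0
    while True:
        begin = source.find(start, position)
        if begin == -1:
            break  # no more start markers
        begin += len(start)
        finish = source.find(end, begin)
        if finish == -1:
            break  # unterminated item, stop here
        chunks.append(source[begin:finish].strip())
        position = finish + len(end)
    return chunks


# Made-up page source, standing in for what "view source" shows you.
page = '<ul><li class="course">History</li><li class="course">Physics</li></ul>'
courses = between(page, '<li class="course">', "</li>")
```

It is inexact in exactly the way the talk admits: if the site changes its markup, the delimiters stop matching and you get an empty list back.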
extraction #2: Google Docs

• create a new google spreadsheet
• find the URL of the data you want
• identify how it is encapsulated (list/table)
• use the importHTML() function (others for
feeds, xml, data, etc)
• dump out data as...CSV/XML/RSS/etc




                           originating page | output
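
In a spreadsheet, importHTML() takes a URL, the word "table" or "list", and a 1-based index. Roughly the same table-to-CSV step can be sketched in Python with the standard library. This is an illustration under assumptions, not how Google does it: `TableGrabber` and `table_to_csv` are invented names, the sample table is made up, and nested tables are not handled.

```python
import csv
import io
from html.parser import HTMLParser


class TableGrabber(HTMLParser):
    """Collect the cell text of every <table>, one list of rows per table."""

    def __init__(self):
        super().__init__()
        self.tables = []   # each table is a list of rows
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.tables[-1][-1].append("".join(self._cell).strip())
            self._cell = None


def table_to_csv(html, index=1):
    """Return table number `index` (1-based, like importHTML) as CSV text."""
    grabber = TableGrabber()
    grabber.feed(html)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(grabber.tables[index - 1])
    return out.getvalue()


# Made-up markup standing in for the originating page.
sample = """
<table>
  <tr><th>Course</th><th>Code</th></tr>
  <tr><td>History</td><td>V100</td></tr>
</table>
"""
result = table_to_csv(sample)
```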
extraction #3: dapper.net

• go to dapper.net/open
• identify several URLs with the same
“shape” as the pages you want to scrape
• use the dapper dashboard to identify
content areas
• build the “dapp”
• pass in the URLs of pages you want to extract
data from
• extract results from the output (xml,
flash, csv, etc)




                          originating page | output
extraction #4: YQL

•   view source on the page you want to grab
•   go to http://developer.yahoo.com/yql/console/
•   get your XPath hat on and build a query
•   grab the data from a RESTful query




      http://developer.yahoo.com/yql/console/?q=select%20*%20from%20html%20where%20url%3D%22http%3A%2F%2Fopenlibrary.org%2Fsearch%3Fq%3Dkeri%2Bhulme%22%20and%20xpath%3D%27%2F%2Fa%5B%40class%3D%22result%22%5D%27




                                   originating page | output
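
The long URL on this slide is easier to read decoded: it is a `select * from html` statement with a page URL and an XPath expression, percent-encoded into the `q` parameter. A small sketch of building it follows. It only assembles the URL and does not call Yahoo!; `yql_console_url` is a made-up helper name.

```python
from urllib.parse import quote, unquote

CONSOLE = "http://developer.yahoo.com/yql/console/"


def yql_console_url(page_url, xpath):
    """Build a YQL console URL selecting `xpath` from the HTML at `page_url`."""
    query = "select * from html where url=\"{}\" and xpath='{}'".format(page_url, xpath)
    # safe="*" leaves the asterisk unescaped, matching the URL on the slide.
    return CONSOLE + "?q=" + quote(query, safe="*")


url = yql_console_url(
    "http://openlibrary.org/search?q=keri+hulme",
    '//a[@class="result"]',
)
# Decoding the q parameter gives back the readable statement.
decoded_query = unquote(url.split("?q=", 1)[1])
```

Here the XPath `//a[@class="result"]` means "every link whose class is result", which is what you work out from view source in the first step.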
extraction #5: httrack

• grab a copy of httrack (or similar) from
  http://www.httrack.com/
• point it at the bit of the site you want,
make sure the filters are correct, and push
go...
• you now have a local copy of the site, to
munge as you see fit
extraction #6: hacked search

• get an API key from Yahoo!
• use it to search within a domain
• write a standard download script to pick
out each page and download it
• hack that mumma
• (variation on a theme: build a simple
spider...)
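
For the "simple spider" variation, the core step is pulling the links out of a page you have already downloaded and keeping only the ones on the same site. A minimal sketch of just that step follows. The names and the sample markup are invented, and a real spider would also need a queue, a visited set and some politeness about request rates.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Record the href of every <a> tag in the order it appears."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def same_site_links(html, page_url):
    """Return absolute, de-duplicated links that stay on the host of `page_url`."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    links = []
    for href in collector.hrefs:
        # Resolve relative links and drop any #fragment.
        absolute = urljoin(page_url, href).split("#", 1)[0]
        if urlparse(absolute).netloc == host and absolute not in links:
            links.append(absolute)
    return links


# Made-up page standing in for a downloaded search result or site page.
page = (
    '<a href="/about">About</a> <a href="http://example.org/x">Elsewhere</a> '
    '<a href="news.html#top">News</a> <a href="/about">About again</a>'
)
links = same_site_links(page, "http://example.com/index.html")
```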
now you’ve got your data..

• once you’ve got your data, you usually
need to munge it...
munging #1: regex!

• I’m terrible at regex
• ([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)
• but it’s incredibly powerful...




                                            output
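
The pattern on this slide looks like a UK postcode matcher (it carries the special case GIR 0AA), and the line breaks the slide adds are not part of it. A sketch of putting it to work on a block of scraped text follows; the sample sentence is made up, and the pattern is copied as-is, so it only matches upper-case postcodes.

```python
import re

# The pattern from the slide, joined back into a single expression.
POSTCODE = re.compile(
    r"([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}"
    r"[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)"
)


def find_postcodes(text):
    """Return every postcode-shaped string in `text`, in order of appearance."""
    return POSTCODE.findall(text)


sample = "Write to us at BA1 1AA, or visit the office at SW1A 2AA."
found = find_postcodes(sample)
```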
munging #2: find/replace

• use whatever scripting language you work
best with
• (even Word...)
• you’ll find that replacing double spaces,
weird characters and paragraph marks are
about the most common needs
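
Those three replacements can be wrapped in one small function. This is a sketch under assumptions: "weird characters" is taken here to mean non-breaking spaces and curly quotes, which is a guess at the usual suspects rather than anything the slides specify, and `clean_text` is a made-up name.

```python
import re


def clean_text(text):
    """Apply the most common find/replace fixes to a block of scraped text."""
    text = text.replace("\u00a0", " ")                         # non-breaking space
    text = text.replace("\u2018", "'").replace("\u2019", "'")  # curly single quotes
    text = text.replace("\u201c", '"').replace("\u201d", '"')  # curly double quotes
    text = text.replace("\r\n", "\n").replace("\r", "\n")      # normalise line ends
    text = re.sub(r"\n{2,}", "\n", text)                       # repeated paragraph marks
    text = re.sub(r"[ \t]{2,}", " ", text)                     # double (or more) spaces
    return text.strip()


messy = "open\u00a0data  is \u201cgood\u201d.\r\n\r\nmore  please "
tidy = clean_text(messy)
```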
munging #3: mail merge!

• for rapid builds of html, javascript or
xml
• have a source document (often extracted or
munged from other sites) in Excel
• you can use filters to effectively grab
the data you need
• build the merge in Word, using the
“directory” option
• copy and paste the result out
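
The same mail-merge trick works in a script: one template, one row of data per record, and the output pasted wherever you need it. A minimal sketch follows, assuming the source data has been saved from Excel as CSV; the column names, the template and the `merge` helper are all invented for illustration.

```python
import csv
import io
from html import escape
from string import Template

# One "merge field" per column, like the placeholders in a Word merge document.
ROW = Template('<li><a href="$url">$title</a></li>')


def merge(csv_text, template=ROW):
    """Fill `template` once per CSV row and join the results with newlines."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return "\n".join(
        template.substitute({key: escape(value) for key, value in row.items()})
        for row in rows
    )


source = "title,url\nHoard.it,http://hoard.it\nTidy & clean,http://tidy.sourceforge.net/\n"
merged = merge(source)
```

Escaping each value on the way in is the one thing Word will not do for you, and it matters as soon as a title contains an ampersand.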
munging #4: html removal

• have a function handy that you can pass a
block of html
• it is handy to have a script where you can
define which particular tags to remove or
leave in place
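
A tag-removal function with a "leave these in place" list can be sketched with the standard library's HTML parser. This is one possible shape for such a helper, not the one used in the talk: `strip_tags` and its `keep` parameter are made-up names, entities such as `&amp;` come out decoded, and the contents of script or style blocks are kept as text rather than dropped.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Keep text and any tags named in `keep`; drop every other tag."""

    def __init__(self, keep=()):
        super().__init__(convert_charrefs=True)
        self.keep = set(keep)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.keep:
            # Re-emit the tag exactly as it was written, attributes included.
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.keep:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html, keep=()):
    """Return `html` with all tags removed except those listed in `keep`."""
    stripper = TagStripper(keep)
    stripper.feed(html)
    stripper.close()  # flush any text still buffered by the parser
    return "".join(stripper.parts)


block = '<div class="x"><p>Some <b>bold</b> and <i>italic</i> text</p></div>'
plain = strip_tags(block)
kept = strip_tags(block, keep=("b",))
```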
munging #5: html tidy

• grab a copy of html tidy from
 http://tidy.sourceforge.net/
• tidy is available as a downloadable .exe
or a component that you can pass data to in
your code
processing #1: Open Calais

• a service from Reuters for analysing
blocks of text for semantic “meaning”
• get an API key from Open Calais
• send data via a POST to the REST service
• retrieve results from the RDF
• OR...just paste your text into
http://sws.clearforest.com/calaisviewer/




                                             output
processing #2: Yahoo! TE

• a webservice for grabbing tags/terms from
blocks of text
• sign up for a Yahoo! API key
• pass your block of text using POST
• grab the results..




                                          output
processing #3: geo!

• go to http://developer.yahoo.com/geo !
the ugly sisters

• Access
• Excel (!)
the last resorts

• FOI (frankie!)
• OCR (me)
the very last resort..

• re-type it...
• (or use Amazon Mechanical Turk)
...any more?
