Fun Learning
Web Scraping
queensjs.s02e02.mp4
Danny Garcia
@buzzedword
• Director of Engineering/Operations at
ClassPass
• Full stack engineer, DevOps day to day
• First love is JavaScript
Screen Scraping
You should know about it
Simple Scraping
Fun Learning Web Scraping - QueensJS - 9/2/2015
• Layout dependent
• Difficult to traverse
• Error prone
• RegEx not made for XML
• Doesn’t support AJAX
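To make the "layout dependent" and "error prone" points concrete, here is a minimal sketch of a regex scraper. The markup, the `scrapeLinks` name, and the pattern are illustrative assumptions, not taken from the talk's slides.

```javascript
// Naive regex scraper: pulls link targets out of raw HTML text.
// The pattern assumes href is the first attribute and is double-quoted.
function scrapeLinks(html) {
  const pattern = /<a href="([^"]+)"/g;
  return Array.from(html.matchAll(pattern), (match) => match[1]);
}

// Hypothetical markup, before and after a harmless template change.
const before = '<a href="/page/2">Next</a>';
const after = '<a class="next" href="/page/2">Next</a>';

console.log(scrapeLinks(before)); // [ '/page/2' ]
console.log(scrapeLinks(after)); // []
```

Adding one attribute in front of `href` is enough to make the scraper return nothing, and it fails silently rather than raising an error.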
Advanced Scraping
• Layout dependent
• Doesn’t support AJAX
Browser Emulation
• Layout dependent
• Slow
Screen Scraping
You can beat it
Don’t make it easy
• Are you using IDs?
• Are you HTML templating?
• How identifiable are your components?
• Randomly break your layout to stop scripts
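One possible way to make components less identifiable is to generate class names at build or deploy time instead of shipping stable IDs. The sketch below assumes a hypothetical `buildClassMap` helper and a hypothetical `renderPrice` template; neither comes from the talk.

```javascript
const crypto = require("node:crypto");

// Build a mapping from logical component names to randomized class names.
// Regenerating the map on each deploy changes every selector a scraper relies on.
function buildClassMap(names) {
  const map = {};
  for (const name of names) {
    map[name] = "c" + crypto.randomBytes(4).toString("hex");
  }
  return map;
}

// Hypothetical template: renders a price using the mapped class, with no id attribute.
function renderPrice(classMap, amount) {
  return `<span class="${classMap.price}">${amount}</span>`;
}

const classMap = buildClassMap(["price", "title"]);
console.log(renderPrice(classMap, "$20"));
```

The stylesheet would need to be generated from the same map, which is part of the development cost of this approach.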
All about AJAX
• Defeats most RegEx scrapers
• Defeats most XML parsers
• Browsers can render AJAX
• Don’t use easy to access AJAX routes
• https://queensjs.com/page/2
• use https://queensjs.com/_ajax/pagination&?q=1
Know your enemy!
• CSRF tokens for fields
• User authentication required
• Use a paywall to discourage anonymity
• Trace IP addresses
• DMCA takedown request
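As a rough illustration of the CSRF token bullet, the sketch below binds a form token to a session using an HMAC from Node's built-in `crypto` module. The `issueToken` and `verifyToken` names, the session IDs, and the in-memory secret are assumptions made for the example; a real application would more likely rely on its framework's CSRF support.

```javascript
const crypto = require("node:crypto");

// Assumption: a per-server secret; in practice it would come from configuration.
const SECRET = crypto.randomBytes(32);

// Issue a token bound to the session, to embed in a hidden form field.
function issueToken(sessionId) {
  return crypto.createHmac("sha256", SECRET).update(sessionId).digest("hex");
}

// Accept the form only if the submitted token matches the session's token.
function verifyToken(sessionId, token) {
  const expected = Buffer.from(issueToken(sessionId), "hex");
  const received = Buffer.from(String(token), "hex");
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return received.length === expected.length &&
    crypto.timingSafeEqual(received, expected);
}

const token = issueToken("session-abc");
console.log(verifyToken("session-abc", token)); // true
console.log(verifyToken("session-xyz", token)); // false
```

A script that posts directly to the form endpoint without first loading the page in the same session has no valid token to send, which raises the effort needed to scrape.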
Screen Scraping
…it’ll still happen
Why?
• Your data is amazing
• You don’t have an API
• Backdoor feature request
• Hackers!
Scraping happens
but who does it?
Is it worth it?
Costs
• Development time
• Anchoring bad features
• Never stops
• Alienating good engineers
What can you do?
• No, really, don’t use IDs.
• Layout changes do break scraping
• Integrate attack vectors as new product features
• Don’t panic
• Scraping on an individual level is a query
• Scraping in a cluster is an attack
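The line between a query and an attack could be drawn by request volume. The following is a minimal sketch that assumes requests per IP address within a time window are the signal; the `createRateTracker` name, the window, and the limit are illustrative, and a cluster spread across many addresses would need aggregation by some other key.

```javascript
// Sliding-window request counter keyed by client IP.
// windowMs and limit are illustrative values, not recommendations.
function createRateTracker(windowMs, limit) {
  const hits = new Map(); // ip -> array of request timestamps (ms)

  return function record(ip, now = Date.now()) {
    // Keep only the timestamps that still fall inside the window.
    const recent = (hits.get(ip) || []).filter((t) => now - t < windowMs);
    recent.push(now);
    hits.set(ip, recent);
    return recent.length > limit ? "attack" : "query";
  };
}

const record = createRateTracker(1000, 3);
console.log(record("203.0.113.7", 0)); // query
console.log(record("203.0.113.7", 100)); // query
console.log(record("203.0.113.7", 200)); // query
console.log(record("203.0.113.7", 300)); // attack
console.log(record("203.0.113.7", 5000)); // query
```

Timestamps are passed explicitly here so the behaviour is easy to follow; in a server the default `Date.now()` would be used.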
Thank you!
Danny Garcia
@buzzedword
ClassPass
https://classpass.com/jobs


Editor's Notes

  • #11: What’s different here? The previous example recreates the DOM from an XMLHttpRequest response and parses it using JSDOM. This example uses X-ray, which uses Request by default but can be backed by a PhantomJS driver. That driver spawns a headless browser, then drives it remotely. This is SLOW.
  • #26: Site crawling
  • #27: Well meaning individuals
  • #28: Hackers!