Web Scraping
Chapter 1
Harvesting
Chapter 2
Retribution
Legal standpoint
• If robots.txt prohibits scraping, it's illegal

• If the terms of service prohibit scraping, it's illegal

• If you're abusing the servers, it's illegal

• If you're using the data without crediting the source, it's illegal
Ethical standpoint
• Be reasonable with timeouts and threads

• Let the website know you're a bot through the user agent

• Agree on the most suitable time for parsing

• Be reasonable with scope
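A minimal sketch of the first two points, assuming Node.js 18+ for the built-in `fetch`. The bot name, contact URL, and one-second delay are hypothetical values chosen for illustration, and `politeFetchAll` is not a function from any library.

```javascript
// Hypothetical identity; replace with your own bot name and contact page.
const USER_AGENT = 'example-scraper-bot/1.0 (+https://example.com/bot-info)';
const MIN_DELAY_MS = 1000; // assumed pause between requests to one host

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// How long to wait so that two requests are at least minDelayMs apart.
function waitTime(lastRequestAt, now, minDelayMs = MIN_DELAY_MS) {
  return Math.max(0, lastRequestAt + minDelayMs - now);
}

// Fetch URLs one at a time (no parallel threads), pausing between requests.
async function politeFetchAll(urls, fetchImpl = fetch) {
  const pages = [];
  let lastRequestAt = -Infinity;
  for (const url of urls) {
    await sleep(waitTime(lastRequestAt, Date.now()));
    lastRequestAt = Date.now();
    const response = await fetchImpl(url, {
      headers: { 'User-Agent': USER_AGENT },
    });
    pages.push({ url, status: response.status, body: await response.text() });
  }
  return pages;
}
```

Sequential requests with a fixed pause are the simplest way to stay gentle on a server; a production crawler would likely track the delay per host.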
Please, avoid being an asshole
Chapter 3
Provenance
Fetching data
• Curl, fetch, request, etc.

• phantomjs, puppeteer
What can we do here?
• Selective crawling

• URL prediction

• Duplicate request prevention (FS / DB access is cheaper than network)

• Smart scheduling
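Duplicate request prevention can be sketched as a wrapper around whatever fetcher you use. This is an illustration under assumptions: the in-memory `Map` stands in for the file system or DB mentioned above, and `cacheKey` and `createDedupedFetcher` are hypothetical names.

```javascript
// Normalize a URL so trivially different spellings map to the same cache key.
function cacheKey(rawUrl) {
  const url = new URL(rawUrl);
  url.hash = '';           // fragments never reach the server
  url.searchParams.sort(); // ?a=1&b=2 and ?b=2&a=1 are the same request
  return url.toString();
}

// Wrap any fetcher so each distinct URL is requested at most once.
function createDedupedFetcher(fetcher, store = new Map()) {
  return function fetchOnce(rawUrl) {
    const key = cacheKey(rawUrl);
    if (!store.has(key)) {
      // Store the promise itself, so concurrent callers share one request.
      store.set(key, Promise.resolve(fetcher(key)));
    }
    return store.get(key);
  };
}

// Usage with a stub fetcher that only counts how often it is called.
let calls = 0;
const fetchOnce = createDedupedFetcher(async (url) => {
  calls += 1;
  return `body of ${url}`;
});
fetchOnce('https://example.com/list?b=2&a=1#top');
fetchOnce('https://example.com/list?a=1&b=2');
console.log(calls); // 1
```

Note that a rejected promise stays cached in this sketch; a real implementation would probably evict failed entries so they can be retried.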
Chapter 4
Parse
HTML
• Pure RegExp is a mistake in the long run

• https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

• Building an AST is the default approach (see parse5, himalaya)
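A small illustration of why a hand-written pattern tends to break on perfectly legal markup. The markup and the pattern are made up for this example; the point is only that a `>` inside a quoted attribute is valid HTML but defeats the usual "anything except `>`" trick.

```javascript
// A legal anchor whose title attribute happens to contain ">".
const html = '<a title="a > b" href="/x">link</a>';

// Naive pattern: "<a", anything that is not ">", then the href value.
const naive = /<a[^>]*href="([^"]*)"/;

// [^>]* stops at the ">" inside the title, so href is never reached.
const match = html.match(naive);
console.log(match); // null, although the link is valid HTML
```

A real parser tokenizes attributes first, so this class of bug does not appear when you work on the tree instead of the raw string.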
Walking through AST
• Cheerio

• jsdom

• x-ray

• Traverse the AST manually
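Manual traversal can be sketched as a depth-first walk. The tree below is written by hand in an assumed JSON shape (`type`, `tagName`, `attributes`, `children`, `content`), loosely modeled on what JSON-emitting parsers produce; real parsers use their own field names, so treat every property name here as an assumption.

```javascript
// Hand-written tree in the assumed shape; no parser is involved here.
const tree = [
  { type: 'element', tagName: 'ul', attributes: [], children: [
    { type: 'element', tagName: 'li', attributes: [], children: [
      { type: 'element', tagName: 'a',
        attributes: [{ key: 'href', value: '/first' }],
        children: [{ type: 'text', content: 'First' }] },
    ] },
    { type: 'element', tagName: 'li', attributes: [], children: [
      { type: 'element', tagName: 'a',
        attributes: [{ key: 'href', value: '/second' }],
        children: [{ type: 'text', content: 'Second' }] },
    ] },
  ] },
];

// Depth-first walk that calls visit() for every node in document order.
function walk(nodes, visit) {
  for (const node of nodes) {
    visit(node);
    if (node.children) walk(node.children, visit);
  }
}

// Collect the href of every <a> element in the tree.
function collectLinks(nodes) {
  const links = [];
  walk(nodes, (node) => {
    if (node.type !== 'element' || node.tagName !== 'a') return;
    const href = node.attributes.find((attr) => attr.key === 'href');
    if (href) links.push(href.value);
  });
  return links;
}

console.log(collectLinks(tree)); // [ '/first', '/second' ]
```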
Tips #1
• Write a set of useful helpers/wrappers upfront

• Keep the parsers granular and reusable

• Spend time making them fault tolerant

• Always verify that each parsed block is correct

• Write tests for target markup

• Keep logs
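One possible shape for such helpers, as a sketch: a tiny single-purpose parser plus a wrapper that verifies the result, logs failures, and never throws. `parseCount` and `safeParse` are hypothetical names invented for this example.

```javascript
// Granular, reusable helper: extract the first integer from a text block.
function parseCount(text) {
  const match = /\d+/.exec(String(text ?? ''));
  return match ? Number(match[0]) : null;
}

// Wrapper that keeps one broken block from stopping the whole run.
function safeParse(name, parser, input, log = console.warn) {
  try {
    const value = parser(input);
    if (value === null || value === undefined) {
      // Verification step: an empty result is logged, not silently accepted.
      log(`[${name}] nothing parsed from: ${JSON.stringify(input)}`);
      return { ok: false, value: null };
    }
    return { ok: true, value };
  } catch (error) {
    log(`[${name}] parser threw: ${error.message}`);
    return { ok: false, value: null };
  }
}

// Usage: collect log lines in an array instead of printing them.
const logs = [];
const good = safeParse('reviews', parseCount, '42 reviews', (m) => logs.push(m));
const bad = safeParse('reviews', parseCount, 'no reviews yet', (m) => logs.push(m));
```

Because each helper is a plain function of its input, writing tests against saved samples of the target markup stays cheap.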
Tips #2
• Keep a reference to the parsed data easily accessible

• Continuously flush parsing results to storage

• Be reasonable. RAM is cheap, time is expensive

• Store image hash sums and get rid of duplicates

• Retain the data even if you don't know how to use it now

• The file system is fast, but a DB makes "online" updates cheaper
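The image hash sum idea can be sketched with Node's built-in `crypto` module. The `{ url, bytes }` record shape and the function names are assumptions for this example, and SHA-256 is just one reasonable choice of digest.

```javascript
const { createHash } = require('crypto');

// Hex SHA-256 digest of the raw image bytes.
function hashImage(bytes) {
  return createHash('sha256').update(bytes).digest('hex');
}

// Keep the first image seen for each digest; drop byte-identical repeats.
function dedupeImages(images) {
  const byHash = new Map();
  for (const image of images) {
    const hash = hashImage(image.bytes);
    if (!byHash.has(hash)) byHash.set(hash, { ...image, hash });
  }
  return [...byHash.values()];
}

// Usage with stand-in byte buffers instead of real image files.
const unique = dedupeImages([
  { url: '/img/a.jpg', bytes: Buffer.from('same bytes') },
  { url: '/img/a-copy.jpg', bytes: Buffer.from('same bytes') },
  { url: '/img/b.jpg', bytes: Buffer.from('other bytes') },
]);
console.log(unique.map((image) => image.url)); // [ '/img/a.jpg', '/img/b.jpg' ]
```

A cryptographic hash only catches byte-identical files; resized or re-encoded copies of the same picture would need a different technique.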
Dynamic content
• API

• User emulation (puppeteer)
Chapter 5
Normalize
The problem
• Data taken from multiple sources

• Data which was initially dirty

• Content submitted by customers

• Complex data which can be simplified
Steps
• Trim, lowercase

• Remove noise symbols with regular expressions

• Identify and remove noise data

• Mark some dataset as reference and go with string similarity algorithms

• Machine learning classification algorithms
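The first three steps can be sketched as two small functions. The noise pattern (everything except letters, digits, and whitespace) and the noise word list are assumptions; what counts as noise depends entirely on your data.

```javascript
// Steps 1 and 2: trim, lowercase, strip noise symbols.
function normalize(raw) {
  return raw
    .trim()
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, ' ') // anything but letters, digits, spaces
    .replace(/\s+/g, ' ')              // collapse runs of whitespace
    .trim();
}

// Hypothetical noise data: tokens that carry no meaning for matching.
const NOISE_WORDS = new Set(['inc', 'ltd', 'llc']);

// Step 3: drop known noise tokens from an already normalized string.
function removeNoiseWords(text, noiseWords = NOISE_WORDS) {
  return text
    .split(' ')
    .filter((word) => word && !noiseWords.has(word))
    .join(' ');
}

console.log(normalize('  Coca-Cola®  Zero!! '));      // coca cola zero
console.log(removeNoiseWords(normalize('ACME Inc.'))); // acme
```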
String similarity algorithms
• Levenshtein distance

• Sørensen–Dice coefficient (string-similarity)

• Hamming distance (fuzzyset.js)

• Longest Common Substring distance
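For reference, three of these measures can be written in a few lines each. These are plain textbook versions, not the code of the libraries named above, and the Dice variant here compares character bigrams, which is one common choice.

```javascript
// Levenshtein distance: minimum number of single-character edits.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i += 1) {
    const curr = [i];
    for (let j = 1; j <= b.length; j += 1) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Multiset of adjacent character pairs, e.g. "night" -> ni, ig, gh, ht.
function bigrams(text) {
  const counts = new Map();
  for (let i = 0; i < text.length - 1; i += 1) {
    const pair = text.slice(i, i + 2);
    counts.set(pair, (counts.get(pair) || 0) + 1);
  }
  return counts;
}

// Sørensen–Dice coefficient on bigrams: 2 * |shared| / (|A| + |B|), in [0, 1].
function dice(a, b) {
  const total = Math.max(0, a.length - 1) + Math.max(0, b.length - 1);
  if (total === 0) return a === b ? 1 : 0;
  const left = bigrams(a);
  let shared = 0;
  for (const [pair, count] of bigrams(b)) {
    shared += Math.min(count, left.get(pair) || 0);
  }
  return (2 * shared) / total;
}

// Hamming distance: differing positions; defined for equal lengths only.
function hamming(a, b) {
  if (a.length !== b.length) throw new RangeError('lengths must match');
  let distance = 0;
  for (let i = 0; i < a.length; i += 1) {
    if (a[i] !== b[i]) distance += 1;
  }
  return distance;
}

console.log(levenshtein('kitten', 'sitting')); // 3
console.log(dice('night', 'nacht'));           // 0.25
console.log(hamming('karolin', 'kathrin'));    // 3
```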
Tips #3
• String proximity calculation is an expensive operation. Split it up.

• Shortening strings dramatically improves performance

• Identify the common differences and handle them with conditions upfront

• Think of file formats and DB normalization

• Go for mutability while working with big data structures (in-memory calculations)
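One way to read "split it" is blocking: group the reference strings by a cheap key and compare an input only against its own group. This is a sketch under assumptions. The first-character key is a hypothetical choice that misses typos in the first letter, and `similarity` is any scoring function you supply.

```javascript
// Hypothetical blocking key: the first character of the string.
function blockKey(text) {
  return text.slice(0, 1);
}

// Group reference strings by key once, before any comparison happens.
function buildIndex(references, keyOf = blockKey) {
  const index = new Map();
  for (const reference of references) {
    const key = keyOf(reference);
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(reference);
  }
  return index;
}

// Score the input only against candidates that share its block.
function bestMatch(input, index, similarity, keyOf = blockKey) {
  let best = null;
  for (const candidate of index.get(keyOf(input)) || []) {
    // Cheap exact check upfront, before the expensive scoring call.
    if (candidate === input) return { candidate, score: 1 };
    const score = similarity(input, candidate);
    if (!best || score > best.score) best = { candidate, score };
  }
  return best;
}

// Usage with a stub scorer that counts how often it is called.
let comparisons = 0;
const index = buildIndex(['apple', 'apricot', 'banana', 'blueberry', 'cherry']);
const found = bestMatch('aple', index, (input, candidate) => {
  comparisons += 1;
  return candidate === 'apple' ? 0.9 : 0.1;
});
console.log(found.candidate, comparisons); // apple 2
```

With five references the saving is trivial, but the number of expensive comparisons now grows with the block size rather than with the whole reference set.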
Tips #4
• Allow the garbage collector to reclaim data that isn't used anymore (in-memory calculations)

• Go for transducers (Avoid x.filter().map().map().filter())

• Use schedulers

• Be creative
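A minimal transducer sketch, written from scratch rather than taken from a library. The helper names (`map`, `filter`, `compose`, `append`) are local to this example. The chained version allocates an intermediate array per step, while the transduced version passes each item through all steps in a single `reduce`.

```javascript
// A transducer takes a reducer and returns a new reducer.
const map = (fn) => (reducer) => (acc, item) => reducer(acc, fn(item));
const filter = (pred) => (reducer) => (acc, item) =>
  (pred(item) ? reducer(acc, item) : acc);

// Right-to-left function composition; data still flows left to right
// through the listed steps, because each step wraps the next reducer.
const compose = (...fns) => (arg) => fns.reduceRight((value, fn) => fn(value), arg);

// Final reducer: mutate one output array instead of copying it.
const append = (acc, item) => {
  acc.push(item);
  return acc;
};

const isOdd = (n) => n % 2 === 1;
const double = (n) => n * 2;
const increment = (n) => n + 1;
const overFive = (n) => n > 5;

const numbers = [1, 2, 3, 4, 5];

// Chained version: three intermediate arrays are created and thrown away.
const chained = numbers.filter(isOdd).map(double).map(increment).filter(overFive);

// Transduced version: one pass, no intermediate arrays.
const xform = compose(filter(isOdd), map(double), map(increment), filter(overFive));
const result = numbers.reduce(xform(append), []);

console.log(result); // [ 7, 11 ]
```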
References
• Pictures are taken from unsplash.com

• Good article regarding transducers https://medium.com/@roman01la/understanding-transducers-in-javascript-3500d3bd9624

• Libraries:

• https://github.com/cheeriojs/cheerio

• https://github.com/GoogleChrome/puppeteer

• https://github.com/matthewmueller/x-ray

• https://github.com/request/request-promise

• https://github.com/jsdom/jsdom

• https://github.com/inikulin/parse5

• https://github.com/aceakash/string-similarity

• https://glench.github.io/fuzzyset.js/

• https://www.npmjs.com/package/node-schedule
Thank you!
Questions?
Oleksandr Tryshchenko

@tryshchenko github / twitter

tryshchenko.com
