Crawl the entire web
in 10 minutes...
...and just 100 €
Using AWS EMR, AWS S3, Pig, and Common Crawl
Copyright ©: 2015 OnPage.org GmbH
About Me
In Munich since 2011
Work at OnPage.org
Interested in web crawling and BigData frameworks
Build low-cost, scalable BigData solutions
Twitter: @danny_munich
Facebook: https://www.facebook.com/danny.linden2
E-mail: danny@onpage.org
Do you want to build your own Search Engine?
- High hardware / cloud costs
- Nutch needs ~1 hour for 1 million URLs
- You want to crawl > 1 billion URLs
Solution?
Don't Crawl!
- Use Common Crawl: https://commoncrawl.org
- A non-profit organization
- Roughly monthly crawls of over 2 billion URLs each
- Over 1,000 TB in total since 2009
- URL seeding list from Blekko: https://blekko.com
Don't Crawl! – Use Common Crawl!
- Scalably stored on Amazon AWS S3
- Hadoop-compatible format powered by Archive.org (Wayback Machine)
- Partitionable via S3 object prefixes (see the sketch below)
- 100 MB to 1 GB file sizes (gzip) -> a good fit for Hadoop
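Because the corpus lives under date-based S3 key prefixes, you can point a job at a slice of the crawl simply by narrowing the path. A minimal Pig sketch, reusing the crawl-002 bucket layout and ArcLoader from the full example later in this deck:

-- One day of crawl-002; widen the prefix (e.g. .../2010/09/) to select a whole month.
-- (Assumes the Common Crawl loader jar is REGISTERed, as in the full example below.)
pages = LOAD 's3://aws-publicdatasets/common-crawl/crawl-002/2010/09/25/0/*.arc.gz'
        USING org.commoncrawl.pig.ArcLoader()
        AS (url, html);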
Nice Data Format
Format 1: WARC
Stores the raw crawl data.
Format 2: WAT
Stores only the meta-information as JSON.
Format 3: WET
Stores only the plain-text content.
Choose the right format
- WARC (raw HTML): 1,000 MB
- WAT (metadata as JSON): 450 MB
- WET (plain text): 150 MB (see the sketch below)
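If you only need the text, WET is by far the cheapest format to process. A rough word-count sketch using Pig's built-in TextLoader, naively treating the gzipped WET files as lines of text (real WET files interleave WARC record headers with the content, so a production job would filter those lines out; the bucket path is a placeholder):

-- Naive word count over WET plain text (gzip is decompressed transparently).
lines   = LOAD 's3://example-bucket/wet/*.warc.wet.gz' USING TextLoader() AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO 's3://example-bucket/wordcount' USING PigStorage('\t');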
Processing
- Pure Hadoop with MapReduce
- Input classes: http://commoncrawl.org/the-data/get-started/
Processing
- High-level ETL layer like Pig: http://pig.apache.org
- Examples:
- https://github.com/norvigaward/warcexamples
- https://github.com/mortardata/mortar-examples
- https://github.com/matpalm/common-crawl
PIG Example
-- Register the piggybank UDF jar and define the Common Crawl ARC loader.
REGISTER file:/home/hadoop/lib/pig/piggybank.jar
DEFINE FileLoaderClass org.commoncrawl.pig.ArcLoader();

-- Default input: a few ARC segments; the commented-out prefix selects a whole month.
%default INPUT_PATH "s3://aws-publicdatasets/common-crawl/crawl-002/2010/09/25/0/1285398*.arc.gz";
-- %default INPUT_PATH "s3://aws-publicdatasets/common-crawl/crawl-002/2010/09/";
%default OUTPUT_PATH "s3://example-bucket/out";

-- Load (url, html) tuples from the raw crawl data.
pages = LOAD '$INPUT_PATH'
        USING FileLoaderClass
        AS (url, html);

-- Extract each page's <title> and keep only pages that have one.
meta_titles = FOREACH pages GENERATE url, REGEX_EXTRACT(html, '<title>(.*)</title>', 1) AS meta_title;
filtered = FILTER meta_titles BY meta_title IS NOT NULL;

-- Write url/title pairs, tab-separated.
STORE filtered INTO '$OUTPUT_PATH' USING PigStorage('\t');
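The %default paths can be overridden at submit time with -param, so the same script runs on a tiny sample or a whole monthly crawl. For a quick local sanity check (assuming a local Pig installation and that the script is saved as titles.pig, a hypothetical name):

pig -x local -param INPUT_PATH=/tmp/sample.arc.gz -param OUTPUT_PATH=/tmp/out titles.pig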
Hadoop & PIG on AWS
- Supports new Hadoop releases
- Pig integration
- Replaces HDFS with S3
- Easy UI to get started quickly
- Pay per hour and scale as much as possible
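As a rough sketch of what launching such a job looks like with the AWS CLI; the cluster sizing, release label, and bucket names here are illustrative placeholders, not the setup from the talk:

aws emr create-cluster --name "cc-titles" --applications Name=Pig \
  --release-label emr-4.2.0 --instance-type m3.xlarge --instance-count 5 \
  --use-default-roles --auto-terminate \
  --steps Type=PIG,Name="extract-titles",ActionOnFailure=TERMINATE_CLUSTER,Args=[-f,s3://example-bucket/titles.pig]

With --auto-terminate the cluster shuts down once the step finishes, so you only pay for the hours the job actually runs.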
It's Demo Time!
Let's cross our fingers now
That's it!
Contact:
Twitter: @danny_munich
Facebook: https://www.facebook.com/danny.linden2
E-mail: danny@onpage.org
And: We are hiring!
https://de.onpage.org/about/jobs/

Editor's Notes

  • #5: Replace the screenshot + hard to read