Crawling for names
A search through 150 terabytes of data
Common Crawl
Large repository of archival web page data on the Internet
November 2015 crawl has more than 150 terabytes of data
(150,000,000,000,000 bytes, 1.2 billion URLs)
Can it be used to broadly gauge brand awareness and name recognition?
➔ Pros: The dataset is large and fairly complete
➔ Cons: False positives (e.g., same name, different person)
Data pipeline
[Pipeline diagram: Common Crawl data on Amazon S3 flowing through the processing pipeline]
Goals
Functional:
● Parse through the data, counting websites that mention Donald Trump, Ted
Cruz, Hillary Clinton, or Bernie Sanders
Engineering:
● Do this as quickly and efficiently as possible on the entire corpus
● Learn Scala
Challenges
● 35,700 zipped text files of modest size on Amazon S3
● Each file holds data from roughly 34,000 URIs on average
● Data from one URI (one record) spans multiple lines
Sample Common Crawl record
[Screenshot of a raw crawl record]
Parsing Common Crawl text file
[Annotated record: the header line carries the URI; the page text below it is searched for the candidate names]
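For readers without the screenshot, a crawl record has roughly this shape (the URI, date, and page text here are illustrative, not taken from the deck):

WARC/1.0
WARC-Type: conversion
WARC-Target-URI: http://example.com/politics/article.html
WARC-Date: 2015-11-20T05:00:00Z
Content-Type: text/plain
Content-Length: 1024

Extracted page text follows the blank line. This body is what gets
searched for names such as "Bernie Sanders".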
Coding challenges
1. Spark prefers to ingest files in which each record spans a single line
➔ sc.textFile(filename)
2. For multi-line records, must instead set a custom record delimiter
➔ config.set("textinputformat.record.delimiter", "WARC-Target-URI: ")
➔ val ingestMe = sc.newAPIHadoopFile(filename, classOf[TextInputFormat],
classOf[LongWritable], classOf[Text], config)
The first method allows bulk loading of files; the second is limited to one
file at a time
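Putting the second method together, a minimal self-contained sketch (the object name, the path argument, and the single-name filter are illustrative, not from the deck):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object CrawlIngest {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("crawl-ingest"))
    // Split the input on the header field that opens each record, so one
    // RDD element holds one full multi-line record
    val config = new Configuration(sc.hadoopConfiguration)
    config.set("textinputformat.record.delimiter", "WARC-Target-URI: ")
    val records = sc.newAPIHadoopFile(args(0), classOf[TextInputFormat],
        classOf[LongWritable], classOf[Text], config)
      .map { case (_, text) => text.toString }
    // Count records whose page text mentions a given name
    println(records.filter(_.contains("Bernie Sanders")).count())
    sc.stop()
  }
}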
How much time?
Original estimate: 21 days
Helpful:
✔ Eliminate debug printlns
✔ Limit "filter" and "map" functions
✔✔ Union RDDs (datasets), which triggered distributed computing

Not so helpful:
❗ Pool database calls (must be used sparingly; see the sketch below)
❌ Multiple spark-submit jobs (held promise but resource-intensive; crashed the JVM)
Revised estimate: 18-35 hours
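The deck does not show how database calls were pooled; a common Spark pattern is one connection per partition rather than one per record. A sketch under that assumption (the JDBC URL and table name are placeholders):

import java.sql.DriverManager
import org.apache.spark.rdd.RDD

// Sketch: open one database connection per partition and batch the
// inserts, instead of opening a connection for every record
def saveCounts(counts: RDD[(String, Long)]): Unit = {
  counts.foreachPartition { rows =>
    val conn = DriverManager.getConnection("jdbc:postgresql://host/db")
    val stmt = conn.prepareStatement(
      "INSERT INTO mentions (name, count) VALUES (?, ?)")
    rows.foreach { case (name, count) =>
      stmt.setString(1, name)
      stmt.setLong(2, count)
      stmt.addBatch()
    }
    stmt.executeBatch()
    conn.close()
  }
}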
Results: How the candidates stack up
Check out which candidate got the most mentions:
http://namecrawler.xyz
About me
Most recently a news reporter
Background in computer science
Avid cook and baker
What else would have helped boost speed?
● Amp up cluster computing power (see the sketch after this list)
○ Upgrade from m4.large (8 GB RAM) to r3.large (15.25 GB) or r3.xlarge (30.5 GB)
● Concatenate files prior to processing
○ Eliminates having to manually join datasets
○ Pros: Java libraries exist to do so
○ Cons: Must make room for 150 terabytes of files
● Split batch processing into multiple jobs
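On the first point, executor memory can be raised once the cluster has larger nodes. A minimal sketch using standard Spark configuration keys (the "24g" value is an illustrative setting for r3.xlarge workers, leaving headroom for the OS; in practice these are usually passed as spark-submit flags):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: request more executor memory; the value below is illustrative
// for r3.xlarge (30.5 GB) workers
val conf = new SparkConf()
  .setAppName("namecrawler")
  .set("spark.executor.memory", "24g")
val sc = new SparkContext(conf)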
Optimizations: Union data
// Grab a file off Amazon S3
val hdFile = sc.newAPIHadoopFile(fullCrawlName, classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text], localConfig)
// Hold on to the file until there are enough for a trio
hdFiles(i - 1) = hdFile
if (i % 3 == 0) { // Act only on batches of three RDDs
  // Union the trio into one RDD; the union triggered distributed computing
  val batch = hdFiles(i - 3).union(hdFiles(i - 2)).union(hdFiles(i - 1))
  // Send the three-segment RDD off for saving
  saveCrawlData(crawlFileID, batch)
  // Reset the batch counter
  i = 0
}
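This snippet is excerpted from a larger loop. An equivalent formulation groups the segment paths in threes and unions each trio in one SparkContext.union call (a sketch; crawlFileNames, loadSegment, and the first argument to saveCrawlData are hypothetical stand-ins):

// Sketch of the surrounding loop; loadSegment wraps the
// newAPIHadoopFile call shown above
def loadSegment(path: String) =
  sc.newAPIHadoopFile(path, classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text], localConfig)

// grouped(3) yields the segment paths three at a time
crawlFileNames.grouped(3).foreach { trio =>
  val batch = sc.union(trio.map(loadSegment))
  saveCrawlData(trio.head, batch)
}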
