Using the whole web as your dataset
Common Crawl for Start-ups
July 20, 2015
Data Science Summit & Dato Conference
It's a non-profit that makes web data freely accessible to anyone
Each crawl archive is billions of pages:
May crawl archive is 2.05 billion web pages
~159 terabytes uncompressed
Released totally free
(lives on Amazon Public Data Sets)
without additional intellectual property restrictions
Origins of Common Crawl
Common Crawl founded in 2007 by Gil Elbaz (Applied Semantics / Factual)
Google and Microsoft were the powerhouses
Data powers the algorithms in our field
Goal: Democratize and simplify access to "the web as a dataset"
Benefits for start-ups
Performing your own large-scale crawling is expensive and challenging
Innovation occurs by using the data, rarely through novel collection methods
+ Lower the barrier to entry for new start-ups
+ Create a communal pool of knowledge & data
Common Crawl File Formats
WARC (as downloaded)
+ Raw HTTP response headers
+ Raw HTTP responses
WAT (metadata)
+ HTML head data
+ HTTP header fields
+ Extracted links / script tags
WET (only text)
+ Extracted text
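As a rough illustration of how these files can be read, the sketch below walks the records of an uncompressed WARC-style stream using only the Python standard library. The function name `iter_warc_records` and the sample records are made up for this example; real archives are gzip-compressed and are usually read with a dedicated WARC library.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC/WET stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # The payload is exactly Content-Length bytes long.
        payload = stream.read(int(headers.get("Content-Length", 0)))
        yield headers, payload


# Two hand-written records standing in for a WET file (hypothetical data).
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: conversion\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: 11\r\n"
    b"\r\n"
    b"Hello world"
    b"\r\n\r\n"
    b"WARC/1.0\r\n"
    b"WARC-Type: conversion\r\n"
    b"WARC-Target-URI: http://example.org/\r\n"
    b"Content-Length: 3\r\n"
    b"\r\n"
    b"Bye"
    b"\r\n\r\n"
)

records = list(iter_warc_records(io.BytesIO(sample)))
for headers, payload in records:
    print(headers["WARC-Target-URI"], len(payload))
```

The same loop should work on a compressed file opened with `gzip.open(path, "rb")`, since Python's gzip module reads concatenated gzip members as one stream.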
Web Data at Scale
Analytics
+ Usage of servers, libraries, and metadata
Machine Learning
+ Language models based upon billions of tokens
Filtering and Aggregation
+ Analyzing tables, Wikipedia, phone numbers
Analytics at Scale
Imagine you are interested in ...
+ JavaScript library usage
+ HTML / HTML5 usage
+ Web server types and age
+ RDFa, Microdata, and Microformat Data Sets
With Common Crawl you can analyse billions of pages in an afternoon!
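To make the "web server types" question concrete, here is a small sketch that tallies the `Server` header across WAT-style metadata records. It assumes each record is a JSON object with the header fields nested under `Envelope` > `Payload-Metadata` > `HTTP-Response-Metadata` > `Headers`; the helper names and the sample records are invented for illustration.

```python
import json
from collections import Counter


def make_wat(server=None):
    """Build a tiny WAT-like JSON record (hypothetical sample data)."""
    headers = {"Server": server} if server else {}
    return json.dumps(
        {"Envelope": {"Payload-Metadata": {"HTTP-Response-Metadata": {"Headers": headers}}}}
    )


def server_of(wat_json):
    """Return the lower-cased server product name, or None if the header is absent."""
    record = json.loads(wat_json)
    headers = (
        record.get("Envelope", {})
        .get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("Headers", {})
    )
    server = headers.get("Server")
    if not server:
        return None
    # "Apache/2.2.22 (Ubuntu)" -> "apache": keep the product, drop the version.
    return server.split("/")[0].strip().lower()


sample = [
    make_wat("nginx/1.4.6"),
    make_wat("Apache/2.2.22 (Ubuntu)"),
    make_wat("nginx"),
    make_wat(),  # a response without a Server header
]

counts = Counter(name for name in map(server_of, sample) if name)
print(counts.most_common())
```

At crawl scale the same per-record function would run inside a map step, with the counters summed in a reduce step.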
Analyzing Web Domain Vulns
Sietse T. Au and Wing Lung Ngai
WDC Hyperlink Graph
Largest freely available real world graph dataset:
3.6 billion pages, 128 billion links
http://webdatacommons.org/hyperlinkgraph/
Fast and easy analysis using Dato GraphLab on a single EC2 r3.8xlarge instance
(under 10 minutes per PageRank iteration)
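For readers who have not seen it, one PageRank iteration redistributes each page's score across its outgoing links. The sketch below is a plain-Python version on a four-node toy graph; it is not GraphLab's implementation, and the damping factor of 0.85 and the graph itself are assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=20):
    """Power-iteration PageRank over a dict of {page: [pages it links to]}."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every page starts with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        dangling = 0.0
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Pages without out-links spread their score evenly.
                dangling += rank[node]
        for node in nodes:
            new_rank[node] += damping * dangling / n
        rank = new_rank
    return rank


# Toy graph: a, b, c form a cycle; d links to a but nothing links to d.
toy_links = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
ranks = pagerank(toy_links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

Each pass touches every edge once, which is why the per-iteration time is the natural unit for reporting performance on a 128 billion link graph.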
Web Data at Scale
Analytics
+ Usage of servers, libraries, and metadata
Machine Learning
+ Language models based upon billions of tokens
Filtering and Aggregation
+ Analyzing tables, Wikipedia, phone numbers
N-gram Counts & Language Models from the Common Crawl
Christian Buck, Kenneth Heafield, Bas van Ooyen
Processed all the text of Common Crawl to produce 975 billion deduplicated tokens
(similar size to the Google N-gram Dataset)
Project data was released at http://statmt.org/ngrams
+ Deduped text split by language
+ Resulting language models
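A toy version of "deduplicate, then count n-grams" might look like the sketch below. It drops exact duplicate lines and counts bigrams with whitespace tokenization; the slides do not describe the actual pipeline used by Buck et al., so the line-level deduplication and the tokenizer here are simplifying assumptions.

```python
from collections import Counter


def ngram_counts(lines, n=2):
    """Count n-grams over unique lines (exact-match, line-level deduplication)."""
    seen = set()
    counts = Counter()
    for line in lines:
        key = line.strip()
        if not key or key in seen:
            continue  # skip blanks and lines already counted
        seen.add(key)
        tokens = key.lower().split()
        # zip over n shifted copies of the token list yields each n-gram once.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


corpus = [
    "the web as a dataset",
    "the web as a dataset",  # duplicate line, ignored
    "the web is big",
]

bigrams = ngram_counts(corpus, n=2)
print(bigrams.most_common(3))
```

Without the `seen` set, boilerplate repeated across millions of pages would dominate the counts, which is the motivation for deduplicating before counting.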
GloVe: Global Vectors for Word Representation
Jeffrey Pennington, Richard Socher, Christopher D. Manning
Word vector representations:
king - queen = man - woman
king - man + woman = queen
(produces dimensions of meaning)
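The analogy arithmetic can be tried on hand-made vectors. In the sketch below the two dimensions are labelled "royalty" and "gender" purely for illustration; the numbers are invented and are not GloVe output, and real embeddings only satisfy such relations approximately, which is why the answer is picked by cosine similarity.

```python
import math

# Hypothetical 2-D vectors: (royalty, gender). Not real GloVe vectors.
vectors = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.0),
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def analogy(a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    # The three query words are excluded, as is usual in analogy evaluation.
    candidates = (word for word in vectors if word not in (a, b, c))
    return max(candidates, key=lambda word: cosine(vectors[word], target))


answer = analogy("king", "man", "woman")
print(answer)
```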
GloVe On Various Corpora
Semantic: "Athens is to Greece as Berlin is to _?"
Syntactic: "Dance is to dancing as fly is to _?"
GloVe over Big Data
GloVe and word2vec (competing algorithm) can scale to hundreds of billions of tokens
Trained on the Common Crawl n-gram data: http://statmt.org/ngrams
Source code and pre-trained models at http://www-nlp.stanford.edu/projects/glove/
Web-Scale Parallel Text
Dirt Cheap Web-Scale Parallel Text from the Common Crawl (Smith et al.)
Processed all text from URLs of the style: website.com/[langcode]/
[w.com/en/tesla | w.com/fr/tesla]
"...nothing more than a set of common two-letter language codes ... [we] mined 32 terabytes ... in just under a day"
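The URL matching idea can be sketched in a few lines: replace a two-letter language code in the path with a placeholder and group URLs that collapse to the same key. The code list, the `*` placeholder, and the function names below are assumptions for this example and are not taken from Smith et al.

```python
import re
from collections import defaultdict

# A small, hypothetical set of language codes; a real run would use many more.
LANG_CODES = {"en", "fr", "de", "es"}

# A two-letter path segment: preceded by "/" and followed by "/" or end of URL.
SEGMENT = re.compile(r"/([a-z]{2})(?=/|$)")


def split_language(url):
    """Return (key, code) where key is the URL with its language code masked."""
    for match in SEGMENT.finditer(url):
        code = match.group(1)
        if code in LANG_CODES:
            key = url[: match.start(1)] + "*" + url[match.end(1):]
            return key, code
    return None


def candidate_pairs(urls):
    """Group URLs that differ only in their language code segment."""
    groups = defaultdict(dict)
    for url in urls:
        parsed = split_language(url)
        if parsed:
            key, code = parsed
            groups[key][code] = url
    # Keep only keys seen in at least two languages.
    return {key: langs for key, langs in groups.items() if len(langs) > 1}


urls = ["w.com/en/tesla", "w.com/fr/tesla", "w.com/en/edison", "w.com/about"]
pairs = candidate_pairs(urls)
print(pairs)
```

Matching URLs only yields candidate page pairs; the page contents still have to be aligned and checked before they count as parallel text.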
Web-Scale Parallel Text
Manual inspection across three languages:
80% of the data contained good translations
(source = foreign language, target = English)
Web Data at Scale
Analytics
+ Usage of servers, libraries, and metadata
Machine Learning
+ Language models based upon billions of tokens
Filtering and Aggregation
+ Analyzing tables, Wikipedia, phone numbers
Web Data Commons Web Tables
Extracted 11.2 billion tables from WARC files, filtered to keep relational tables via trained classifier
Only 1.3% of the original data was kept, yet it still remains hugely valuable
Resulting dataset:
11.2 billion tables => 147 million relational web tables
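The extract-then-filter shape of this pipeline can be sketched with the standard library's HTML parser. Note the swap: Web Data Commons used a trained classifier, whereas the filter below is a crude hand-written heuristic (at least two rows, at least two columns, same width in every row) standing in for it. Class and function names are invented for the example.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect each <table> as a list of rows of cell text (nested tables not handled)."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._rows = None
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._rows.append(self._row)
            self._row = None
            self._cell = None
        elif tag == "table" and self._rows is not None:
            self.tables.append(self._rows)
            self._rows = None


def looks_relational(rows):
    """Heuristic stand-in for the trained classifier mentioned on the slide."""
    if len(rows) < 2 or len(rows[0]) < 2:
        return False
    return all(len(row) == len(rows[0]) for row in rows)


html = (
    "<table><tr><td>page layout</td></tr></table>"
    "<table><tr><th>name</th><th>country</th></tr>"
    "<tr><td>Tesla</td><td>Serbia</td></tr></table>"
)

extractor = TableExtractor()
extractor.feed(html)
relational = [table for table in extractor.tables if looks_relational(table)]
print(len(extractor.tables), "tables found,", len(relational), "kept")
```

Most tables on the web exist for page layout, which is consistent with only about 1.3% surviving the filter (147 million out of 11.2 billion).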
Web Data Commons Web Tables
Popular column headers: name, title, artist, location, model, manufacturer, country ...
Released at webdatacommons.org/webtables/
Extracting US Phone Numbers
"Let's use Common Crawl to help match businesses from Yelp's database to the possible web pages for those businesses on the Internet."
Yelp extracted ~748 million US phone numbers from the Common Crawl December 2014 dataset
Regular expression over extracted text (WET files)
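A regular expression of this kind might look like the sketch below. It is not Yelp's pattern (their full code is in the blog post referenced on the next slide); the pattern, the normalization to ten digits, and the sample text are assumptions made for illustration.

```python
import re

# Optional +1 country code, 3-digit area code (not starting with 0 or 1),
# then 3 and 4 digits, with optional space, dot, or dash separators.
US_PHONE = re.compile(
    r"(?<!\d)(?:\+?1[\s.-]?)?\(?([2-9]\d{2})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})(?!\d)"
)


def extract_phone_numbers(text):
    """Return phone numbers found in text, normalized to 10 digits."""
    return ["".join(match.groups()) for match in US_PHONE.finditer(text)]


page_text = "Call (415) 555-0134 or 212.555.0187 today. Order id 1234567890123."
numbers = extract_phone_numbers(page_text)
print(numbers)
```

The lookarounds on both ends keep the pattern from firing inside longer digit runs such as order ids, which matters when the input is arbitrary web text.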
Extracting US Phone Numbers
Total complexity: 134 lines of Python
Total time: 1 hour (20 × c3.8xlarge)
Total cost: $10.60 USD (Python using EMR)
Matched against Yelp's database:
48% had exact URL matches
61% had matching domains
More details (and full code) on Yelp's blog post:
Analyzing the Web For the Price of a Sandwich
Common Crawl's Derived Datasets
Natural language processing:
+ Parallel text for machine translation
+ N-gram & language models (975 bln tokens)
+ WDC Web tables (3.5 bln)
Large scale web analysis:
+ WDC Hyperlink Graphs (128 bln edges)
+ WikiReverse.org - Wikipedia in-links analysis
and a million more use cases!
Why am I so excited..?
Open data is catching on!
Even playing field for academia and start-ups
Google Web 1T => Buck et al.'s N-grams
Google's Wikilinks => WikiReverse
Google's Sets => WDC Web Tables
Common Crawl releases their dataset and brilliant people build on top of it
Read more at commoncrawl.org
Stephen Merity
stephen@commoncrawl.org

