Using the whole web as your dataset

Common Crawl for Start-ups
July 20, 2015
Data Science Summit & Dato Conference

It's a non-profit that makes web data
freely accessible to anyone

Each crawl archive is billions of pages:
May crawl archive is 2.05 billion web pages
~159 terabytes uncompressed

Released (lives on Amazon Public Data Sets)
totally free, without additional
intellectual property restrictions

Origins of Common Crawl
Common Crawl founded in 2007
by Gil Elbaz (Applied Semantics / Factual)
Google and Microsoft were the powerhouses
Data powers the algorithms in our field
Goal: Democratize and simplify access to
"the web as a dataset"

Benefits for start-ups
Performing your own large-scale crawling is
expensive and challenging
Innovation occurs by using the data,
rarely through novel collection methods
+ Lower the barrier of entry for new start-ups
+ Create a communal pool of knowledge & data

Common Crawl File Formats
WARC (as downloaded)
+ Raw HTTP response headers
+ Raw HTTP responses
WAT (metadata)
+ HTML head data
+ HTTP header fields
+ Extracted links / script tags
WET (only text)
+ Extracted text
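
All three formats share the WARC record layout: a version line, named header fields, a blank line, then a payload whose size is given by Content-Length. The sketch below parses one uncompressed record using only the standard library. The sample record is hand-made for illustration (hypothetical URI and text), and the function name `parse_warc_record` is an assumption, not part of any Common Crawl tooling. Real archives are gzip-compressed and hold many records, so a dedicated WARC library is usually the better choice in practice.

```python
def parse_warc_record(raw: bytes):
    """Split one uncompressed WARC record into (version, headers, payload)."""
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]  # e.g. "WARC/1.0"
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only, so URLs in values stay intact
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    length = int(headers["Content-Length"])
    return version, headers, rest[:length]


# Hand-made WET-style record (hypothetical content, for illustration only)
body = "Hello from the web".encode("utf-8")
sample = b"".join([
    b"WARC/1.0\r\n",
    b"WARC-Type: conversion\r\n",
    b"WARC-Target-URI: http://example.com/\r\n",
    b"Content-Type: text/plain\r\n",
    b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n",
    b"\r\n",
    body,
    b"\r\n\r\n",
])

version, headers, payload = parse_warc_record(sample)
print(version, headers["WARC-Target-URI"], payload.decode("utf-8"))
```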

Web Data at Scale
Analytics
+ Usage of servers, libraries, and metadata
Machine Learning
+ Language models based upon billions of tokens
Filtering and Aggregation
+ Analyzing tables, Wikipedia, phone numbers

Analytics at Scale
Imagine you are interested in ...
+ JavaScript library usage
+ HTML / HTML5 usage
+ Web server types and age
+ RDFa, Microdata, and Microformat Data Sets
With Common Crawl you can analyse
billions of pages in an afternoon!
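
As a rough illustration of the web server analysis, the sketch below tallies the Server header across a few WAT-style JSON payloads. The nesting (Envelope, Payload-Metadata, HTTP-Response-Metadata, Headers) is an assumption about the WAT layout and should be checked against a real file. The sample payloads and the helper `make_sample` are hypothetical.

```python
import json
from collections import Counter


def server_software(wat_payload: str):
    """Return the lower-cased server product name from one WAT JSON payload, or None."""
    record = json.loads(wat_payload)
    headers = (
        record.get("Envelope", {})
        .get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("Headers", {})
    )
    server = headers.get("Server")
    if not server:
        return None
    # "Apache/2.4.10 (Debian)" -> "apache"
    return server.split("/")[0].strip().lower()


def make_sample(server):
    """Hypothetical helper: build a minimal WAT-like payload for the demo."""
    headers = {"Server": server} if server else {}
    return json.dumps(
        {"Envelope": {"Payload-Metadata": {"HTTP-Response-Metadata": {"Headers": headers}}}}
    )


samples = [make_sample(s) for s in ("nginx/1.6.2", "Apache/2.4.10 (Debian)", "nginx", None)]
server_counts = Counter(s for s in map(server_software, samples) if s)
print(server_counts.most_common())
```

At crawl scale the same per-record function would run inside a distributed job, with the counters merged at the end.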

Analyzing Web Domain Vulns
Sietse T. Au and Wing Lung Ngai

WDC Hyperlink Graph
Largest freely available real world graph dataset:
3.6 billion pages, 128 billion links
http://webdatacommons.org/hyperlinkgraph/
Fast and easy analysis using Dato GraphLab on a
single EC2 r3.8xlarge instance
(under 10 minutes per PageRank iteration)
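
To make "one PageRank iteration" concrete, here is a minimal pure-Python power iteration on a toy graph. It is a sketch of the textbook algorithm with a damping factor of 0.85 (an assumed default), not the GraphLab implementation. The four-node graph is hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping node -> list of out-links."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread evenly over all nodes
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: "c" receives links from every other page
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_graph)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 4))
```

Each pass of the outer loop touches every edge once, which is why the per-iteration time is the natural unit for quoting performance on a 128 billion edge graph.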


N-gram Counts & Language Models
from the Common Crawl
Christian Buck, Kenneth Heafield, Bas van Ooyen
Processed all the text of Common Crawl to produce
975 billion deduplicated tokens
(similar size to the Google N-gram Dataset)
Project data was released at
http://statmt.org/ngrams
Deduped text split by language
Resulting language models
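
The core idea, deduplicate the text and then count n-grams, can be sketched in a few lines. This is a simplified illustration, not the authors' pipeline: it assumes exact line-level deduplication and whitespace tokenization, and the sample sentences are made up.

```python
from collections import Counter


def ngram_counts(lines, n=2):
    """Count n-grams over whitespace tokens, skipping exact duplicate lines."""
    seen = set()
    counts = Counter()
    for line in lines:
        if line in seen:  # assumed dedup rule: drop repeated lines
            continue
        seen.add(line)
        tokens = line.split()
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts


lines = ["the web is big", "the web is big", "the web is open"]
bigrams = ngram_counts(lines)
print(bigrams.most_common(3))
```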

GloVe: Global Vectors for Word Representation
Jeffrey Pennington, Richard Socher, Christopher D. Manning
Word vector representations:
king - queen = man - woman
king - man + woman = queen
(produces dimensions of meaning)
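
The analogy arithmetic can be tried on toy vectors. The two-dimensional vectors below are hand-made for illustration (hypothetical values with one "royalty" axis and one "gender" axis), not real GloVe embeddings, which are learned from corpus statistics and have far more dimensions.

```python
import math

# Hypothetical 2-d vectors: (royalty, gender)
VECTORS = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.0),
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors=VECTORS):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = (word for word in vectors if word not in (a, b, c))
    return max(candidates, key=lambda word: cosine(vectors[word], target))


result = analogy("king", "man", "woman")
print(result)
```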

GloVe On Various Corpora
Semantic: "Athens is to Greece as Berlin is to _?"
Syntactic: "Dance is to dancing as fly is to _?"

GloVe over Big Data
GloVe and word2vec (competing algorithm) can scale
to hundreds of billions of tokens
Trained on the Common Crawl n-gram data:
http://statmt.org/ngrams
Source code and pre-trained models at
http://www-nlp.stanford.edu/projects/glove/

Web-Scale Parallel Text
Dirt Cheap Web-Scale Parallel Text from the Common
Crawl (Smith et al.)
Processed all text from URLs of the style:
website.com/[langcode]/
[w.com/en/tesla | w.com/fr/tesla]
"...nothing more than a set of common two-letter
language codes ... [we] mined 32 terabytes ... in just
under a day"
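
The URL-matching step can be sketched as follows: replace the language-code path segment with a wildcard and group URLs that collapse to the same pattern. This only illustrates candidate discovery. The language-code set is a small illustrative subset, the function names are hypothetical, and the paper's full pipeline does further alignment and filtering that is not shown here.

```python
from collections import defaultdict
from urllib.parse import urlsplit

LANG_CODES = {"en", "fr", "de", "es"}  # illustrative subset of two-letter codes


def language_key(url):
    """Return ((host, path with language segment wildcarded), language) or None."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    for i, segment in enumerate(segments):
        if segment.lower() in LANG_CODES:
            pattern = "/".join(segments[:i] + ["*"] + segments[i + 1:])
            return (parts.netloc, pattern), segment.lower()
    return None


def candidate_pairs(urls):
    """Group URLs that differ only in their language-code segment."""
    groups = defaultdict(dict)
    for url in urls:
        hit = language_key(url)
        if hit:
            key, lang = hit
            groups[key][lang] = url
    # Keep only patterns seen in at least two languages
    return {key: langs for key, langs in groups.items() if len(langs) > 1}


urls = [
    "http://w.com/en/tesla",
    "http://w.com/fr/tesla",
    "http://w.com/en/edison",
    "http://w.com/about",
]
pairs = candidate_pairs(urls)
print(pairs)
```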

Web-Scale Parallel Text
Manual inspection across three languages:
80% of the data contained good translations
(source = foreign language, target = English)

Web Data Commons Web Tables
Extracted 11.2 billion tables from WARC files,
filtered to keep relational tables via trained classifier
Only 1.3% of the original data was kept,
yet it still remains hugely valuable
Resulting dataset:
11.2 billion tables => 147 million relational web tables

Web Data Commons Web Tables
Popular column headers: name, title, artist, location,
model, manufacturer, country ...
Released at webdatacommons.org/webtables/

Extracting US Phone Numbers
"Let's use Common Crawl to help match businesses
from Yelp's database to the possible web pages for
those businesses on the Internet."
Yelp extracted ~748 million US phone numbers from
the Common Crawl December 2014 dataset
Regular expression over extracted text (WET files)
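
A regular expression of this kind might look like the sketch below. The pattern is an assumption written for illustration, not the expression from Yelp's code. It accepts common US formats with an optional leading 1 and normalizes each match to ten digits. The sample text uses fictional 555-01xx numbers.

```python
import re

PHONE_RE = re.compile(
    r"""
    (?<!\d)                # not preceded by another digit
    (?:\+?1[\s.-]?)?       # optional country code
    \(?([2-9]\d{2})\)?     # area code, optionally in parentheses
    [\s.-]?
    ([2-9]\d{2})           # exchange
    [\s.-]?
    (\d{4})                # line number
    (?!\d)                 # not followed by another digit
    """,
    re.VERBOSE,
)


def extract_phone_numbers(text):
    """Return all US-style phone numbers in text as 10-digit strings."""
    return ["".join(match.groups()) for match in PHONE_RE.finditer(text)]


sample_text = "Call (415) 555-0199 or 1-212-555-0123 today. Order ID 1234567890123."
numbers = extract_phone_numbers(sample_text)
print(numbers)
```

The digit lookarounds keep long identifiers, such as the order ID in the sample, from being read as phone numbers.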

Extracting US Phone Numbers
Total complexity: 134 lines of Python
Total time: 1 hour (20 × c3.8xlarge)
Total cost: $10.60 USD (Python using EMR)
Matched against Yelp's database:
48% had exact URL matches
61% had matching domains
More details (and full code) on Yelp's blog post:
Analyzing the Web For the Price of a Sandwich

Common Crawl's Derived Datasets
Natural language processing:
+ Parallel text for machine translation
+ N-gram & language models (975 bln tokens)
+ WDC Web tables (3.5 bln)
Large scale web analysis:
+ WDC Hyperlink Graphs (128 bln edges)
+ WikiReverse.org - Wikipedia in-links analysis
and a million more use cases!

Why am I so excited..?
Open data is catching on!
Even playing field for academia and start-ups
Google Web 1T => Buck et al.'s N-grams
Google's Wikilinks => WikiReverse
Google's Sets => WDC Web Tables
Common Crawl releases their dataset
and brilliant people build on top of it

Read more at commoncrawl.org
Stephen Merity
stephen@commoncrawl.org
