AWS Summit Berlin 2012 Talk on Web Data Commons

Large-Scale Analysis of Web Pages
− on a Startup Budget?
Hannes Mühleisen, Web-Based Systems Group

AWS Summit 2012 | Berlin

Our Starting Point

• Websites now embed structured data in HTML

• Various Vocabularies possible
  • schema.org, Open Graph protocol, ...

• Various Encoding Formats possible
  • μFormats, RDFa, Microdata

Question: How are Vocabularies and Formats used?
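To make "structured data embedded in HTML" concrete, here is a minimal sketch that counts Microdata `itemtype` values and RDFa `typeof` values in a single page using only the Python standard library. The class name `ItemTypeCounter`, the helper `count_types`, and the sample page are illustrative assumptions; this is not the extraction pipeline used for the talk, and it ignores Microformats entirely.

```python
from collections import Counter
from html.parser import HTMLParser


class ItemTypeCounter(HTMLParser):
    """Counts Microdata `itemtype` and RDFa `typeof` values in one page."""

    def __init__(self):
        super().__init__()
        self.microdata_types = Counter()
        self.rdfa_types = Counter()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Microdata: a typed item carries `itemscope` plus `itemtype`.
        if "itemscope" in attributes and attributes.get("itemtype"):
            for item_type in attributes["itemtype"].split():
                self.microdata_types[item_type] += 1
        # RDFa: a typed resource carries `typeof`.
        if attributes.get("typeof"):
            for rdfa_type in attributes["typeof"].split():
                self.rdfa_types[rdfa_type] += 1


# Hypothetical sample page, written for this example only.
SAMPLE_PAGE = """
<html><body>
  <div itemscope itemtype="http://schema.org/Product">
    <span itemprop="name">Example Widget</span>
    <div itemprop="review" itemscope itemtype="http://schema.org/Review">
      <span itemprop="reviewBody">Works as described.</span>
    </div>
  </div>
  <div vocab="http://schema.org/" typeof="Person">
    <span property="name">Jane Doe</span>
  </div>
</body></html>
"""


def count_types(html):
    """Return (microdata_counter, rdfa_counter) for one HTML document."""
    parser = ItemTypeCounter()
    parser.feed(html)
    parser.close()
    return parser.microdata_types, parser.rdfa_types


if __name__ == "__main__":
    microdata, rdfa = count_types(SAMPLE_PAGE)
    print("Microdata types:", dict(microdata))
    print("RDFa types:", dict(rdfa))
```

Summing such counters over every page of a crawl is one simple way to approach the question of which vocabularies and formats are actually in use.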

Web Indices

• To answer our question, we need access to raw Web data

• However, maintaining Web indices is insanely expensive
  • Re-Crawling, Storage, currently ~50 B pages (Google)

• Google and Bing have indices, but do not let outsiders in

Common Crawl

• Non-Profit Organization

• Runs crawler and provides HTML dumps

• Available data:
  • Index 02-12: 1.7 B URLs (21 TB)
  • Index 09/12: 2.8 B URLs (29 TB)

• Available on AWS Public Data Sets

Why AWS?

• Now that we have a web crawl, how do we run our analysis?

• Unpacking and DOM-Parsing on 50 TB? (CPU-heavy!)

• Preliminary analysis: 1 GB / hour / CPU possible
  • 8-CPU Desktop: 8 months
  • 64-CPU Server: 1 month
  • 100 8-CPU EC2-Instances: ~ 3 days
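The runtime estimates follow directly from the measured rate of 1 GB per hour per CPU. The sketch below reproduces the arithmetic; it assumes 1 TB = 1000 GB and perfectly linear scaling across CPUs, both simplifications that the slides do not state. The function name `wall_clock_hours` is made up for this example.

```python
DATASET_GB = 50_000        # 50 TB, assuming 1 TB = 1000 GB
GB_PER_CPU_HOUR = 1.0      # preliminary measurement quoted on the slide


def wall_clock_hours(cpus, dataset_gb=DATASET_GB, rate=GB_PER_CPU_HOUR):
    """Hours needed if the work splits evenly across `cpus` CPUs."""
    return dataset_gb / (rate * cpus)


SETUPS = {
    "8-CPU desktop": 8,
    "64-CPU server": 64,
    "100 x 8-CPU EC2 instances": 100 * 8,
}

if __name__ == "__main__":
    for name, cpus in SETUPS.items():
        hours = wall_clock_hours(cpus)
        print(f"{name}: {hours:.0f} h, about {hours / 24:.1f} days")
```

Under these assumptions the three setups need roughly 260 days, 33 days and 2.6 days, which matches the rounded figures of 8 months, 1 month and about 3 days.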

[Figure: Common Crawl dataset size compared with the amount of data that can be processed in 1 h by 1 CPU, a 1000 € PC, a 5000 € server, and 17 € worth of EC2 instances]

AWS Setup

• Data Input: Read Index Splits from S3

• Job Coordination: SQS Message Queue

• Workers: 100 EC2 Spot Instances (c1.xlarge, ~0.17 € / h)

• Result Output: Write to S3

• Logging: SimpleDB (SDB)

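• Each input file queued in SQS

• EC2 Workers take tasks from SQS

• Workers read and write S3 buckets

[Diagram: an SQS queue of tasks (42, ...) feeds the EC2 workers; in S3, the workers read input files 42, 43, ... from the CC bucket and write results R42, R43, ... to the WDC bucket]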
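The queue-and-workers pattern described above can be sketched locally without any AWS calls. In this simulation a `queue.Queue` stands in for SQS, two dictionaries stand in for the input (CC) and output (WDC) S3 buckets, a list stands in for the SDB log, and threads stand in for EC2 instances. All names, including the placeholder `extract` function, are assumptions made for illustration; none of this is the code used in the talk.

```python
import queue
import threading


def extract(raw_page):
    """Hypothetical stand-in for the CPU-heavy unpack-and-parse step."""
    return raw_page.upper()


def worker(tasks, input_bucket, output_bucket, log):
    """Take file keys from the queue until it is empty."""
    while True:
        try:
            key = tasks.get_nowait()
        except queue.Empty:
            return  # no work left, the worker shuts down
        try:
            # Read the input file, process it, write the result as R<key>.
            output_bucket["R" + key] = extract(input_bucket[key])
            log.append((key, "ok"))
        except Exception as error:  # log the failure and keep going
            log.append((key, f"failed: {error}"))
        finally:
            tasks.task_done()


def run(input_bucket, worker_count=4):
    """Queue every input file, then let the workers drain the queue."""
    tasks = queue.Queue()
    for key in input_bucket:
        tasks.put(key)
    output_bucket, log = {}, []
    threads = [
        threading.Thread(target=worker,
                         args=(tasks, input_bucket, output_bucket, log))
        for _ in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return output_bucket, log


if __name__ == "__main__":
    results, log = run({"42": "<html>a</html>", "43": "<html>b</html>"})
    print(sorted(results))
    print(sorted(log))
```

One difference from this simulation matters in practice: a real SQS message is not removed when it is received. The worker has to delete it explicitly after finishing, otherwise it becomes visible again once its visibility timeout expires, which is also what lets another worker pick up a task whose instance was terminated.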

Results - Types of Data

[Figure: Entity Count (log scale) by Type, for RDFa and Microdata in 2009/2010 and 02/2012]

2012 Microdata Breakdown:
  • Website Structure 23 %
  • Products, Reviews 19 %
  • Movies, Music, ... 15 %
  • Geodata 8 %
  • People, Organizations 7 %

• Available data largely determined by major player support

• “If Google consumes it, we will publish it”

Results - Formats

[Figure: Percentage of URLs by Format (RDFa, Microdata, geo, hcalendar, hcard, hreview, XFN), 2009/2010 vs. 02/2012]

• URLs with embedded Data: +6%

• Microdata +14% (schema.org?)

• RDFa +26% (Facebook?)

Results - Extracted Data

• Extracted data available for download at
  • www.webdatacommons.org

• Formats: RDF (~90 GB) and CSV Tables for Microformats (!)

• Have a look!

AWS Costs

• Ca. 5500 Machine-Hours were required

• 1100 € billed by AWS for that

• Cost for other services negligible *

* At first, we underestimated SDB cost
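As a quick check on these figures, the sketch below derives the effective price per machine-hour from the bill and compares it with the approximate spot price of 0.17 € per hour quoted for c1.xlarge in the AWS setup. The variable names are illustrative, and the slides do not break down where the difference comes from.

```python
MACHINE_HOURS = 5500               # "Ca. 5500 Machine-Hours"
BILLED_EUR = 1100                  # amount billed by AWS
SPOT_PRICE_EUR_PER_HOUR = 0.17     # approximate c1.xlarge spot price

# Effective cost of one machine-hour according to the bill.
effective_rate = BILLED_EUR / MACHINE_HOURS

# What the same hours would cost at the quoted spot price alone.
spot_only_estimate = MACHINE_HOURS * SPOT_PRICE_EUR_PER_HOUR

if __name__ == "__main__":
    print(f"Effective rate: {effective_rate:.2f} EUR per machine-hour")
    print(f"Spot-price-only estimate: {spot_only_estimate:.0f} EUR")
```

The bill works out to about 0.20 € per machine-hour, against roughly 935 € if every hour had cost exactly the quoted spot price.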

Takeaways

• Web Data Commons now publishes the largest available set of structured data from Web pages

• Large-Scale Web Analysis now possible with Common Crawl datasets

• AWS great for massive ad-hoc computing power and complexity reduction

• Choose your architecture wisely and test by experiment; for us, EMR was too expensive

Thank You!
Questions?
Want to hire me?

Web Resources: http://webdatacommons.org
http://hannes.muehleisen.org
