Large-Scale Analysis of Web Pages
− on a Startup Budget?
Hannes Mühleisen, Web-Based Systems Group




AWS Summit 2012 | Berlin
Our Starting Point
•   Websites now embed structured data in HTML

•   Various Vocabularies possible

    •   schema.org, Open Graph protocol, ...

•   Various Encoding Formats possible

    •   Microformats, RDFa, Microdata


Question: How are Vocabularies and Formats used?
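
To make the question concrete, here is a minimal sketch of how a page could be checked for embedded markup. It is a rough attribute heuristic and not the extractor used for the talk; the sample HTML, the class name `FormatCounter`, and the function `detect_formats` are hypothetical names introduced only for illustration.

```python
from html.parser import HTMLParser

# Hypothetical sample page; the markup is illustrative, not taken from the crawl.
SAMPLE_HTML = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Espresso Machine</span>
  <meta property="og:title" content="Espresso Machine">
</div>
"""


class FormatCounter(HTMLParser):
    """Counts start tags whose attributes hint at Microdata or RDFa markup."""

    def __init__(self):
        super().__init__()
        self.counts = {"microdata": 0, "rdfa": 0}

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        names = {name for name, _ in attrs}
        if names & {"itemscope", "itemtype", "itemprop"}:
            self.counts["microdata"] += 1
        if names & {"property", "typeof", "about"}:
            self.counts["rdfa"] += 1


def detect_formats(html):
    """Return per-format tag counts for one HTML document."""
    parser = FormatCounter()
    parser.feed(html)
    return parser.counts


print(detect_formats(SAMPLE_HTML))
```

In the sample, the `div` and `span` carry Microdata attributes with a schema.org vocabulary, while the `meta` tag uses the RDFa-style `property` attribute that the Open Graph protocol relies on.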
Web Indices

•   To answer our question, we need access to raw Web data.

•   However, maintaining Web indices is insanely expensive

    •   Re-Crawling, Storage, currently ~50 B pages (Google)

•   Google and Bing have indices, but do not let outsiders in



Common Crawl
•   Non-Profit Organization

•   Runs crawler and provides HTML dumps

•   Available data:

    •   Index 02-12: 1.7 B URLs (21 TB)

    •   Index 09/12: 2.8 B URLs (29 TB)

•   Available on AWS Public Data Sets

Why AWS?
•   Now that we have a web crawl, how do we run our analysis?

    •   Unpacking and DOM-Parsing on 50 TB? (CPU-heavy!)

•   Preliminary analysis: 1 GB / hour / CPU possible

    •   8-CPU Desktop: 8 months

    •   64-CPU Server: 1 month

    •   100 8-CPU EC2-Instances: ~ 3 days
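
The estimates above follow from simple arithmetic, sketched below. The sketch assumes 1 TB = 1000 GB and perfectly linear scaling across CPUs; the function name `wall_clock_days` is hypothetical.

```python
DATASET_GB = 50 * 1000   # 50 TB, assuming 1 TB = 1000 GB
GB_PER_CPU_HOUR = 1.0    # rate from the preliminary analysis


def wall_clock_days(total_cpus):
    """Days needed to process the dataset, assuming linear scaling."""
    hours = DATASET_GB / (GB_PER_CPU_HOUR * total_cpus)
    return hours / 24


for label, cpus in [
    ("8-CPU desktop", 8),
    ("64-CPU server", 64),
    ("100 x 8-CPU EC2 instances", 800),
]:
    print(f"{label}: {wall_clock_days(cpus):.1f} days")
```

Under these assumptions the three setups come out at roughly 260 days, 33 days, and 2.6 days, which matches the rounded figures on the slide.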

[Figure: Common Crawl Dataset Size, compared against: 1 CPU, 1 h; 1000 € PC, 1 h; 5000 € Server, 1 h; 17 € EC2 Instances, 1 h]
AWS Setup
•   Data Input: Read Index Splits from S3

•   Job Coordination: SQS Message Queue

•   Workers: 100 EC2 Spot Instances (c1.xlarge, ~0.17 € / h)

•   Result Output: Write to S3

•   Logging: SimpleDB (SDB)


•   Each input file queued in SQS

•   EC2 Workers take tasks from SQS

•   Workers read and write S3 buckets

[Diagram: SQS queue holding task 42; EC2 workers; S3 bucket CC with input files 42, 43, ...; S3 bucket WDC with result files R42, R43, ...]
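
The queue-and-worker pattern described above can be sketched without any AWS dependency. In this sketch an in-memory `queue.Queue` stands in for SQS and plain dictionaries stand in for the two S3 buckets; `extract` and `run_worker` are hypothetical names, and the extraction step is only a placeholder. It illustrates the coordination pattern, not the code that ran on EC2.

```python
import queue


def extract(page_bytes):
    # Placeholder for the CPU-heavy unpacking and DOM-parsing step.
    return page_bytes.upper()


def run_worker(tasks, input_bucket, output_bucket):
    """Drain the task queue; each task names one input file."""
    processed = 0
    while True:
        try:
            key = tasks.get_nowait()      # stand-in for receiving an SQS message
        except queue.Empty:
            return processed
        result = extract(input_bucket[key])   # stand-in for reading from S3
        output_bucket["R" + key] = result     # stand-in for writing to S3
        tasks.task_done()                 # stand-in for deleting the SQS message
        processed += 1


# Hypothetical data: two input files, named as in the diagram.
cc = {"42": b"<html>a</html>", "43": b"<html>b</html>"}
wdc = {}
tasks = queue.Queue()
for key in cc:
    tasks.put(key)

print(run_worker(tasks, cc, wdc), sorted(wdc))
```

The task is acknowledged only after the result has been written. With a real SQS queue, a message that is received but never deleted becomes visible again after its visibility timeout, so a file held by a terminated spot instance would be picked up by another worker.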
Results - Types of Data
[Figure: Entity Count (log) by Type; series: Microdata 02/2012, RDFa 02/2012, RDFa 2009/2010, Microdata 2009/2010]

2012 Microdata Breakdown

    •   Website Structure: 23 %

    •   Products, Reviews: 19 %

    •   Movies, Music, ...: 15 %

    •   Geodata: 8 %

    •   People, Organizations: 7 %

•   Available data largely determined by major player support

•   “If Google consumes it, we will publish it”
Results - Formats

•   URLs with embedded Data: +6%

•   Microdata +14% (schema.org?)

•   RDFa +26% (Facebook?)

[Figure: Percentage of URLs by Format (RDFa, Microdata, geo, hcalendar, hcard, hreview, XFN); series: 2009/2010 and 02/2012]
Results - Extracted Data

•   Extracted data available for download at

    •   www.webdatacommons.org

•   Formats: RDF (~90 GB) and CSV Tables for Microformats (!)

•   Have a look!



AWS Costs

•   About 5500 Machine-Hours were required

    •   1100 € billed by AWS for that

•   Cost for other services negligible *

•   * At first, we underestimated SDB cost
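
As a quick sanity check on these numbers, the sketch below derives the effective price per machine-hour and compares it with the approximate spot price quoted for the workers. The variable names are hypothetical, and the slides do not break down the difference between the two figures.

```python
MACHINE_HOURS = 5500     # from the slide
BILLED_EUR = 1100        # from the slide
SPOT_PRICE_EUR = 0.17    # approximate c1.xlarge spot price from the setup slide

# Effective price actually paid per machine-hour.
effective_rate = BILLED_EUR / MACHINE_HOURS

# What the same hours would cost at exactly the quoted spot price.
spot_only_estimate = MACHINE_HOURS * SPOT_PRICE_EUR

print(f"effective rate: {effective_rate:.2f} EUR per machine-hour")
print(f"estimate at quoted spot price: {spot_only_estimate:.0f} EUR")
```

The effective rate works out to about 0.20 € per machine-hour, against an estimate of about 935 € at the quoted spot price.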



Takeaways
•   Web Data Commons now publishes the largest set of
    structured data from Web pages available

•   Large-Scale Web Analysis now possible with Common Crawl
    datasets

•   AWS great for massive ad-hoc computing power and
    complexity reduction

•   Choose your architecture wisely and test by experiment; for us,
    EMR was too expensive.

Thank You!
              Questions?
            Want to hire me?


Web Resources: http://webdatacommons.org
     http://hannes.muehleisen.org


AWS Summit Berlin 2012 Talk on Web Data Commons
