Focused Crawling for Structured Data
Robert Meusel, Peter Mika, and Roi Blanco
Markup Languages in HTML Pages
HTML pages directly embed markup languages to annotate items using different vocabularies

<html>
…
<body>
…
<div id="main-section" class="performance left" data-sku="M17242_580" itemscope itemtype="http://schema.org/Product">
  <h1 itemprop="name"> Predator Instinct FG Fußballschuh </h1>
  <div>
    <div itemscope itemtype="http://schema.org/Offer" itemprop="offers">
      <meta itemprop="priceCurrency" content="EUR">
      <span itemprop="price" data-sale-price="219.95">219,95</span>
…
</body>
</html>

1. _:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> .
2. _:node1 <http://schema.org/Product/name> "Predator Instinct FG Fußballschuh"@de .
3. _:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Offer> .
4. _:node1 <http://schema.org/Offer/price> "219,95"@de .
5. _:node1 <http://schema.org/Offer/priceCurrency> "EUR" .
6. …
Meusel, Mika, Blanco: Focused Crawling for Structured Data @ CIKM 2014, Shanghai
Deployment of Markup Languages 
14% of all sites use markup languages to annotate their data (as of 2013) [Meusel2014]
• Broad topical variety, from articles and products to recipes [Bizer2013]
• Multiple strong drivers pushing the deployment
  • Search engine companies' initiative on Schema.org
  • Open Graph Protocol used by Facebook
Motivation 
• Existing datasets/crawls do not focus on structured data
  • The Common Crawl Foundation uses PageRank and Breadth-First Search
  • Datasets extracted from these corpora, such as the WebDataCommons corpus, are likely to miss large amounts of data [Meusel2014]
• Structured information
• Hundreds of millions of pages
• Up-to-date information
• Publicly available
Main Idea 
• Adapting the idea of focused crawling
• Similarities:
  • Evaluation of content based on an objective function
• Differences:
  • Traditional focused crawling is typically focused by topic, not by quality/amount of data collected
  • Because of that, typically no direct feedback about crawled pages is available
Possibility to incorporate the feedback directly into our system to improve classification of newly discovered URLs.
Online Learning for Focused Crawling 
• Capability to incorporate real-time feedback
  • Improves performance
  • Adapts to concept drift
• Possible features
  • URL-based features; mainly tokens from the URL string itself
  • Features describing information from the parent(s) of the URL
  • Features describing information from the siblings of the URL
• Free open-source software available (e.g. Massive Online Analysis Library by Bifet et al.)
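The feedback loop described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the classifier used in the talk: the tokenization, the perceptron-style update, and the names `url_tokens` and `OnlinePerceptron` are all hypothetical, and only URL-based features are shown (no parent or sibling features).

```python
import re
from collections import defaultdict


def url_tokens(url):
    """Split a URL into lowercase alphanumeric tokens (assumed tokenization)."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


class OnlinePerceptron:
    """Tiny online linear classifier over URL tokens (illustrative only)."""

    def __init__(self):
        self.weights = defaultdict(float)

    def score(self, url):
        # Sum of token weights; a positive sum means "likely contains markup".
        return sum(self.weights[t] for t in set(url_tokens(url)))

    def predict(self, url):
        return self.score(url) > 0

    def update(self, url, has_markup):
        # Feedback step: adjust weights only when the prediction was wrong.
        if self.predict(url) != has_markup:
            delta = 1.0 if has_markup else -1.0
            for t in set(url_tokens(url)):
                self.weights[t] += delta


clf = OnlinePerceptron()
# Hypothetical feedback from two already crawled pages.
clf.update("http://shop.example.com/product/123", True)
clf.update("http://blog.example.com/about", False)
print(clf.predict("http://shop.example.com/product/999"))
```

Each crawled page provides one labelled example, so the model can be updated immediately after the page is parsed instead of being retrained in batches.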
Exploration vs. Exploitation 
Selecting the page with the highest confidence for supporting our objective might not always be the best choice
• Decision/classification is based on gathered knowledge
• Knowledge can be incomplete
  • Too few pages crawled
• Knowledge can become invalid
  • Reaching a part of the Web with different behavior
Bandit-Based Selection 
• Assign each URL to the host it belongs to
• Each host represents one bandit
• Calculate the expected score for each bandit based on a scoring function
• Select the degree of randomness λ
  • λ between 0 and 1
• For each turn draw a random number z
  • z > λ: select the bandit with the highest score
  • else: select a random bandit
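The selection rule above can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the function names `host_of` and `select_host`, the `rng` parameter, and the example hosts are hypothetical, and z is assumed to be drawn uniformly from [0, 1).

```python
import random
from urllib.parse import urlparse


def host_of(url):
    """Return the host part of a URL; each host acts as one bandit."""
    return urlparse(url).netloc


def select_host(scores, lam, rng=random):
    """Pick a host: greedy if z > lam, otherwise uniformly at random.

    scores: dict mapping host -> expected score (from some scoring function).
    lam:    degree of randomness, between 0 and 1.
    """
    z = rng.random()
    if z > lam:
        # Exploitation: take the host with the highest expected score.
        return max(scores, key=scores.get)
    # Exploration: take any known host with equal probability.
    return rng.choice(sorted(scores))


# Hypothetical expected scores for three hosts.
scores = {"a.example.com": 0.2, "b.example.com": 0.9, "c.example.com": 0.5}
print(select_host(scores, lam=0.2))
```

With λ = 0 the rule is almost always purely greedy, and with λ = 1 every turn selects a random host.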
Scoring Functions 
Incorporate knowledge in score calculation for bandit/host: 
• Best Score (Pure classification-based selection) 
• Negative Absolute Bad 
• Success Rate 
• Absolute Good · Best Score 
• Success Rate · Best Score 
• Thompson Sampling 
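The slide only names the scoring functions, so the following Python sketch shows one plausible reading of four of them. All definitions here are assumptions for illustration: `good` and `bad` are taken to be the numbers of relevant and irrelevant pages crawled from a host so far, `best_score` the highest classifier confidence among the host's queued URLs, and the Thompson sampling variant the standard Beta-posterior form.

```python
import random


def success_rate(good, bad):
    """Share of crawled pages from a host that were relevant (assumed definition)."""
    total = good + bad
    return good / total if total else 0.0


def negative_absolute_bad(good, bad):
    """Penalise hosts by their number of irrelevant pages (assumed definition)."""
    return -bad


def success_rate_times_best(good, bad, best_score):
    """Combine host history with the classifier's best score for that host."""
    return success_rate(good, bad) * best_score


def thompson_sample(good, bad, rng=random):
    """Draw a score from a Beta(good + 1, bad + 1) posterior."""
    return rng.betavariate(good + 1, bad + 1)


# Hypothetical host with 3 relevant and 1 irrelevant page crawled so far.
print(success_rate(3, 1), success_rate_times_best(3, 1, 0.8))
```

The first three functions are deterministic given the host's history, whereas Thompson sampling adds its own randomness through the posterior draw.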
System Workflow 
[Diagram: system workflow with the components Online Classifier, Bandits, Crawler, URL Parser, and Semantic Parser, connected by flows labelled Seeds, URLs, Classified URL, URL, HTML Page, and Feedback]
Setup for Experiments 
• Data originates from the Common Crawl Corpus 2012
  • Including over 3.5 billion HTML pages
• Extracted a subset of 5.5 million linked pages
  • Including 450K different hosts
• Identified all pages within the subset containing at least one markup language (using the WebDataCommons corpus)
  • 27.5% of all pages
Experiment Description 
Measure: Number of relevant pages retrieved within the first 1 million pages crawled.
1. Online vs. batch-based classification with 100K, 250K, and 1M pages
2. Pure online classification vs. enhanced with bandit-based selection (λ = 0)
3. Improvements with different λ
4. Improvements with decaying λ
Results: Online vs. Offline 
• Both methods outperform Breadth-First Search (BFS) 
• Static (batch) approach: 340K relevant pages
• Adaptive (online) approach: 539K relevant pages
[Figure: percentage of relevant pages vs. fetched web pages]
Results: Pure Online Classification vs. +Bandit-based 
• Success-rate-based scoring functions show the most promising results
• Negative absolute bad scoring performs like BFS
• Success rate function: 628K
• Pure online classification: 539K
[Figure: percentage of relevant pages vs. fetched web pages]
Results: λ > 0 
• Including randomness does not seem to have an effect
• A beneficial effect of λ > 0 is visible, e.g. for the success rate function, within the first 400K crawled pages
[Figure: percentage of relevant pages vs. fetched web pages]
Results: Decaying λ 
Decaying λ over time means reducing the randomness as more pages are crawled.
• Success rate function with decaying λ = 0.5: 673K
• Static λ: 628K
[Figure: percentage of relevant pages vs. fetched web pages]
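The slides do not state which decay schedule was used. As one possible illustration, the following Python sketch halves λ after a fixed number of crawled pages; the function name `decayed_lambda` and the `half_life` parameter are hypothetical, and the exponential form is an assumption.

```python
def decayed_lambda(initial_lambda, pages_crawled, half_life=100_000):
    """Exponentially reduce the degree of randomness as more pages are crawled.

    half_life is a hypothetical parameter: the number of crawled pages after
    which lambda has halved. The actual schedule is not given in the slides.
    """
    return initial_lambda * 0.5 ** (pages_crawled / half_life)


# Lambda at several points of a crawl that starts with lambda = 0.5.
for pages in (0, 100_000, 200_000, 1_000_000):
    print(pages, decayed_lambda(0.5, pages))
```

A schedule of this kind explores many hosts early, when little is known about them, and relies increasingly on the gathered scores later in the crawl.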
Adaptation to a More Specific Objective
• General objective is narrowed down to pages that:
  • Make use of the markup language Microdata and
  • Include at least five marked-up statements
• Example:
  1. A page including information about a movie
  2. The movie has the name Se7en
  3. with a rating of 8.7 out of 10
  4. and it was released in 1995
  5. This information is maintained by imdb.com
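A relevance check for this narrower objective could look roughly like the Python sketch below. It rests on a simplifying assumption that is not from the slides: every `itemtype` or `itemprop` attribute is counted as one marked-up statement. The class and function names, the sample page, and the property names in it are hypothetical.

```python
from html.parser import HTMLParser


class MicrodataCounter(HTMLParser):
    """Count Microdata attributes as a rough proxy for marked-up statements."""

    def __init__(self):
        super().__init__()
        self.statements = 0

    def handle_starttag(self, tag, attrs):
        for name, _value in attrs:
            # Assumption: each itemtype or itemprop yields one statement.
            if name in ("itemtype", "itemprop"):
                self.statements += 1


def is_relevant(html, min_statements=5):
    """True if the page has at least min_statements Microdata statements."""
    counter = MicrodataCounter()
    counter.feed(html)
    return counter.statements >= min_statements


# Hypothetical page mirroring the movie example above.
page = """
<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Se7en</h1>
  <span itemprop="ratingValue">8.7</span>
  <span itemprop="datePublished">1995</span>
  <meta itemprop="publisher" content="imdb.com">
</div>
"""
print(is_relevant(page))
```

A full Microdata extractor would build the actual item graph; counting attributes is only meant to show how the five-statement threshold turns page content into a binary feedback signal.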
Results: Adaptation to a More Specific Objective
• 3.5% of pages include such information
• In general: beneficial effects observed using our approach
• Static λ = 0.2: 120K
• Decaying λ = 0.5: 108K
[Figure: percentage of relevant pages vs. fetched web pages]
Conclusion 
• Improvement by 26% in comparison to the pure online classification-based selection strategy for the general objective
• Improvement by 66% for the more specific objective
• Success-rate-based scoring functions show the most promising results for both objectives
Open Challenges 
• Expand the approach to exploit results from one bandit for the other bandits (contextual bandits)
• Introduce a more fine-grained grading of the crawled pages (multi-class problem)
• Take into account the quality of gathered information (besides richness)
• Adapt the process to traditional topical focused crawling
• Publish code and data to the community
More Information 
• Paper accepted at the ACM International Conference on Information and Knowledge Management in Shanghai, China
  • ACM Digital Library: Focused Crawling for Structured Data
• Detailed descriptions and source code:
  • Anthelion Webpage
• Datasets:
  • Common Crawl Foundation Corpora
  • WebDataCommons Corpora