Focused Crawling for Structured Data

Robert Meusel, Peter Mika, and Roi Blanco

2
Markup Languages in HTML Pages
HTML pages directly embed markup languages to annotate items using different vocabularies.

HTML with Microdata annotations:
<html>
…
<body>
…
<div id="main-section" class="performance left" data-sku="M17242_580" itemscope itemtype="http://schema.org/Product">
<h1 itemprop="name"> Predator Instinct FG Fußballschuh
</h1>
<div itemscope itemtype="http://schema.org/Offer" itemprop="offers">
<meta itemprop="priceCurrency" content="EUR">
<span itemprop="price" data-sale-price="219.95">219,95</span>
…
</body>
</html>

Extracted statements:
1. _:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> .
2. _:node1 <http://schema.org/Product/name> "Predator Instinct FG Fußballschuh"@de .
3. _:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Offer> .
4. _:node1 <http://schema.org/Offer/price> "219,95"@de .
5. _:node1 <http://schema.org/Offer/priceCurrency> "EUR" .
6. …
Meusel, Mika, Blanco: Focused Crawling for Structured Data @ CIKM 2014, Shanghai
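
To make the annotation example concrete, here is a minimal sketch that lists the Microdata types and properties found in an HTML string. It uses only Python's standard `html.parser`; the class name `MicrodataProbe`, the function `probe_microdata`, and the shortened sample markup are assumptions for illustration, not the extraction pipeline used by the authors.

```python
from html.parser import HTMLParser


class MicrodataProbe(HTMLParser):
    """Collects itemtype and itemprop attribute values from an HTML string."""

    def __init__(self):
        super().__init__()
        self.item_types = []
        self.item_props = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; valueless attributes such
        # as itemscope arrive with value None and are skipped here.
        for name, value in attrs:
            if name == "itemtype" and value:
                self.item_types.append(value)
            elif name == "itemprop" and value:
                self.item_props.append(value)


def probe_microdata(html):
    """Return (item types, item properties) in document order."""
    probe = MicrodataProbe()
    probe.feed(html)
    probe.close()
    return probe.item_types, probe.item_props


# Shortened, hypothetical version of the product markup shown on the slide.
sample = (
    '<div itemscope itemtype="http://schema.org/Product">'
    '<h1 itemprop="name">Predator Instinct FG Fußballschuh</h1>'
    '<div itemscope itemtype="http://schema.org/Offer" itemprop="offers">'
    '<meta itemprop="priceCurrency" content="EUR">'
    '<span itemprop="price">219,95</span>'
    '</div></div>'
)
types, props = probe_microdata(sample)
print(types)
print(props)
```

A probe like this only detects that annotations are present; turning them into statements such as the ones listed above requires a full Microdata parser.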

3
Deployment of Markup Languages
14% of all sites use markup languages to annotate their data (as of 2013) [Meusel2014]
• Broad topical variety, from articles and products to recipes [Bizer2013]
• Multiple strong drivers pushing the deployment
• Search engine companies' Schema.org initiative
• Open Graph Protocol used by Facebook

4
Motivation
• Existing datasets/crawls do not focus on structured data
• Common Crawl Foundation uses PageRank and Breadth-First Search
• Datasets such as the WebDataCommons corpus, which are extracted from these crawls, are likely to miss large amounts of data [Meusel2014]
• Structured information
• Hundreds of millions of pages
• Up-to-date information
• Publicly available

5
Main Idea
• Adapting the idea of focused crawling
• Similarities:
• Evaluation of content based on an objective function
• Differences:
• Typically focused by topic, not by the quality or amount of data collected
• Because of that, typically no direct feedback about crawled pages is available
Possibility to incorporate the feedback directly into our system to improve the classification of newly discovered URLs.

6
Online Learning for Focused Crawling
• Capability to incorporate real-time feedback
• Improves performance
• Adapts to concept drifts
• Possible features
• URL-based features; mainly tokens from the URL string itself
• Features describing information from the parent(s) of the URL
• Features describing information from the siblings of the URL
• Free open-source software available (e.g. Massive Online Analysis Library by Bifet et al.)
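
The idea of learning from URL tokens one example at a time can be sketched as follows. This is a minimal illustration, assuming a plain online logistic regression over binary token features; the slides do not say which learner was used, and `url_tokens`, `OnlineLogisticRegression`, and the example URLs are hypothetical. Parent and sibling features are left out to keep the sketch short.

```python
import math
import re
from urllib.parse import urlsplit


def url_tokens(url):
    """Split host, path, and query of a URL into lowercase alphanumeric tokens."""
    parts = urlsplit(url)
    text = " ".join([parts.netloc, parts.path, parts.query])
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


class OnlineLogisticRegression:
    """Logistic regression over binary token features, updated per example."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.weights = {}
        self.bias = 0.0

    def predict_proba(self, tokens):
        z = self.bias + sum(self.weights.get(t, 0.0) for t in set(tokens))
        z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, tokens, label):
        """One gradient step on a single (tokens, label) pair; label is 0 or 1."""
        error = label - self.predict_proba(tokens)
        self.bias += self.learning_rate * error
        for t in set(tokens):
            self.weights[t] = self.weights.get(t, 0.0) + self.learning_rate * error


# Hypothetical feedback stream: label 1 means the crawled page contained markup.
feedback = [
    ("http://shop.example.com/product/123", 1),
    ("http://blog.example.org/post/hello", 0),
]
model = OnlineLogisticRegression()
for _ in range(50):
    for url, label in feedback:
        model.update(url_tokens(url), label)

p_shop = model.predict_proba(url_tokens("http://shop.example.com/product/999"))
p_blog = model.predict_proba(url_tokens("http://blog.example.org/post/other"))
print(p_shop > p_blog)
```

Because each update touches only the tokens of one URL, the model can be refreshed after every fetched page, which is what allows it to follow concept drift.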

7
Exploration vs. Exploitation
Selecting the page with the highest confidence of supporting our objective might not always be the best choice
• Decision/classification is based on gathered knowledge
• Knowledge can be incomplete
• Crawled too few pages
• Knowledge can become invalid
• Reaching a part of the Web with different behavior

8
Bandit-Based Selection
• Bin each URL to the host it belongs to
• Each host represents one bandit
• Calculate the expected score for each
bandit based on a scoring function
• Select the degree of randomness λ
• λ between 0 and 1
• For each turn draw a random number z
• z > λ: select the bandit with highest score
• else: select a random bandit
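
The selection rule above can be sketched in a few lines. The function name `select_host`, the dictionary of host scores, and the example hosts are assumptions for illustration; the rule itself (draw z, exploit when z > λ, otherwise pick a random bandit) follows the slide.

```python
import random


def select_host(host_scores, lam, rng=random):
    """Pick a host: exploit the best score if z > lam, otherwise explore at random.

    host_scores maps host name -> expected score; lam is the degree of
    randomness between 0 and 1.
    """
    if not host_scores:
        raise ValueError("no hosts available")
    z = rng.random()  # uniform draw from [0, 1)
    if z > lam:
        return max(host_scores, key=host_scores.get)
    return rng.choice(list(host_scores))


# Hypothetical scores for three hosts.
scores = {"a.example": 0.2, "b.example": 0.9, "c.example": 0.5}
rng = random.Random(0)
print(select_host(scores, 0.0, rng))  # lam = 0: practically always the best host
print(select_host(scores, 1.0, rng))  # lam = 1: always a random host
```

With λ = 0 the crawler relies purely on the scoring function; larger values of λ spend more fetches on hosts that currently look unpromising.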

9
Scoring Functions
Incorporate knowledge in score calculation for bandit/host:
• Best Score (Pure classification-based selection)
• Negative Absolute Bad
• Success Rate
• Absolute Good · Best Score
• Success Rate · Best Score
• Thompson Sampling
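
The slide names the scoring functions without defining them, so the following is only one plausible reading, with every definition an assumption: "good" and "bad" count the crawled pages of a host that did or did not meet the objective, "best score" is the highest classifier confidence among the host's queued URLs, and Thompson sampling draws from a Beta posterior with a uniform prior. The `HostStats` container is hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class HostStats:
    good: int = 0            # crawled pages of this host that met the objective
    bad: int = 0             # crawled pages of this host that did not
    best_score: float = 0.0  # highest classifier confidence among queued URLs


def best_score(h):
    return h.best_score


def negative_absolute_bad(h):
    return -h.bad


def success_rate(h):
    total = h.good + h.bad
    return h.good / total if total else 0.0  # 0.0 for unseen hosts is an assumption


def absolute_good_times_best_score(h):
    return h.good * h.best_score


def success_rate_times_best_score(h):
    return success_rate(h) * h.best_score


def thompson_sample(h, rng=random):
    # Draw a plausible success rate from Beta(good + 1, bad + 1).
    return rng.betavariate(h.good + 1, h.bad + 1)


host = HostStats(good=3, bad=1, best_score=0.8)
print(success_rate(host))
print(negative_absolute_bad(host))
```

Each function maps a host's statistics to a single number, so any of them can serve as the expected score of a bandit in the selection step.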

10
System Workflow
[Diagram: components Online Classifier, Bandits, Crawler, URL Parser, and Semantic Parser; edge labels Seeds, URLs, Classified URL, URL, HTML Page, and Feedback]
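
A toy version of the feedback loop can illustrate how the components interact. This sketch makes several simplifying assumptions: the Web is a small in-memory dictionary (`TOY_WEB`, hypothetical), a dictionary lookup stands in for the crawler and the semantic parser, the online classifier is omitted, and hosts are scored by a smoothed success rate only.

```python
import random
from collections import defaultdict, deque
from urllib.parse import urlsplit

# Hypothetical toy web: URL -> (has_markup, outgoing links).
TOY_WEB = {
    "http://a.example/": (True, ["http://a.example/p1", "http://b.example/"]),
    "http://a.example/p1": (True, ["http://a.example/p2"]),
    "http://a.example/p2": (True, []),
    "http://b.example/": (False, ["http://b.example/x"]),
    "http://b.example/x": (False, []),
}


def crawl(seeds, web, lam=0.1, budget=10, seed=0):
    """Return the crawl order as a list of (url, has_markup) pairs."""
    rng = random.Random(seed)
    frontier = defaultdict(deque)        # host -> queued URLs (one bandit per host)
    stats = defaultdict(lambda: [0, 0])  # host -> [good, bad] feedback counts
    seen = set()

    def enqueue(url):
        if url in web and url not in seen:
            seen.add(url)
            frontier[urlsplit(url).netloc].append(url)

    def success_rate(host):
        good, bad = stats[host]
        return (good + 1) / (good + bad + 2)  # smoothing is an assumption

    for url in seeds:
        enqueue(url)

    crawled = []
    while len(crawled) < budget:
        hosts = [h for h, queue in frontier.items() if queue]
        if not hosts:
            break
        if rng.random() > lam:
            host = max(hosts, key=success_rate)  # exploit
        else:
            host = rng.choice(hosts)             # explore
        url = frontier[host].popleft()
        has_markup, links = web[url]             # stands in for fetch + semantic parser
        stats[host][0 if has_markup else 1] += 1  # feedback to the bandit
        crawled.append((url, has_markup))
        for link in links:                       # stands in for the URL parser
            enqueue(link)
    return crawled


order = crawl(["http://a.example/"], TOY_WEB, lam=0.0)
for url, has_markup in order:
    print(url, has_markup)
```

With λ = 0 the loop exhausts the host that keeps returning pages with markup before it turns to the other host, which is the behavior the feedback is meant to produce.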

11
Setup for Experiments
• Data originates from the Common Crawl Corpus 2012
• including over 3.5 billion HTML pages
• Extracted a subset of 5.5 million linked pages
• Including 450k different hosts
• Identified all pages within the subset containing at least one
markup language (using the WebDataCommons corpus)
• 27.5% of all pages

12
Experiment Description
Measure: Number of relevant pages retrieved within the first 1
million pages crawled.
1. Online vs. batch-based classification with 100K, 250K, and 1M
pages
2. Pure online classification vs. enhanced with bandit-based
selection (λ=0)
3. Improvements with different λ
4. Improvements with decaying λ

13
Results: Online vs. Offline
• Both methods outperform Breadth-First Search (BFS)
• Static approach: 340K
• Adaptive approach: 539K
[Plot; axes: percentage of relevant pages, fetched web pages]

14
Results: Pure Online Classification vs. +Bandit-based
• Success-rate-based scoring functions show the most promising results
• Negative absolute bad scoring performs like BFS
• Success rate function: 628K
• Pure online classification: 539K
[Plot; x-axis: fetched web pages]

15
Results: λ > 0
• Including randomness does not seem to have an overall effect
• A beneficial effect of λ > 0 is visible, e.g., for the success rate function within the first 400K crawled pages
[Plot; x-axis: fetched web pages]

16
Results: Decaying λ
Decaying λ over time means reducing the randomness as more pages are crawled.
• Success rate function with decaying λ = 0.5: 673K
• Static λ: 628K
[Plot; x-axis: fetched web pages]
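
The slides do not state how λ is decayed. As a purely hypothetical illustration, the sketch below halves λ after every `half_life` crawled pages; both the exponential form and the default half-life are assumptions, not values from the talk.

```python
def decayed_lambda(initial_lambda, pages_crawled, half_life=100_000):
    """Hypothetical schedule: lambda halves every `half_life` crawled pages."""
    return initial_lambda * 0.5 ** (pages_crawled / half_life)


for pages in (0, 100_000, 200_000, 1_000_000):
    print(pages, decayed_lambda(0.5, pages))
```

Any monotonically decreasing schedule has the same qualitative effect: early fetches explore more hosts, while later fetches rely increasingly on the knowledge gathered so far.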

17
Adaptation to a More Specific Objective
• General objective is narrowed down to:
• Pages that use the markup language Microdata and
• include at least five marked-up statements
• Example:
1. A page including information about a movie
2. The movie has the name Se7en
3. with a rating of 8.7 out of 10
4. and it was released in 1995
5. This information is maintained by imdb.com
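
The narrowed objective can be expressed as a simple predicate over the statements extracted from a page. In this sketch the tuple layout `(syntax, subject, predicate, object)`, the function name `meets_specific_objective`, and the predicate strings are assumptions chosen to mirror the movie example above.

```python
def meets_specific_objective(statements, min_statements=5):
    """True if a page carries at least `min_statements` Microdata statements.

    statements: list of (syntax, subject, predicate, object) tuples
    extracted from one page.
    """
    microdata = [s for s in statements if s[0] == "microdata"]
    return len(microdata) >= min_statements


# Hypothetical statements mirroring the five-item movie example.
movie_page = [
    ("microdata", "_:movie", "type", "Movie"),
    ("microdata", "_:movie", "name", "Se7en"),
    ("microdata", "_:movie", "rating", "8.7"),
    ("microdata", "_:movie", "released", "1995"),
    ("microdata", "_:movie", "maintainedBy", "imdb.com"),
]
print(meets_specific_objective(movie_page))
print(meets_specific_objective(movie_page[:4]))
```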

18
Results: Adaptation to a More Specific Objective
• 3.5% of pages include such information
• In general: Observation of beneficial effects using our approach
• Static λ = 0.2: 120K
• Decaying λ = 0.5: 108K
[Plot; x-axis: fetched web pages]

19
Conclusion
• Improvement of 26% over the pure online classification-based selection strategy for the general objective
• Improvement of 66% for the more specific objective
• Success-rate-based scoring functions show the most promising results for both objectives

20
Open Challenges
• Extend the approach so that results from one bandit can be exploited by the other bandits (contextual bandits)
• Introduce a more fine-grained grading of the crawled pages (multi-class problem)
• Take into account the quality of the gathered information (besides richness)
• Adapt the process to traditional topical focused crawling
• Publish code and data for the community

21
More Information
• Paper accepted at ACM International Conference on
Information and Knowledge Management in Shanghai, China
• ACM Digital Library: Focused Crawling for Structured Data
• Detailed Descriptions and Source Code:
• Anthelion Webpage
• Datasets:
• Common Crawl Foundation Corpora
• WebDataCommons Corpora
