Ms. Poonam Sinai Kenkre

 Why is web crawler required?
 How does web crawler work?
 Crawling strategies
     Breadth first search traversal
     Depth first search traversal
 Architecture of web crawler
 Crawling policies
 Distributed crawling
 The process or program used by search engines to
  download pages from the web for later processing by a
  search engine, which indexes the downloaded pages to
  provide fast searches.

 A program or automated script which browses the World
  Wide Web in a methodical, automated manner.

 Also known as web spiders and web robots.

 Less-used names: ants, bots and worms.
 What is a web crawler?
 How does web crawler work?
 Crawling strategies
     Breadth first search traversal
     Depth first search traversal
 Architecture of web crawler
 Crawling policies
 Distributed crawling
 The Internet has a wide expanse of information.
 Finding relevant information requires an efficient mechanism.
 Web crawlers provide that scope to the search engine.
 What is a web crawler?
 Why is web crawler required?
 How does web crawler work?
 Crawling strategies
     Breadth first search traversal
     Depth first search traversal
 Architecture of web crawler
 Crawling policies
 Distributed crawling
 It starts with a list of URLs to visit, called the
  seeds. As the crawler visits these URLs, it
  identifies all the hyperlinks in the page and adds
  them to the list of URLs to visit, called the crawl
  frontier.
 URLs from the frontier are recursively visited
  according to a set of policies.
New URLs can be specified here. This is Google's web crawler.
Initialize queue (Q) with initial set of known URLs.
Until Q empty or page or time limit exhausted:
 Pop URL, L, from front of Q.
 If L is not an HTML page (.gif, .jpeg, .ps, .pdf, .ppt…)
       continue loop (get next URL).
 If already visited L, continue loop (get next URL).
 Download page, P, for L.
 If cannot download P (e.g. 404 error, robot excluded)
       continue loop (get next URL).
 Index P (e.g. add to inverted index or store cached copy).
 Parse P to obtain list of new links N.
 Append N to the end of Q.
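A minimal sketch of this loop in Python, using only the standard library; crawl(), LinkParser and the index() stub are illustrative names, and a real crawler would also honor robots.txt and politeness delays.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def index(url, page):
    """Stub standing in for 'Index P' (e.g. add to an inverted index)."""
    print(f"indexed {url} ({len(page)} bytes)")

def crawl(seeds, page_limit=100):
    queue = deque(seeds)                 # Q: the crawl frontier
    visited = set()                      # URLs already processed
    while queue and len(visited) < page_limit:
        url = queue.popleft()            # pop L from the front of Q
        if url in visited:               # already visited: get next URL
            continue
        if url.lower().endswith((".gif", ".jpeg", ".ps", ".pdf", ".ppt")):
            continue                     # not an HTML page
        try:
            page = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:                # e.g. 404 error
            continue
        visited.add(url)
        index(url, page)                 # index P
        parser = LinkParser()
        parser.feed(page)                # parse P to obtain new links N
        queue.extend(urljoin(url, link) for link in parser.links)  # append N to Q
    return visited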
Web crawler
 What is a web crawler?
 Why is web crawler required?
 How does web crawler work?
 Crawling strategies
     Breadth first search traversal
     Depth first search traversal
 Architecture of web crawler
 Crawling policies
 Distributed crawling
Alternative way of looking at the problem.

 The web is a huge directed graph, with
  documents as vertices and hyperlinks as
  edges.
 Need to explore the graph using a suitable
  graph traversal algorithm.
 W.r.t. the previous example: nodes are represented
  by rectangles and directed edges are
  drawn as arrows.
Given any graph and a set of seeds at which to start, the
  graph can be traversed using the following algorithm:

1. Put all the given seeds into the queue;
2. Prepare to keep a list of “visited” nodes (initially
   empty);
3. As long as the queue is not empty:
     a. Remove the first node from the queue;
     b. Append that node to the list of “visited” nodes;
     c. For each edge starting at that node:
        i. If the node at the end of the edge already appears on
           the list of “visited” nodes or is already in the queue,
           do nothing more with that edge;
        ii. Otherwise, append the node at the end of the edge
            to the end of the queue.
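The same breadth-first procedure written as a short Python sketch over a toy adjacency-list graph (the graph itself is made up for illustration):

from collections import deque

def bfs(graph, seeds):
    """Breadth-first traversal: graph maps each node to the nodes its edges point to."""
    queue = deque(seeds)          # step 1: put all seeds into the queue
    visited = []                  # step 2: list of "visited" nodes, initially empty
    while queue:                  # step 3
        node = queue.popleft()    # 3a: remove the first node from the queue
        visited.append(node)      # 3b: append it to the visited list
        for neighbour in graph.get(node, []):          # 3c: for each outgoing edge
            if neighbour not in visited and neighbour not in queue:
                queue.append(neighbour)                # 3c-ii: append unseen node
    return visited

# Toy example: pages A..E with hyperlinks as directed edges.
pages = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"], "D": [], "E": ["A"]}
print(bfs(pages, ["A"]))   # ['A', 'B', 'C', 'D', 'E']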
Web crawler
 What is a web crawler?
 Why is web crawler required?
 How does web crawler work?
 Crawling strategies
     Breadth first search traversal
     Depth first search traversal
 Architecture of web crawler
 Crawling policies
 Parallel crawling
Use the depth first search (DFS) algorithm:
•   Get the first unvisited link from the start
    page.
•   Visit the link and get its first unvisited link.
•   Repeat the above step until no unvisited links remain.
•   Go back to the next unvisited link in the previous
    level and repeat the second step.
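A corresponding depth-first sketch in Python, recursive and over the same kind of toy adjacency-list graph (illustrative only):

def dfs(graph, node, visited=None):
    """Depth-first traversal: follow the first unvisited link as deep as possible,
    then backtrack to the previous level and take the next unvisited link."""
    if visited is None:
        visited = []
    visited.append(node)
    for neighbour in graph.get(node, []):
        if neighbour not in visited:
            dfs(graph, neighbour, visited)
    return visited

pages = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"], "D": [], "E": []}
print(dfs(pages, "A"))   # ['A', 'B', 'D', 'C', 'E']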
Web crawler
 Depth-first goes off into one branch until it
  reaches a leaf node.
      Not good if the goal node is on another branch.
      Neither complete nor optimal.
      Uses much less space than breadth-first:
          far fewer visited nodes to keep track of,
          smaller fringe.

 Breadth-first is more careful by checking all
  alternatives.
      Complete and optimal.
      Very memory-intensive.
 What is a web crawler?
 Why is web crawler required?
 How does web crawler work?
 Crawling strategies
     Breadth first search traversal
     Depth first search traversal
 Architecture of web crawler
 Crawling policies
 Distributed crawling
Web crawler
[Architecture diagram] The crawl thread fetches a page from the web (www → Fetch),
using DNS to resolve host names; Parse extracts text and links; “Content Seen?”
checks the page against the document fingerprint store; the URL Filter consults the
robots templates; and Dup URL Elim checks the URL set before surviving links are
added back to the URL Frontier, which feeds the next fetch.
 URL Frontier: contains the URLs yet to be fetched
  in the current crawl. At first, a seed set is stored
  in the URL Frontier, and the crawler begins by taking a
  URL from the seed set.
 DNS: domain name service resolution. Looks up the IP
  address for domain names.
 Fetch: generally uses the HTTP protocol to fetch
  the URL.
 Parse: the page is parsed. Text (and images, videos,
  etc.) and links are extracted.
 Content Seen?: tests whether a web page
  with the same content has already been seen
  at another URL. This needs a way to
  measure the fingerprint of a web page.
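One simple, simplified way to fingerprint page content is to hash it; production crawlers typically use shingling or similar near-duplicate techniques, but a plain hash illustrates the idea (content_seen() is an illustrative name):

import hashlib

seen_fingerprints = set()

def content_seen(page_text: str) -> bool:
    """Return True if a page with identical content was already crawled."""
    fingerprint = hashlib.sha256(page_text.encode("utf-8")).hexdigest()
    if fingerprint in seen_fingerprints:
        return True
    seen_fingerprints.add(fingerprint)
    return False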
 URL Filter:
   Decides whether the extracted URL should be excluded
    from the frontier (robots.txt).
   The URL should be normalized (relative links resolved), e.g. on
    en.wikipedia.org/wiki/Main_Page the link
    <a href="/wiki/Wikipedia:General_disclaimer"
     title="Wikipedia:General
     disclaimer">Disclaimers</a>
    resolves to en.wikipedia.org/wiki/Wikipedia:General_disclaimer.
 Dup URL Elim: the URL is checked for duplicate
  elimination.
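A hedged sketch of the normalization and duplicate-elimination steps using Python's standard urllib.parse; the Wikipedia URLs are the ones from the slide above, and the function names are illustrative:

from urllib.parse import urljoin, urldefrag

crawled_urls = set()   # the "URL set" consulted by Dup URL Elim

def normalize(base_url: str, href: str) -> str:
    """Resolve a relative link against the page it was found on and drop fragments."""
    absolute, _fragment = urldefrag(urljoin(base_url, href))
    return absolute

def is_new_url(url: str) -> bool:
    """Duplicate URL elimination: only pass URLs we have not queued before."""
    if url in crawled_urls:
        return False
    crawled_urls.add(url)
    return True

base = "https://en.wikipedia.org/wiki/Main_Page"
link = "/wiki/Wikipedia:General_disclaimer"
print(normalize(base, link))   # https://en.wikipedia.org/wiki/Wikipedia:General_disclaimer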
 What is a web crawler?
 Why is web crawler required?
 How does web crawler work?
 Crawling strategies
     Breadth first search traversal
     Depth first search traversal
 Architecture of web crawler
 Crawling policies
 Distributed crawling
 Selection Policy that states which pages to
  download.
 Re-visit Policy that states when to check for
  changes to the pages.
 Politeness Policy that states how to avoid
  overloading Web sites.
 Parallelization Policy that states how to
  coordinate distributed Web crawlers.
 Search engines cover only a fraction of the Internet.
  This requires downloading the relevant pages, hence a
  good selection policy is very important.
 Common selection policies:
     Restricting followed links
     Path-ascending crawling
     Focused crawling
     Crawling the Deep Web
 The web is dynamic; crawling takes a long time.
 Cost factors play an important role in crawling.
 Freshness and Age are commonly used cost functions.
 Objective of the crawler: high average freshness and
  low average age of web pages.
 Two re-visit policies:
    Uniform policy
    Proportional policy
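One commonly used formalization of these two cost functions (stated here as a sketch; notation varies across the literature). For a page p in the crawler's collection at time t:

    Freshness: F_p(t) = 1 if the local copy of p is up-to-date at time t, 0 otherwise.
    Age:       A_p(t) = 0 if the local copy of p is up-to-date at time t,
                        t − (time p was last modified) otherwise.

The crawler then tries to keep the average freshness high and the average age low over all pages in its collection.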
   Crawlers can have a crippling impact on the
    overall performance of a site.
   The costs of using Web crawlers include:
        Network resources
        Server overload
        Server/router crashes
        Network and server disruption
   A partial solution to these problems is the robots
    exclusion protocol.
 How to control those robots!
 Web sites and pages can specify that robots
 should not crawl/index certain areas.
 Two components:
    Robots Exclusion Protocol (robots.txt): Site wide
     specification of excluded directories.
    Robots META Tag: Individual document tag to
     exclude indexing or following links.
 Site administrator puts a “robots.txt” file at
  the root of the host’s web directory.
     http://www.ebay.com/robots.txt
     http://www.cnn.com/robots.txt
     http://clgiles.ist.psu.edu/robots.txt
 The file is a list of excluded directories for a
  given robot (user-agent).
  Exclude all robots from the entire site:
     User-agent: *
     Disallow: /
  Newer syntax also supports an Allow: directive.

 Find some interesting robots.txt files.
 Exclude specific directories:
   User-agent: *
   Disallow: /tmp/
   Disallow: /cgi-bin/
   Disallow: /users/paranoid/
 Exclude a specific robot:
   User-agent: GoogleBot
   Disallow: /
 Allow a specific robot:
   User-agent: GoogleBot
   Disallow:

   User-agent: *
   Disallow: /
 Use blank lines only to separate different
  User-agent records.
 One directory per “Disallow” line.
 No regex (regular expression) patterns in
  directories.
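Crawlers typically check these rules before fetching a URL; Python's standard library ships a parser for this format. A minimal sketch (the eBay URL follows the example on the earlier slide and is only illustrative):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.ebay.com/robots.txt")
rp.read()                                      # fetch and parse the robots.txt file

# Ask whether a given user-agent may fetch a given URL.
print(rp.can_fetch("*", "http://www.ebay.com/some/listing"))
print(rp.can_fetch("GoogleBot", "http://www.ebay.com/"))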
   The crawler runs multiple processes in parallel.
   The goal is:
      To maximize the download rate.
      To minimize the overhead from parallelization.
      To avoid repeated downloads of the same page.

   The crawling system requires a policy for assigning
    the new URLs discovered during the crawling
    process.
 What is a web crawler?
 Why is web crawler required?
 How does web crawler work?
 Mechanism used
     Breadth first search traversal
     Depth first search traversal
 Architecture of web crawler
 Crawling policies
 Distributed crawling
Web crawler
   A distributed computing technique whereby
    search engines employ many computers to index
    the Internet via web crawling.

   The idea is to spread out the required resources
    of computation and bandwidth to many
    computers and networks.

   Types of distributed web crawling:
     1. Dynamic Assignment
     2. Static Assignment
 With dynamic assignment, a central server assigns new URLs to
  different crawlers dynamically. This allows the
  central server to dynamically balance the load of
  each crawler.
 Configurations of crawling architectures with
  dynamic assignment:
• A small crawler configuration, in which there is
  a central DNS resolver and central queues per
  website, and distributed downloaders.
• A large crawler configuration, in which the DNS
  resolver and the queues are also distributed.
• Here a fixed rule, stated from the beginning of
  the crawl, defines how to assign new URLs to
  the crawlers.
•  A hashing function can be used to transform URLs
   into a number that corresponds to the index of
   the corresponding crawling process.
•  To reduce the overhead due to the exchange of
   URLs between crawling processes, when links
   point from one website to another, the
   exchange should be done in batches.
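A minimal sketch of such a hashing rule (the names and crawler count are assumptions; a real system would use a stable hash and handle re-assignment when crawlers join or leave):

import hashlib
from urllib.parse import urlparse

NUM_CRAWLERS = 4   # assumed number of crawling processes

def assign_crawler(url: str) -> int:
    """Map a URL to a crawler index by hashing its host name.
    Hashing the host (not the full URL) keeps each site on one crawler,
    so links within a site rarely need to be exchanged between processes."""
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_CRAWLERS

print(assign_crawler("https://example.org/page1"))   # same index for every example.org URL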
 Focused crawling was first introduced by
  Chakrabarti.
 A focused crawler ideally would like to download
  only web pages that are relevant to a particular
  topic and avoid downloading all others.
 It assumes that some labeled examples of
  relevant and not relevant pages are available.
 A focused crawler predicts the probability that a
  link to a particular page is relevant before
  actually downloading the page. A possible
  predictor is the anchor text of links.

 In another approach, the relevance of a page is
  determined after downloading its content.
  Relevant pages are sent to content indexing and
  their contained URLs are added to the crawl
  frontier; pages that fall below a relevance
  threshold are discarded.
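As an illustration of the first approach, a toy predictor that scores a link by how many topic keywords appear in its anchor text; the keyword list and the idea of a fixed threshold are made up for the example, and real focused crawlers typically use trained classifiers:

TOPIC_KEYWORDS = {"crawler", "search", "indexing", "spider"}   # assumed topic vocabulary

def anchor_relevance(anchor_text: str) -> float:
    """Fraction of words in the anchor text that are topic keywords."""
    words = anchor_text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in TOPIC_KEYWORDS)
    return hits / len(words)

# Only enqueue links whose predicted relevance exceeds a chosen threshold.
print(anchor_relevance("open source web crawler and spider"))   # ~0.33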
 Yahoo! Slurp: Yahoo Search crawler.
 Msnbot: Microsoft's Bing web crawler.
 Googlebot : Google’s web crawler.
 WebCrawler : Used to build the first publicly-
  available full-text index of a subset of the Web.
 World Wide Web Worm : Used to build a simple
  index of document titles and URLs.
 Web Fountain: Distributed, modular crawler
  written in C++.
 Slug: Semantic web crawler
1) Draw a neat labeled diagram to explain how a
   web crawler works.
2) What is the function of a crawler?
3) How does the crawler know if it can crawl and index
   data from a website? Explain.
4) Write a note on robots.txt.
5) Discuss the architecture of a search engine.
6) Explain the difference between a crawler and a focused
   crawler.
Web crawler
