IN THE NAME OF GOD
WEB CRAWLING
PRESENTED BY:
Amir Masoud Sefidian
Shahid Rajaee Teacher Training University
Faculty of Computer Engineering
Today’s Lecture Content
• What is Web Crawling (Crawler) and the Objective of Crawling
• Our goal in this presentation
• Listing desiderata for web crawlers
• Basic operation of any hypertext crawler
• Crawler Architecture Modules & Working Cycle
• URL Tests
• Housekeeping Tasks
• Distributed web crawler
• DNS resolution
• URL Frontier details
• Several Types of Crawlers
Today’s Lecture Content
• What is Web Crawling (Crawler) and the Objective of Crawling
• Our goal in this presentation
• Listing desiderata for web crawlers
• Basic operation of any hypertext crawler
• Crawler Architecture Modules & Working Cycle
• URL Tests
• Housekeeping Tasks
• Distributed web crawler
• DNS resolution
• URL Frontier details
• Several Types of Crawlers
• Web crawling is the process by which we gather pages from the
Web, in order to index them and support a search engine.
• A web crawler is a computer program that browses the World Wide
Web in a methodical, automated manner.
• Objective of crawling:
Quickly and efficiently gather as many useful web pages as possible, together
with the link structure that interconnects them.
Create and repopulate a search engine’s data by navigating the web,
downloading documents and files.
Other names:
web robot
web spider
harvester
bot
indexer
web agent
wanderer
Today’s Lecture Content
• What is Web Crawling (Crawler) and the Objective of Crawling
• Our goal in this presentation
• Listing desiderata for web crawlers
• Basic operation of any hypertext crawler
• Crawler Architecture Modules & Working Cycle
• URL Tests
• Housekeeping Tasks
• Distributed web crawler
• DNS resolution
• URL Frontier details
• Several Types of Crawlers
• Our goal is not to describe how to build a crawler for a full-scale
commercial web search engine.
• We focus on a range of issues that are generic to crawling from the
student project scale to substantial research projects.
Today’s Lecture Content
• What is Web Crawling (Crawler) and the Objective of Crawling
• Our goal in this presentation
• Listing desiderata for web crawlers
• Basic operation of any hypertext crawler
• Crawler Architecture Modules & Working Cycle
• URL Tests
• Housekeeping Tasks
• Distributed web crawler
• DNS resolution
• URL Frontier details
• Several Types of Crawlers
Desiderata for web crawlers
Features a crawler must provide:
• Robustness:
• Crawlers must be designed to be resilient to spider traps:
• Infinitely deep directory structures: http://foo.com/bar/foo/bar/foo/...
• Pages filled with a large number of characters.
• Politeness:
Crawlers should respect Web servers’ implicit and explicit policies:
• Explicit politeness: specifications from webmasters on what portions of a site can be crawled.
• Implicit politeness: even with no specification, avoid hitting any site too often.
Features a crawler should provide:
• Distributed:
execute in a distributed fashion across multiple machines.
• Scalable:
should permit scaling up the crawl rate by adding extra machines and bandwidth.
• Performance and efficiency:
Efficient use of various system resources including processor, storage and network bandwidth.
• Quality:
The crawler should be biased towards fetching “useful” pages first.
• Freshness:
In many applications, the crawler should operate in continuous mode: it should obtain fresh copies of
previously fetched pages. A search engine crawler, for instance, can thus ensure that the search engine’s
index contains a fairly current representation of each indexed web page.
• Extensible:
Crawlers should be designed to be extensible in many ways to cope with new data formats (e.g.,
XML-based formats), new fetch protocols (e.g., FTP), and so on. This demands that the crawler
architecture be modular.
Basic properties any non-professional crawler should satisfy:
 1. Only one connection should be open to any given host at a time.
 2. A waiting time of a few seconds should occur between successive
requests to a host.
 3. Politeness restrictions should be obeyed.
Reference point:
 Fetching a billion pages (a small fraction of the static Web at present) in a
month-long crawl requires fetching several hundred pages each second:
10^9 pages ÷ (30 days × 86,400 s/day) ≈ 386 pages per second.
This motivates a multi-threaded design.
The Mercator crawler has formed the basis of a number of
research and commercial crawlers.
Today’s Lecture Content
• What is Web Crawling (Crawler) and the Objective of Crawling
• Our goal in this presentation
• Listing desiderata for web crawlers
• Basic operation of any hypertext crawler
• Crawler Architecture Modules & Working Cycle
• URL Tests
• Housekeeping Tasks
• Distributed web crawler
• DNS resolution
• URL Frontier details
• Several Types of Crawlers
Basic operation of any hypertext crawler
• The crawler begins with one or more URLs that constitute a seed set.
• Picks a URL from the seed set, then fetches the web page at that URL.
The fetched page is parsed to extract the text and the links from the page.
• The extracted text is fed to a text indexer.
• The extracted links (URLs) are then added to a URL frontier, which at
all times consists of URLs whose corresponding pages have yet to be
fetched by the crawler.
• Initially URL frontier = SEED SET
• As pages are fetched, the corresponding URLs are deleted from the
URL frontier.
• In continuous crawling, the URL of a fetched page is added back to
the frontier for fetching again in the future.
• The entire process may be viewed as traversing the web graph.
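A minimal sketch of this loop in Python (single-threaded, with `index_text` as an assumed stub for the indexer; a real crawler adds politeness, robustness, and distribution):

```python
import urllib.request
from collections import deque
from urllib.parse import urljoin
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def index_text(url, text):
    """Stub for the text indexer; a real system would index the page here."""
    pass

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)              # initially, URL frontier = seed set
    seen = set(seed_urls)                    # URLs ever added to the frontier
    while frontier and max_pages > 0:
        url = frontier.popleft()             # pick a URL from the frontier
        try:
            page = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue                         # skip URLs that fail to fetch
        index_text(url, page)                # feed extracted text to the indexer
        extractor = LinkExtractor()
        extractor.feed(page)
        for link in extractor.links:
            absolute = urljoin(url, link)    # resolve relative links against the page URL
            if absolute not in seen:         # don't re-add URLs already in the frontier
                seen.add(absolute)
                frontier.append(absolute)
        max_pages -= 1
```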
CRAWLING THE WEB (figure)
Today’s Lecture Content
• What is Web Crawling (Crawler) and the Objective of Crawling
• Our goal in this presentation
• Listing desiderata for web crawlers
• Basic operation of any hypertext crawler
• Crawler Architecture Modules & Working Cycle
• URL Tests
• Housekeeping Tasks
• Distributed web crawler
• DNS resolution
• URL Frontier details
• Several Types of Crawlers
Crawler Architecture Modules
 Crawling is performed by anywhere from one to potentially hundreds
of threads, each of which loops through the logical cycle.
 Threads may be run in a single process, or be partitioned amongst
multiple processes running at different nodes of a distributed system.
Web Crawler Cycle
• A crawler thread takes a URL from the frontier and fetches the web page
at that URL (generally using the HTTP protocol).
• The fetched page is then written into a temporary store.
• The text is passed on to the indexer.
• Link information including anchor text is also passed on to the
indexer for use in ranking.
• Each extracted link goes through a series of tests (filters) to
determine whether it should be added to the URL frontier.
Today’s Lecture Content
• What is Web Crawling (Crawler) and the Objective of Crawling
• Our goal in this presentation
• Listing desiderata for web crawlers
• Basic operation of any hypertext crawler
• Crawler Architecture Modules & Working Cycle
• URL Tests
• Housekeeping Tasks
• Distributed web crawler
• DNS resolution
• URL Frontier details
• Several Types of Crawlers
URL Tests
• Tests to determine whether the link should be added to the URL
frontier:
• 1) As many as 40% of the pages on the Web are duplicates of other pages. Tests
whether a web page with the same content has already been seen at
another URL. How to test?
• simplest implementation: a simple fingerprint such as a checksum (placed in a store
labeled “Doc FP’s” in the architecture figure).
• more sophisticated test: use shingles.
• 2) A URL filter is used to determine whether the extracted URL should be
excluded from the frontier based on one of several tests.
• Crawler may seek to exclude certain domains (say, all .com URLs).
• Test could be inclusive rather than exclusive.
• Sites can declare areas off-limits to crawling under a standard known as the Robots
Exclusion Protocol, by placing a robots.txt file at the root of the URL hierarchy at the site.
• robots.txt files are cached to avoid refetching them for every extracted URL.
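A sketch of these two tests in Python: content fingerprinting via a simple checksum (a stand-in for shingling) and a robots.txt filter built on the standard library's parser. The `doc_fps` and `robots_cache` stores are assumed in-memory structures:

```python
import hashlib
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

doc_fps = set()       # "Doc FP's" store: fingerprints of page contents seen so far
robots_cache = {}     # cache of parsed robots.txt, one entry per host

def content_already_seen(page_text: str) -> bool:
    """Duplicate-content test using a simple fingerprint (checksum)."""
    fp = hashlib.md5(page_text.encode("utf-8")).hexdigest()
    if fp in doc_fps:
        return True
    doc_fps.add(fp)
    return False

def allowed_by_robots(url: str, agent: str = "MyCrawler") -> bool:
    """URL filter: honor the Robots Exclusion Protocol, caching robots.txt per host."""
    host = urlparse(url).netloc
    if host not in robots_cache:
        rp = RobotFileParser(f"http://{host}/robots.txt")
        try:
            rp.read()                 # fetch and parse robots.txt once per host
        except Exception:
            rp.allow_all = True       # unreachable robots.txt: treat as allow
        robots_cache[host] = rp
    return robots_cache[host].can_fetch(agent, url)
```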
URL Normalization & Duplicate Elimination
• Often the HTML encoding of a link from a web page p indicates the
target of that link relative to the page p.
• A relative link encoded thus in the HTML of the page
en.wikipedia.org/wiki/Main_Page:
• <a href="/wiki/Wikipedia:General_disclaimer" title="Wikipedia:General
disclaimer">Disclaimers</a>
• normalizes to the absolute URL
http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer
• The URL is checked for duplicate elimination:
• if the URL is already in the frontier or (in the case of a non-continuous
crawl) already crawled, we do not add it to the frontier.
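A brief sketch of normalization with Python's standard library (the `normalize` helper is hypothetical; the printed result matches the example above):

```python
from urllib.parse import urljoin, urldefrag

def normalize(page_url: str, href: str) -> str:
    """Resolve a relative link against its containing page and drop any #fragment."""
    absolute = urljoin(page_url, href)    # "/wiki/X" on page p becomes "http://host/wiki/X"
    url, _fragment = urldefrag(absolute)  # fragments never change the fetched page
    return url

# The Wikipedia example above:
page = "http://en.wikipedia.org/wiki/Main_Page"
print(normalize(page, "/wiki/Wikipedia:General_disclaimer"))
# -> http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer
```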
Today’s Lecture Content
• What is Web Crawling (Crawler) and the Objective of Crawling
• Our goal in this presentation
• Listing desiderata for web crawlers
• Basic operation of any hypertext crawler
• Crawler Architecture Modules & Working Cycle
• URL Tests
• Housekeeping Tasks
• Distributed web crawler
• DNS resolution
• URL Frontier details
• Several Types of Crawlers
Housekeeping Tasks
Certain housekeeping tasks are typically performed by a
dedicated thread, which is generally quiescent except that it wakes
up once every few seconds to:
 Log crawl progress statistics (URLs crawled,
frontier size, etc.).
 Decide whether to terminate the crawl or (once every few hours of
crawling) checkpoint the crawl.
 In checkpointing, a snapshot of the crawler’s state is committed to
disk.
 In the event of a catastrophic crawler failure, the crawl is restarted
from the most recent checkpoint.
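A minimal checkpointing sketch, assuming the resumable state is just the frontier and the seen-URL set (a real crawler snapshots far more); the checkpoint file name is an assumption:

```python
import os
import pickle
import tempfile

CHECKPOINT = "crawl.ckpt"   # assumed checkpoint file name

def checkpoint(frontier, seen):
    """Commit a snapshot of crawler state to disk, atomically via rename."""
    fd, tmp = tempfile.mkstemp(dir=".")
    with os.fdopen(fd, "wb") as f:
        pickle.dump({"frontier": list(frontier), "seen": seen}, f)
    os.replace(tmp, CHECKPOINT)         # readers never observe a partial snapshot

def restore():
    """Restart from the most recent checkpoint after a catastrophic failure."""
    with open(CHECKPOINT, "rb") as f:
        state = pickle.load(f)
    return state["frontier"], state["seen"]
```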
Today’s Lecture Content
• What is Web Crawling (Crawler) and the Objective of Crawling
• Our goal in this presentation
• Listing desiderata for web crawlers
• Basic operation of any hypertext crawler
• Crawler Architecture Modules & Working Cycle
• URL Tests
• Housekeeping Tasks
• Distributed web crawler
• DNS resolution
• URL Frontier details
• Several Types of Crawlers
Distributing the crawler
• The crawler can run as multiple processes, each at a different
node of a distributed crawling system:
• This is essential for scaling.
• It can also be of use in a geographically distributed crawler system where
each node crawls hosts “near” it.
• Partitioning the hosts being crawled amongst the crawler nodes can
be done by:
• 1) hash function.
• 2) some more specifically tailored policy.
• How do the various nodes of a distributed crawler
communicate and share URLs?
Use a host splitter to dispatch each surviving URL to the crawler node
responsible for the URL.
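A sketch of hash-based host partitioning for the host splitter; the node count and hash choice are illustrative assumptions:

```python
import hashlib
from urllib.parse import urlparse

NUM_NODES = 4   # assumed number of crawler nodes

def node_for_url(url: str) -> int:
    """Host splitter: every URL on the same host maps to the same crawler node."""
    host = urlparse(url).netloc
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_NODES

# All en.wikipedia.org URLs are dispatched to one fixed node.
print(node_for_url("http://en.wikipedia.org/wiki/Main_Page"))
```

Hashing the host rather than the full URL keeps all of a host's URLs at one node, so politeness for that host can be enforced locally.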
Distributed Crawler Architecture (figure)
Host Splitter (figure)
Today’s Lecture Content
• What is Web Crawling (Crawler) and the Objective of Crawling
• Our goal in this presentation
• Listing desiderata for web crawlers
• Basic operation of any hypertext crawler
• Crawler Architecture Modules & Working Cycle
• URL Tests
• Housekeeping Tasks
• Distributed web crawler
• DNS resolution
• URL Frontier details
• Several Types of Crawlers
DNS resolution
Each web server (and indeed any host connected to the internet) has a
unique IP address (sequence of four bytes generally represented as four
integers separated by dots).
DNS (Domain Name Service) resolution or DNS lookup:
Process of translating a URL in textual form to an IP address
www.wikipedia.org → 207.142.131.248
A program that wishes to perform this translation (in our case, a component
of the web crawler) contacts a DNS server that returns the translated IP
address.
DNS resolution is a well-known bottleneck in web crawling:
1) DNS resolution may entail multiple requests and round-trips across the
internet, requiring seconds and sometimes even longer.
URLs for which we have recently performed DNS lookups (recently asked
names) are likely to be found in the DNS cache, avoiding the need to go
to the DNS servers on the internet.
Standard remedy: CACHING
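A minimal caching resolver sketch using Python's socket module (this toy cache ignores DNS TTLs, which a real cache must honor):

```python
import socket

dns_cache = {}   # host -> IP address; a real cache would expire entries per TTL

def resolve(host: str) -> str:
    """Return the IP address for a host, consulting the cache first."""
    if host in dns_cache:
        return dns_cache[host]        # recently asked name: no network round-trip
    ip = socket.gethostbyname(host)   # may take seconds on a cold lookup
    dns_cache[host] = ip
    return ip

print(resolve("www.wikipedia.org"))   # cold: contacts the DNS servers
print(resolve("www.wikipedia.org"))   # warm: answered from the cache
```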
DNS resolution (continued)
2) Lookup implementations in standard libraries are generally synchronous:
Once a request is made to the Domain Name Service, other crawler
threads at that node are blocked until the first request is completed.
Solution:
• Most web crawlers implement their own DNS resolver as a component
of the crawler.
• Thread i executing the resolver code sends a message to the DNS server
and then performs a timed wait.
• it resumes either when being signaled by another thread or when a set
time quantum expires.
• A single separate thread listens on a standard DNS port (port 53) for incoming response
packets from the name service.
• A crawler thread that resumes because its wait time quantum has
expired retries for a fixed number of attempts, sending out a new
message to the DNS server and performing a timed wait each time.
• The time quantum of the wait increases exponentially with each of
these attempts.
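A sketch of this timed wait with exponential backoff; `send_query` and the listener thread that sets `response_ready` are assumed to exist elsewhere:

```python
import threading

MAX_ATTEMPTS = 5     # assumed retry limit
BASE_TIMEOUT = 1.0   # assumed initial wait quantum, in seconds

def resolve_with_backoff(host, send_query, response_ready: threading.Event):
    """Send a DNS query and perform a timed wait, doubling the quantum per retry.

    send_query(host) is a hypothetical function that sends one query packet;
    response_ready is set by the separate listener thread when a reply arrives.
    """
    timeout = BASE_TIMEOUT
    for _attempt in range(MAX_ATTEMPTS):
        response_ready.clear()
        send_query(host)                   # new message to the DNS server
        if response_ready.wait(timeout):   # resumes early when signaled by the listener
            return True
        timeout *= 2                       # quantum increases exponentially per attempt
    return False                           # give up after a fixed number of attempts
```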
Today’s Lecture Content
• What is Web Crawling (Crawler) and the Objective of Crawling
• Our goal in this presentation
• Listing desiderata for web crawlers
• Basic operation of any hypertext crawler
• Crawler Architecture Modules & Working Cycle
• URL Tests
• Housekeeping Tasks
• Distributed web crawler
• DNS resolution
• URL Frontier details
• Several Types of Crawlers
The URL frontier
Maintains the URLs in the frontier and regurgitates them in some order
whenever a crawler thread seeks a URL.
Two important considerations govern the order in which URLs are returned
by the frontier:
1) Prioritization:
high-quality pages that change frequently should be prioritized for
frequent crawling.
Priority of a URL in the URL frontier is a function of (a combination is necessary):
• Change rate.
• Quality.
2) Politeness:
• Crawler must avoid repeated fetch requests to a host within a short
time span.
• The likelihood of this is exacerbated because of a form of locality of
reference: many URLs link to other URLs at the same host.
• A common heuristic is to insert a gap between successive fetch
requests to a host that is an order of magnitude larger than the time
taken for the most recent fetch from that host.
The URL frontier
A polite and prioritizing implementation
of a URL frontier:
1. only one connection is open at a
time to any host.
2. a waiting time of a few seconds
occurs between successive requests
to a host and
3. high-priority pages are crawled
preferentially.
 The two major sub-modules:
 F front queues: implement prioritization
 B back queues: implement politeness
 All queues are FIFO.
Front Queues
 The prioritizer assigns to the URL an integer priority i between 1 and F,
based on its fetch history (taking into account the rate at which the
web page at this URL has changed between previous crawls).
A document that changes more frequently gets a higher priority.
 A URL with assigned priority i is appended to the i-th of the front
queues.
Back Queues
 Each of the B back queues maintains the following invariants:
• it is nonempty while the crawl is in progress and
• it only contains URLs from a single host.
 An auxiliary table T maps hosts to back queues.
 Whenever a back-queue is empty and is being re-filled from a front-
queue, T must be updated accordingly.
 When one of the back FIFOs becomes empty:
 The back-queue router requests a URL from the front queues.
 The router checks whether a back queue already exists for that URL’s host:
• True → submit the URL to that queue and request another URL from the front queues.
• False → submit the URL to the empty queue.
 This process continues until all back queues are non-empty.
 The number of front queues, together with the policy of assigning
priorities and picking queues, determines the priority properties.
 The number of back queues governs the extent to which we can keep all
crawl threads busy while respecting politeness.
 Maintain a heap with one entry for each back queue, keyed on the
earliest time at which the corresponding host may be contacted again.
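A compact sketch of the politeness half of this design in Python: a heap of (next allowed fetch time, host) entries gates per-host back queues. The gap constant is an assumption, and front-queue prioritization is omitted for brevity:

```python
import heapq
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

POLITENESS_GAP = 2.0             # assumed gap (seconds) between requests to one host

back_queues = defaultdict(deque) # host -> FIFO of that host's URLs (table T is implicit)
heap = []                        # (earliest allowed fetch time, host), one entry per back queue

def add_url(url: str):
    """Route a URL to its host's back queue, creating the queue and heap entry if new."""
    host = urlparse(url).netloc
    if host not in back_queues:
        heapq.heappush(heap, (time.monotonic(), host))  # new host is fetchable immediately
    back_queues[host].append(url)

def next_url() -> str:
    """Hand out the URL whose host's politeness window opens earliest."""
    t, host = heapq.heappop(heap)
    delay = t - time.monotonic()
    if delay > 0:
        time.sleep(delay)                 # crawl thread waits out the politeness gap
    url = back_queues[host].popleft()
    if back_queues[host]:                 # host still has URLs: reschedule after the gap
        heapq.heappush(heap, (time.monotonic() + POLITENESS_GAP, host))
    else:
        del back_queues[host]             # empty back queue: a refill would pull from the front queues
    return url
```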
Today’s Lecture Content
• What is Web Crawling (Crawler) and the Objective of Crawling
• Our goal in this presentation
• Listing desiderata for web crawlers
• Basic operation of any hypertext crawler
• Crawler Architecture Modules & Working Cycle
• URL Tests
• Housekeeping Tasks
• Distributed web crawler
• DNS resolution
• URL Frontier details
• Several Types of Crawlers
Several Types of Crawlers
BFS or DFS Crawling
 Crawl the crawl space until reaching a certain size or time limit.
Repetitive (Continuous) Crawling
 Revisit URLs to ensure freshness.
Targeted (Focused) Crawling
 Attempt to crawl pages pertaining to some topic, while minimizing the
number of off-topic pages that are collected.
Deep (Hidden) Web Crawling
 Private sites (require login).
 Scripted pages.
 Data present in databases may only be downloaded through
appropriate requests or forms.
QUESTIONS?