Crawling the Web
Web pages
  • A few thousand characters long
  • Served over the Internet using the Hypertext Transfer Protocol (HTTP)
  • Viewed at the client end using browsers
Crawler
  • Fetches pages to a computer
  • At that computer, automatic programs can analyze hypertext documents
HTML
HyperText Markup Language
Lets the author
  • specify layout and typeface
  • embed diagrams
  • create hyperlinks
      – expressed as an anchor tag with an HREF attribute
      – HREF names another page using a Uniform Resource Locator (URL)
  • URL = protocol field (“http”) + server hostname (“www.cse.iitb.ac.in”) + file path (/, the “root” of the published file system)
HTTP (Hypertext Transfer Protocol)
• Built on top of the Transmission Control Protocol (TCP)
• Steps (from the client end):
    – Resolve the server host name to an Internet (IP) address
        · Use the Domain Name System (DNS)
        · DNS is a distributed database of name-to-IP mappings maintained at a set of known servers
    – Contact the server using TCP
        · Connect to the default HTTP port (80) on the server
        · Send the HTTP request header (e.g., GET)
        · Fetch the response header
            – MIME (Multipurpose Internet Mail Extensions): a meta-data standard for email and Web content transfer
        · Fetch the HTML page (a minimal sketch of these steps follows)
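The steps above can be illustrated with a short sketch using Python's standard socket module. This is a minimal illustration, not the book's implementation: the host in the usage comment is just an example, and HTTP/1.0 is used so the whole response can simply be read until the server closes the connection.

```python
import socket

def fetch(host, path="/"):
    """Resolve the host, connect to port 80, send a GET, read the full reply."""
    ip = socket.gethostbyname(host)              # DNS resolution
    sock = socket.create_connection((ip, 80))    # TCP connection to the HTTP port
    request = (f"GET {path} HTTP/1.0\r\n"
               f"Host: {host}\r\n"
               f"\r\n").encode("ascii")
    sock.sendall(request)                        # send the HTTP request header
    chunks = []
    while True:                                  # read until the server closes the socket
        data = sock.recv(4096)
        if not data:
            break
        chunks.append(data)
    sock.close()
    return b"".join(chunks)                      # response header (incl. MIME type) + HTML page

# Example (hypothetical host):
# print(fetch("www.cse.iitb.ac.in")[:200])
```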
Crawl “all” Web pages?
• Problem: no catalog of all accessible URLs on the Web.
• Solution (sketched below):
    – Start from a given set of seed URLs
    – Progressively fetch and scan them for new out-linking URLs
    – Fetch these pages in turn
    – Submit the text in each page to a text indexing system
    – and so on…
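A minimal sketch of this fetch-scan-enqueue loop, assuming only Python's standard library (urllib, html.parser). index_text is a placeholder for the text indexing system; a real crawler would add politeness, robots.txt checks and URL canonicalization, covered later in these slides.

```python
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the HREF targets of anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def index_text(url, text):
    pass  # placeholder: hand the text to a text indexing system

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)          # URLs waiting to be fetched
    seen = set(seed_urls)                # URLs already known
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            page = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue                     # skip unreachable or malformed URLs
        fetched += 1
        index_text(url, page)            # submit the text to the indexing system
        parser = LinkExtractor()
        parser.feed(page)
        for link in parser.links:        # scan for new out-linking URLs
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
```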
Crawling procedure
• Simple, but
    – a great deal of engineering goes into industry-strength crawlers
    – industry crawlers crawl a substantial fraction of the Web
    – e.g.: AltaVista, Northern Light, Inktomi
• No guarantee that all accessible Web pages will be located in this fashion
• Crawler may never halt
    – pages will be added continually even as it is running
Crawling overheads
• Delays involved in
    – resolving the host name in the URL to an IP address using DNS
    – connecting a socket to the server and sending the request
    – receiving the requested page in response
• Solution: overlap these delays by fetching many pages at the same time
Anatomy of a crawler.
• Page fetching threads
    – start with DNS resolution
    – finish when the entire page has been fetched
• Each page is
    – stored in compressed form to disk/tape
    – scanned for outlinks
• Work pool of outlinks
    – maintain network utilization without overloading it (dealt with by a load manager)
• Continue till the crawler has collected a sufficient number of pages
Typical anatomy of a large-scale crawler.
Large-scale crawlers: performance and reliability considerations
• Need to fetch many pages at the same time
    – to utilize the network bandwidth
    – a single page fetch may involve several seconds of network latency
• Highly concurrent and parallelized DNS lookups
• Use of asynchronous sockets
    – explicit encoding of the state of a fetch context in a data structure
    – polling sockets to check for completion of network transfers
    – multi-processing or multi-threading: impractical
• Care in URL extraction
    – eliminating duplicates to reduce redundant fetches
    – avoiding “spider traps”
DNS caching, pre-fetching and resolution
• A customized DNS component with:
   1. Custom client for address resolution
   2. Caching server
   3. Prefetching client
Custom client for address resolution
• Tailored for concurrent handling of multiple outstanding requests
• Allows issuing many resolution requests together
    – polling at a later time for completion of individual requests (sketched below)
• Facilitates load distribution among many DNS servers
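A sketch of such a client, assuming a thread pool wrapped around the blocking resolver rather than the custom DNS-protocol client the slides describe: submit() issues a request and returns immediately, poll() checks later whether it has completed.

```python
import socket
from concurrent.futures import ThreadPoolExecutor

class DNSClient:
    """Issue many resolution requests together; poll individual requests later."""
    def __init__(self, workers=32):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._pending = {}                        # host name -> in-flight resolution

    def submit(self, host):
        """Start resolving a host name; returns immediately."""
        if host not in self._pending:
            self._pending[host] = self._pool.submit(socket.gethostbyname, host)

    def poll(self, host):
        """Return the IP address if the resolution has completed, else None."""
        future = self._pending.get(host)
        if future is not None and future.done():
            try:
                return future.result()
            except socket.gaierror:               # resolution failed
                return None
        return None
```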
Caching server
• With a large cache, persistent across DNS restarts
• Residing largely in memory if possible
Prefetching client
• Steps
   1. Parse a page that has just been fetched
   2. Extract host names from HREF targets
   3. Make DNS resolution requests to the caching server
• Usually implemented using UDP
    – User Datagram Protocol: a connectionless, packet-based communication protocol
    – does not guarantee packet delivery
• Does not wait for resolution to be completed (sketched below)
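A sketch of the prefetching step, assuming a resolver object with a fire-and-forget submit() method like the client sketched earlier; nothing waits for the answers, which simply warm the cache before the pages are actually fetched.

```python
from urllib.parse import urlparse

def prefetch_dns(outlink_urls, dns_client):
    """Fire-and-forget DNS requests for host names found in HREF targets."""
    for url in outlink_urls:
        host = urlparse(url).hostname
        if host:
            dns_client.submit(host)    # do not wait for the resolution to complete
```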
Multiple concurrent fetches
• Managing multiple concurrent connections
    – a single download may take several seconds
    – open many socket connections to different HTTP servers simultaneously
• Multi-CPU machines not useful
    – crawling performance is limited by network and disk
• Two approaches
   1. multi-threading
   2. non-blocking sockets with event handlers
Multi-threading
• Logical threads
    – physical threads of control provided by the operating system (e.g., pthreads), OR
    – concurrent processes
• Fixed number of threads allocated in advance
• Programming paradigm (sketched below)
    – create a client socket
    – connect the socket to the HTTP service on a server
    – send the HTTP request header
    – read the socket (recv) until no more characters are available
    – close the socket
• Uses blocking system calls
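A sketch of this paradigm with a fixed pool of worker threads pulling URLs from a shared work pool. urllib performs the blocking connect / send / recv / close sequence internally; the queue names and thread count are illustrative.

```python
import queue
import threading
import urllib.request

NUM_THREADS = 16                       # fixed number of threads allocated in advance
work_pool = queue.Queue()              # shared pool of URLs to fetch
results = queue.Queue()                # fetched pages, handed on to storage/indexing

def worker():
    while True:
        url = work_pool.get()          # blocking call: waits until work is available
        try:
            # urllib performs the blocking connect / send / recv-until-EOF / close sequence
            page = urllib.request.urlopen(url, timeout=10).read()
            results.put((url, page))
        except OSError:
            pass                       # skip unreachable pages
        finally:
            work_pool.task_done()

threads = [threading.Thread(target=worker, daemon=True) for _ in range(NUM_THREADS)]
for t in threads:
    t.start()

# work_pool.put("http://www.cse.iitb.ac.in/")  # enqueue work; work_pool.join() waits for completion
```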
Multi-threading: Problems
• Performance penalty of
    – mutual exclusion
    – concurrent access to data structures
• Slow disk seeks
    – a great deal of interleaved, random input-output on disk
    – due to concurrent modification of the document repository by multiple threads
Non-blocking sockets and event handlers
• Non-blocking sockets
    – connect, send, or recv calls return immediately without waiting for the network operation to complete
    – poll the status of the network operation separately
• “select” system call (sketched below)
    – lets the application suspend until more data can be read from or written to the socket
    – times out after a pre-specified deadline
    – monitors several sockets at the same time
• More efficient memory management
    – code that completes processing is not interrupted by other completions
    – no need for locks and semaphores on the pool
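A sketch of the select()-based approach with non-blocking sockets: the fetch context for each transfer is just a dictionary entry, and no locks are needed because a single loop processes completions one at a time. Error handling is minimal and HTTP/1.0 is assumed for simplicity.

```python
import select
import socket

def fetch_many(targets, timeout=5.0):
    """targets: list of (host, path) pairs; one event loop polls every socket with select()."""
    conns = {}       # socket -> request bytes (the explicit fetch context)
    responses = {}   # socket -> bytes received so far
    pending = set()  # sockets whose request has not been sent yet
    for host, path in targets:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.setblocking(False)                       # connect/send/recv return immediately
        s.connect_ex((host, 80))                   # connection completes in the background
        conns[s] = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode()
        responses[s] = b""
        pending.add(s)
    while conns:
        readable, writable, _ = select.select(list(conns), list(pending), [], timeout)
        if not readable and not writable:
            break                                  # timed out after the pre-specified deadline
        for s in writable:                         # connection established: send the request
            try:
                s.sendall(conns[s])                # small header, fits in the send buffer
            except OSError:
                s.close()
                conns.pop(s, None)
            pending.discard(s)
        for s in readable:
            if s not in conns:
                continue
            try:
                data = s.recv(4096)
            except OSError:
                data = b""
            if data:
                responses[s] += data               # progress is recorded in the fetch context
            else:                                  # transfer complete; handled without interruption
                s.close()
                conns.pop(s)
                pending.discard(s)
    return [body for body in responses.values() if body]
```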
Link extraction and normalization
• Goal: obtain a canonical form of each URL
• URL processing and filtering
    – avoid multiple fetches of pages known by different URLs
    – many IP addresses per host name
        · for load balancing on large sites
        · mirrored contents / contents on the same file system
    – “proxy pass”
        · mapping of different host names to a single IP address
        · needed to publish many logical sites
    – relative URLs
        · need to be interpreted w.r.t. a base URL
Canonical URL
Formed by (sketched below):
  • using a standard string for the protocol
  • canonicalizing the host name
  • adding an explicit port number
  • normalizing and cleaning up the path
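A sketch of these four steps with urllib.parse; the exact normalization rules (default ports, removal of "." and ".." path segments) are common conventions assumed here, not prescribed by the slides.

```python
from urllib.parse import urlsplit, urlunsplit

def canonical_url(url):
    parts = urlsplit(url)
    scheme = parts.scheme.lower() or "http"                           # standard protocol string
    host = (parts.hostname or "").lower()                             # canonical host name
    port = parts.port or {"http": 80, "https": 443}.get(scheme, 80)   # explicit port number
    segments = []
    for seg in (parts.path or "/").split("/"):                        # normalize and clean the path
        if seg in ("", "."):
            continue
        if seg == "..":
            if segments:
                segments.pop()
        else:
            segments.append(seg)
    path = "/" + "/".join(segments)
    return urlunsplit((scheme, f"{host}:{port}", path, parts.query, ""))

# canonical_url("HTTP://www.CSE.iitb.ac.in/a/./b/../c") -> "http://www.cse.iitb.ac.in:80/a/c"
```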
Robot exclusion
• Check
    – whether the server prohibits crawling a normalized URL
    – the robots.txt file in the HTTP root directory of the server specifies a list of path prefixes that crawlers should not attempt to fetch
• Meant for crawlers only (a sketch of the check follows)
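A sketch of the robots.txt check using Python's standard urllib.robotparser; the crawler name and the host are placeholders.

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.cse.iitb.ac.in/robots.txt")   # robots.txt sits at the server's HTTP root
rp.read()                                            # fetch and parse the path-prefix rules

def allowed(url, agent="MyCrawler"):
    """True if the server does not prohibit crawling this normalized URL."""
    return rp.can_fetch(agent, url)
```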
Eliminating already-visited URLs
• Checking if a URL has already been fetched
    – before adding a new URL to the work pool
    – needs to be very quick
    – achieved by computing an MD5 hash function on the URL
• Exploiting spatio-temporal locality of access
    – two-level hash function (sketched below)
        · most significant bits (say, 24) derived by hashing the host name plus port
        · lower-order bits (say, 40) derived by hashing the path
    – concatenated bits used as a key in a B-tree
• Qualifying URLs added to the frontier of the crawl
• Hash values added to the B-tree
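A sketch of the two-level key: 24 bits of an MD5 digest over host-plus-port concatenated with 40 bits over the path. The bit widths follow the slide; the exact packing is an illustrative choice. Because URLs from the same host share the high-order bits, their keys cluster together in the B-tree, which is what exploits locality of access.

```python
import hashlib
from urllib.parse import urlsplit

def url_key(url):
    """64-bit B-tree key: 24 bits from host+port, 40 bits from the path."""
    parts = urlsplit(url)
    host_port = f"{parts.hostname}:{parts.port or 80}".encode()
    path = (parts.path or "/").encode()
    host_bits = int.from_bytes(hashlib.md5(host_port).digest(), "big") >> (128 - 24)
    path_bits = int.from_bytes(hashlib.md5(path).digest(), "big") >> (128 - 40)
    key = (host_bits << 40) | path_bits        # URLs on the same host share a key prefix
    return key.to_bytes(8, "big")              # 24 + 40 = 64 bits
```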
Spider traps
• Protecting against crashing on
    – ill-formed HTML
        · e.g., a page with 68 kB of null characters
    – misleading sites
        · an indefinite number of pages dynamically generated by CGI scripts
        · paths of arbitrary depth created using soft directory links and path-remapping features in the HTTP server
Spider Traps: Solutions
• No automatic technique can be foolproof
• Check for URL length
• Guards
    – preparing regular crawl statistics
    – adding dominating sites to a guard module
    – disabling crawling of active content such as CGI form queries
    – eliminating URLs with non-textual data types
Avoiding repeated expansion of links on duplicate pages
• Reduce redundancy in crawls
• Duplicate detection
    – mirrored Web pages and sites
• Detecting exact duplicates
    – checking against MD5 digests of stored URLs
    – representing a relative link v (relative to aliases u1 and u2) as tuples (h(u1), v) and (h(u2), v)
• Detecting near-duplicates
    – even a single altered character will completely change the digest!
        · e.g., date of update, or name and e-mail of the site administrator
    – solution: shingling (sketched below)
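Shingling is only named on the slide; the sketch below shows the basic idea: compare sets of w-word shingles with the Jaccard measure, so that a changed date or administrator e-mail barely moves the score. The window size and threshold are illustrative choices.

```python
def shingles(text, w=4):
    """Set of w-word shingles (contiguous word windows) of a page's text."""
    words = text.split()
    return {" ".join(words[i:i + w]) for i in range(max(len(words) - w + 1, 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def near_duplicate(text1, text2, threshold=0.9):
    """Near-duplicates overlap heavily even if a date or e-mail address differs."""
    return jaccard(shingles(text1), shingles(text2)) >= threshold
```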
Load monitor
• Keeps track of various system statistics
    – recent performance of the wide area network (WAN) connection, e.g., latency and bandwidth estimates
    – operator-provided/estimated upper bound on open sockets for the crawler
    – current number of active sockets
Thread manager
• Responsible for
    – choosing units of work from the frontier
    – scheduling the issue of network requests
    – distributing these requests over multiple ISPs if appropriate
• Uses statistics from the load monitor
Per-server work queues
• Denial-of-service (DoS) defenses
    – servers limit the speed or frequency of responses to any fixed client IP address
• Avoiding looking like a DoS attack
    – limit the number of active requests to a given server IP address at any time
    – maintain a queue of requests for each server (sketched below)
        · use the HTTP/1.1 persistent socket capability
    – distribute attention relatively evenly between a large number of sites
• Access locality vs. politeness dilemma
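A sketch of per-server request queues with a politeness delay; the two-second gap and the round-robin scan over servers are illustrative choices, not the book's scheduler.

```python
import time
from collections import defaultdict, deque

class PerServerQueues:
    """One FIFO of pending requests per server IP; at most one hit per politeness interval."""
    def __init__(self, min_delay=2.0):
        self.queues = defaultdict(deque)       # server IP -> queue of URLs
        self.next_allowed = defaultdict(float) # earliest time the server may be contacted again
        self.min_delay = min_delay             # politeness gap between hits on one server

    def add(self, server_ip, url):
        self.queues[server_ip].append(url)

    def next_url(self):
        """Pick a URL from some server whose politeness delay has expired."""
        now = time.monotonic()
        for ip, q in self.queues.items():      # spreads attention across many sites
            if q and now >= self.next_allowed[ip]:
                self.next_allowed[ip] = now + self.min_delay
                return q.popleft()
        return None                            # nothing is currently eligible
```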
Text repository
• Crawler’s last task
    – dumping fetched pages into a repository
• Decoupling the crawler from other functions is preferred for efficiency and reliability
• Page-related information stored in two parts
    – meta-data
    – page contents
Storage of page-related information
• Meta-data
    – relational in nature
        · usually managed by custom software to avoid relational database system overheads
        · text indexing involves bulk updates
    – includes fields like content-type, last-modified date, content-length, HTTP status code, etc.
Page contents storage
• A typical HTML Web page compresses to 2–4 kB (using zlib)
• File systems have a 4–8 kB file block size
    – too large!
• Page storage managed by a custom storage manager
    – provides simple access methods for
        · the crawler to add pages
        · subsequent programs (the indexer, etc.) to retrieve documents
Page Storage
• Small-scale systems
    – repository fits within the disks of a single machine
    – use of a storage manager (e.g., Berkeley DB; sketched below)
        · manages disk-based databases within a single file
        · configuration as a hash-table/B-tree keyed by URL, to handle ordered access to pages
        · configuration as a sequential log of page records, since the indexer can handle pages in any order
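A sketch of such a storage manager, using Python's standard dbm module in place of Berkeley DB (an assumption made for portability) and storing zlib-compressed pages keyed by URL.

```python
import dbm
import zlib

class PageStore:
    """Single-file, disk-based store of compressed pages keyed by URL."""
    def __init__(self, path="pages.db"):
        self.db = dbm.open(path, "c")        # hash-table-style access by URL key

    def add(self, url, html):
        self.db[url.encode()] = zlib.compress(html.encode())   # ~2-4 kB per typical page

    def get(self, url):
        return zlib.decompress(self.db[url.encode()]).decode()

    def close(self):
        self.db.close()
```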
Page Storage
• Large-scale systems
    – repository distributed over a number of storage servers
    – storage servers
        · connected to the crawler through a fast local network (e.g., Ethernet)
        · hashed by URLs
    – ‘T3’-grade leased lines to handle 10 million pages (40 GB) per hour
Large-scale crawlers often use multiple ISPs and a bank of local storage servers to store the pages crawled.
Refreshing crawled pages
• A search engine's index should be fresh
• A Web-scale crawler never ‘completes’ its job
• High variance in the rate of page changes
• “If-Modified-Since” request header in the HTTP protocol (see the sketch below)
• Impractical for a crawler
• Solution
    – at the commencement of a new crawling round, estimate which pages have changed
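A sketch of a conditional re-fetch with the If-Modified-Since header; urllib raises HTTPError for a 304 reply, which here simply means the cached copy is still fresh.

```python
import urllib.request
from urllib.error import HTTPError

def refetch_if_changed(url, last_fetch_http_date):
    """Return the new page body, or None if the server reports 304 Not Modified."""
    request = urllib.request.Request(url, headers={"If-Modified-Since": last_fetch_http_date})
    try:
        with urllib.request.urlopen(request, timeout=10) as resp:
            return resp.read()
    except HTTPError as err:
        if err.code == 304:              # unchanged since the last crawl round
            return None
        raise

# Example date format: "Sat, 29 Oct 1994 19:43:31 GMT"
```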
Determining page changes
• “Expires” HTTP response header
    – for pages that come with an expiry date
• Otherwise, need to guess whether revisiting the page will yield a modified version
    – maintain a score reflecting the probability that the page has been modified
    – crawler fetches URLs in decreasing order of score
    – assumption: the recent past predicts the future
Estimating page change rates
• Brewington and Cybenko; Cho
    – algorithms for maintaining a crawl in which most pages are fresher than a specified epoch
• Prerequisite
    – the average interval at which the crawler checks for changes is smaller than the inter-modification time of a page
• Small-scale intermediate crawler runs
    – to monitor fast-changing sites, e.g., current news, weather, etc.
    – intermediate indices patched into the master index
Putting together a crawler
• Reference implementation of the HTTP client protocol
    – World Wide Web Consortium (http://www.w3c.org/)
    – w3c-libwww package
Design of the core components: Crawler class
• Copies bytes from network sockets to storage media
• Three methods express the Crawler's contract with the user (skeleton below)
    – pushing a URL to be fetched to the Crawler (fetchPush)
    – a termination callback handler (fetchDone) called with the same URL
    – a method (start) which starts the Crawler's event loop
• Implementation of the Crawler class
    – needs two helper classes called DNS and Fetch
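A skeleton of that contract using the three method names from the slide (fetchPush, fetchDone, start); the bodies are placeholders, not the book's w3c-libwww-based implementation, and the DNS and Fetch helper classes are omitted.

```python
class Crawler:
    """Copies bytes from network sockets to storage media."""

    def __init__(self):
        self.frontier = []                  # URLs pushed by the user, waiting to be fetched

    def fetchPush(self, url):
        """User pushes a URL to be fetched."""
        self.frontier.append(url)

    def fetchDone(self, url, page):
        """Termination callback, invoked with the same URL once its fetch completes.
        Override to store the page, extract links, and push them via fetchPush."""
        pass

    def start(self):
        """Event loop: would issue DNS lookups and fetches via the helper DNS and
        Fetch classes (not shown) and invoke fetchDone as pages complete."""
        while self.frontier:
            url = self.frontier.pop(0)
            page = b""                      # placeholder for the actual fetch
            self.fetchDone(url, page)
```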