4 Web Crawler.pptx

Web Crawling
Submitted By:
Vijay Upadhyay

Beginning
A key motivation for designing Web crawlers
has been to retrieve Web pages and add their
representations to a local repository.

What is Web crawling?
What are the uses of Web crawling?
How are Web APIs used?

Web Crawling: -
• A Web crawler (also known as a Web spider, Web
robot, or, especially in the FOAF community, a Web
scutter) is a program or automated script that
browses the World Wide Web in a methodical,
automated manner.
• Other less frequently used names for Web
crawlers are ants, automatic indexers, bots,
and worms.

What Crawlers are:-
• Crawlers are computer programs that roam
the Web with the goal of automating specific
tasks related to the Web.
• The role of Crawlers is to collect Web Content.

Basic crawler operation:-
• Begin with known “seed” pages
• Fetch and parse them
• Extract URLs they point to
• Place the extracted URLs on a Queue
• Fetch each URL on the queue and repeat
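
The steps above can be sketched as a short breadth-first loop. This is only an illustration under stated assumptions: `FAKE_WEB`, `fetch_links`, and `crawl` are hypothetical names, and an in-memory dictionary stands in for real HTTP fetching and HTML parsing so the example runs without network access.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the URLs its page links to.
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}


def fetch_links(url):
    """Stand-in for 'fetch and parse': return the URLs a page points to."""
    return FAKE_WEB.get(url, [])


def crawl(seeds, max_pages=100):
    """Crawl breadth-first from the seed pages; return URLs in fetch order."""
    queue = deque(seeds)   # URLs waiting to be fetched
    seen = set(seeds)      # URLs already queued, so none is fetched twice
    fetched = []
    while queue and len(fetched) < max_pages:
        url = queue.popleft()
        fetched.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched


order = crawl(["http://example.com/"])
print(order)
```

The `seen` set is what keeps the loop from cycling when pages link back to each other, as the seed page and page `b` do here.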

Traditional Web Crawler
[Figure: crawler loop. Components: Seed URLs, Frontier, Visited URLs, Web, Repo. Steps: Init, Download resource, Extract URLs.]

Beginning with a Web Crawler:
The basic algorithm:
{
Pick up the next URL
Connect to the server
GET the URL
When the page arrives, get its links
(optionally do other stuff)
REPEAT
}
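
The "get its links" step could look roughly like the sketch below, which uses Python's standard `html.parser` and `urllib.parse.urljoin`. The class name `LinkExtractor`, the helper `extract_links`, and the sample HTML are assumptions made for illustration; a production crawler would also need to handle malformed markup, `<base>` tags, and non-HTTP links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Turn relative links into absolute URLs.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content, used only to exercise the extractor.
sample = (
    '<p><a href="/about">About</a> '
    '<a href="news.html">News</a> '
    '<a href="http://other.example/x">X</a></p>'
)
links = extract_links(sample, "http://example.com/docs/index.html")
print(links)
```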

Uses for crawling:-
• Complete web search engine
Search Engine = Crawler + Indexer/Searcher (Lucene) + GUI
–Find stuff
–Gather stuff
–Check stuff

Several Types of Crawlers:
• Batch Crawlers: Crawl a snapshot of their crawl
space until reaching a certain size or time limit.
• Incremental Crawlers: Continuously crawl their
crawl space, revisiting URLs to ensure freshness.
• Focused Crawlers: Attempt to crawl pages
pertaining to some topic/theme, while
minimizing the number of off-topic pages that are
collected.
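
One way to picture a focused crawler is as a crawler whose queue is ordered by estimated topical relevance. The sketch below assumes a deliberately simple score (the fraction of words in a link's anchor text that match a topic word list); real focused crawlers typically use trained classifiers instead. `TOPIC_TERMS`, `relevance`, and `Frontier` are hypothetical names.

```python
import heapq

# Hypothetical topic vocabulary; a real system would learn this from data.
TOPIC_TERMS = {"crawler", "spider", "index"}


def relevance(text):
    """Fraction of words in `text` that are topic terms (0.0 if empty)."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.strip(".,;:!?") in TOPIC_TERMS)
    return hits / len(words)


class Frontier:
    """Priority queue of URLs: the highest-scoring URL is popped first."""

    def __init__(self):
        self._heap = []
        self._count = 0  # tie-breaker that preserves insertion order

    def push(self, url, score):
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(self._heap, (-score, self._count, url))
        self._count += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


frontier = Frontier()
frontier.push("http://example.com/recipes", relevance("cooking recipes"))
frontier.push("http://example.com/crawl", relevance("web crawler index"))
frontier.push("http://example.com/story", relevance("a spider story today"))
visit_order = [frontier.pop() for _ in range(len(frontier))]
print(visit_order)
```

A batch crawler could use the same structure with a page or time limit on the loop, and an incremental crawler could re-push URLs after each visit.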

URL normalization
• Crawlers usually perform some type of URL
normalization in order to avoid crawling the
same resource more than once. The term URL
normalization refers to the process of modifying
and standardizing a URL in a consistent manner.
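
As a rough illustration of what such normalization can involve, the sketch below applies a few common rules with Python's `urllib.parse`: lowercasing the scheme and host, dropping default ports, using "/" for an empty path, and removing the fragment. The function name `normalize_url` and the choice of rules are assumptions; the slides do not say which rules a given crawler applies.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Return a canonical form of `url` under a small set of assumed rules.

    Simplifications: user-info (user:pass@) is dropped and IPv6 hosts are
    not handled, since this is only a sketch.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    netloc = host
    # Keep the port only when it is not the scheme's default.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"
    path = parts.path or "/"
    # The fragment never reaches the server, so it is discarded.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


print(normalize_url("HTTP://Example.COM:80/a/b?x=1#top"))
print(normalize_url("http://example.com"))
```

With rules like these, several spellings of the same address map to one string, so a visited-URL set can detect the duplicate before the page is fetched again.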

The challenges of “Web Crawling”:-
There are three important characteristics of the
Web that make crawling it very difficult:
• Its large volume
• Its fast rate of change
• Dynamic page generation

Examples of Web crawlers
• RBSE
• World Wide Web Worm
• Google Crawler
• WebFountain
• WebRACE

Web 3.0 Crawling
Web 3.0 defines advanced technologies and new
principles for next-generation search
technologies, summarized in the concepts of
- Semantic Web
- Website Parse Template
Web 3.0 crawling and indexing technologies will be
based on
- Human-machine clever associations

© 2005 Denise M. Gosnell. All Rights Reserved.
How are Web APIs used?
What Is a Web API?
• A series or collection of web services
• Sometimes used interchangeably with “web
services”
• Examples: Google API, Amazon.com APIs

How Do You Call a Web API?
XML web services can be invoked in one of three
ways:
• Using REST (HTTP-GET)
- URL includes parameters
- Example: “http://search.twitter.com/search.atom?q=”
• Using HTTP-POST
- You post an XML document
- An XML document is returned
• Using SOAP
- More complex; allows structured and type information
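
The REST (HTTP-GET) style can be sketched as two steps: put the parameters into the URL, then parse the XML that comes back. The endpoint `http://api.example.com/search`, its parameters, and the response layout below are all hypothetical, and the response is a hard-coded string so the example runs without a network call.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def build_search_url(base_url, **params):
    """Append URL-encoded query parameters to a base URL."""
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and parameter names.
url = build_search_url("http://api.example.com/search", q="web crawler", count=2)
print(url)

# Hypothetical XML document standing in for what the service might return.
SAMPLE_RESPONSE = """<results>
  <item><title>First hit</title></item>
  <item><title>Second hit</title></item>
</results>"""


def parse_titles(xml_text):
    """Extract the text of each <title> under <item> from the response."""
    root = ET.fromstring(xml_text)
    return [item.findtext("title") for item in root.findall("item")]


titles = parse_titles(SAMPLE_RESPONSE)
print(titles)
```

In a real client the URL would be fetched (for example with `urllib.request.urlopen`) and the response body passed to the parser; the HTTP-POST style differs mainly in sending the XML document in the request body instead of encoding parameters in the URL.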

APIs that deliver information
[Figure. Components: Web Crawling and Indexing, Web API, App. Labels: Keywords (Recession, slump); Structured Queries (Recession, 22Nov’08, NY); XML Documents (Recession, slump).]
