Web Search Overview & Crawling
By
SATHISHKUMAR G
(sathishsak111@gmail.com)
Web Search and Mining
[Figure: search results page showing algorithmic results and paid search ads]
Search and Information Retrieval
 Search on the Web is a daily activity for many people
throughout the world
 Search and communication are the most popular uses of
the computer
 Applications involving search are everywhere
 The field of computer science that is most involved
with R&D for search is information retrieval (IR)
Information Retrieval
 “Information retrieval is a field concerned with the
structure, analysis, organization, storage, searching,
and retrieval of information.” (Salton, 1968)
 General definition that can be applied to many types
of information and search applications
 Primary focus of IR since the 1950s has been on text
and documents
What is a Document?
 Examples:
 web pages, email, books, news stories, scholarly papers,
text messages, Word™, PowerPoint™, PDF, forum postings,
patents, etc.
 Common properties
 Significant text content
 Some structure (e.g., title, author, date for papers;
subject, sender, destination for email)
Documents vs. Database Records
 Database records (or tuples in relational databases)
are typically made up of well-defined fields (or
attributes)
 e.g., bank records with account numbers, balances,
names, addresses, social security numbers, dates of birth,
etc.
 Easy to compare fields with well-defined semantics
to queries in order to find matches
 Text is more difficult
Documents vs. Records
 Example bank database query
 Find records with balance > $50,000 in branches located in
Amherst, MA.
 Matches easily found by comparison with field values of
records
 Example search engine query
 bank scandals in western mass
 This text must be compared to the text of entire news
stories
Comparing Text
 Comparing the query text to the document text and
determining what is a good match is the core issue of
information retrieval
 Exact matching of words is not enough
 Many different ways to write the same thing in a “natural
language” like English
 e.g., does a news story containing the text “bank director
in Amherst steals funds” match the query?
 Some stories will be better matches than others
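A minimal sketch of why exact matching is not enough, using the query and the story text from the slides. The helper names and the word-overlap score are illustrative assumptions, not a retrieval model from the deck.

```python
def exact_match(query: str, document: str) -> bool:
    """True only if the whole query string appears verbatim in the document."""
    return query.lower() in document.lower()


def term_overlap(query: str, document: str) -> float:
    """Fraction of distinct query words that also occur in the document."""
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & doc_terms) / len(query_terms)


query = "bank scandals in western mass"
story = "bank director in Amherst steals funds"

print(exact_match(query, story))   # False
print(term_overlap(query, story))  # 0.4
```

The story is plausibly relevant, yet exact matching rejects it and only two of the five query words overlap. A graded score at least lets some stories rank as better matches than others.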
Dimensions of IR
 IR is more than just text, and more than just web
search
 although these are central
 People doing IR work with different media, different
types of search applications, and different tasks
Other Media
 New applications increasingly involve new media
 e.g., video, photos, music, speech
 Like text, content is difficult to describe and compare
 text may be used to represent them (e.g. tags)
 IR approaches to search and evaluation are
appropriate
Dimensions of IR
 Content: text, images, video, scanned docs, audio, music
 Applications: web search, vertical search, enterprise search, desktop search, forum search, P2P search, literature search
 Tasks: ad hoc search, filtering, classification, question answering
IR Tasks
 Ad-hoc search
 Find relevant documents for an arbitrary text query
 Filtering
 Identify relevant user profiles for a new document
 Classification
 Identify relevant labels for documents
 Question answering
 Give a specific answer to a question
Big Issues in IR
 Relevance
 What is it?
 Simple (and simplistic) definition:
A relevant document contains the information that a
person was looking for when they submitted a query to
the search engine
 Many factors influence a person’s decision about what is
relevant: e.g., task, context, novelty, style
 Topical relevance (same topic) vs. user relevance
(everything else)
Big Issues in IR
 Relevance
 Retrieval models define a view of relevance
 Ranking algorithms used in search engines are based on
retrieval models
 Most models describe statistical properties of text rather
than linguistic properties
 i.e., counting simple text features such as words instead of parsing
and analyzing the sentences
 Statistical approach to text processing started with Luhn in the 1950s
 Linguistic features can be part of a statistical model
Big Issues in IR
 Evaluation
 Experimental procedures and measures for comparing
system output with user expectations
 Originated in Cranfield experiments in the 1960s
 Typically use test collection of documents, queries, and
relevance judgments
 Most commonly used are TREC collections
 Recall and precision are two examples of effectiveness
measures
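A small sketch of the two effectiveness measures named above, computed for a single query. The document identifiers and the relevance judgments are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / documents retrieved
    recall    = relevant documents retrieved / relevant documents
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


retrieved = ["d1", "d2", "d3", "d4"]   # hypothetical system output
relevant = ["d2", "d4", "d7"]          # hypothetical relevance judgments

p, r = precision_recall(retrieved, relevant)
print(p)  # 0.5
print(r)
```

Two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall 2/3).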
Big Issues in IR
 Users and Information Needs
 Search evaluation is user-centered
 Keyword queries are often poor descriptions of actual
information needs
 Interaction and context are important for understanding
user intent
 Query refinement techniques such as query expansion,
query suggestion, and relevance feedback improve ranking
IR and Search Engines
 A search engine is the practical application of
information retrieval techniques to large scale text
collections
 Web search engines are the best-known examples, but
there are many others
 Open source search engines are important for research
and development
 e.g., Lucene, Lemur/Indri, Galago
 Big issues include main IR issues but also some
others
IR and Search Engines
Information Retrieval
 Relevance: effective ranking
 Evaluation: testing and measuring
 Information needs: user interaction
Search Engines
 Performance: efficient search and indexing
 Incorporating new data: coverage and freshness
 Scalability: growing with data and users
 Adaptability: tuning for applications
 Specific problems: e.g., spam
Search Engine Issues
 Performance
 Measuring and improving the efficiency of search
 e.g., reducing response time, increasing query throughput,
increasing indexing speed
 Indexes are data structures designed to improve search
efficiency
 Designing and implementing them is a major issue for search
engines
Search Engine Issues
 Dynamic data (Incorporating new data)
 The “collection” for most real applications is constantly
changing in terms of updates, additions, deletions
 e.g., web pages
 Acquiring or “crawling” the documents is a major task
 Typical measures are coverage (how much has been indexed) and
freshness (how recently was it indexed)
 Updating the indexes while processing queries is also a
design issue
Search Engine Issues
 Scalability
 Making everything work with millions of users every day,
and many terabytes of documents
 Distributed processing is essential
 Adaptability
 Changing and tuning search engine components such as
ranking algorithm, indexing strategy, interface for
different applications
Search Engine Issues
 Spam
 For Web search, spam in all its forms is one of the major
issues
 Affects the efficiency of search engines and, more
seriously, the effectiveness of the results
 Many types of spam
 e.g. spamdexing or term spam, link spam, “optimization”
 New subfield called adversarial IR, since spammers are
“adversaries” with different goals
Architecture of SE
How do search engines like Google work?
Architecture
[Figure: search engine architecture. Components shown: the Web, web spider, indexer, indexes, ad indexes, search, and user. The example results page for the query "miele" shows algorithmic results alongside sponsored links.]
Indexing Process
 Text acquisition
 identifies and stores documents for indexing
 Text transformation
 transforms documents into index terms or features
 Index creation
 takes index terms and creates data structures (indexes) to
support fast searching
Query Process
 User interaction
 supports creation and refinement of query, display of
results
 Ranking
 uses query and indexes to generate ranked list of
documents
 Evaluation
 monitors and measures effectiveness and efficiency
(primarily offline)
Details: Text Acquisition
 Crawler
 Identifies and acquires documents for search engine
 Many types – web, enterprise, desktop
 Web crawlers follow links to find documents
 Must efficiently find huge numbers of web pages (coverage) and
keep them up-to-date (freshness)
 Single site crawlers for site search
 Topical or focused crawlers for vertical search
 Document crawlers for enterprise and desktop search
 Follow links and scan directories
Web Crawler
 Starts with a set of seeds, which are a set of URLs given to it
as parameters
 Seeds are added to a URL request queue
 Crawler starts fetching pages from the request queue
 Downloaded pages are parsed to find link tags that might
contain other useful URLs to fetch
 New URLs added to the crawler’s request queue, or frontier
 Continues until there are no more new URLs or the disk is full
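The crawl loop described on this slide can be sketched as follows. This is a simplified illustration, not a production crawler: the `fetch` callable, the `max_pages` limit (standing in for "disk full"), and the in-memory `FAKE_WEB` are assumptions made so the example runs without network access. It also omits politeness delays and robots.txt handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href values of <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl starting from the seed URLs.

    fetch(url) returns the page HTML as a string, or None on failure.
    """
    frontier = deque(seeds)   # the URL request queue
    seen = set(seeds)         # URLs already queued, to avoid refetching
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)   # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages


# Hypothetical three-page site used in place of real HTTP requests.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a>',
    "http://example.test/b": '<a href="/">home</a> <a href="/missing">gone</a>',
}

pages = crawl(["http://example.test/"], FAKE_WEB.get)
print(sorted(pages))
```

The `seen` set is what keeps the crawler from looping on pages that link back to each other.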
Crawling picture
[Figure: the Web, showing seed pages, URLs crawled and parsed, the URL frontier, and the unseen Web.]
Crawling the Web
Text Acquisition
 Feeds
 Real-time streams of documents
 e.g., web feeds for news, blogs, video, radio, tv
 RSS is a common standard
 RSS “reader” can provide new XML documents to search engine
 Conversion
 Convert variety of documents into a consistent text plus
metadata format
 e.g. HTML, XML, Word, PDF, etc. → XML
 Convert text encoding for different languages
 Using a Unicode encoding such as UTF-8
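A minimal sketch of the encoding-conversion step, assuming the source encoding of each document is already known (in practice it usually has to be detected or read from metadata). The function name is hypothetical.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known legacy encoding and re-encode them as UTF-8."""
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


# A word with a non-ASCII character, stored in the single-byte Latin-1 encoding.
latin1_bytes = "Österreich".encode("latin-1")
utf8_bytes = to_utf8(latin1_bytes, "latin-1")

print(len(latin1_bytes), len(utf8_bytes))  # 10 11
print(utf8_bytes.decode("utf-8"))          # Österreich
```

The text is unchanged, but the byte representation differs: "Ö" takes one byte in Latin-1 and two in UTF-8.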
Text Acquisition
 Document data store
 Stores text, metadata, and other related content for
documents
 Metadata is information about a document, such as type and
creation date
 Other content includes links, anchor text
 Provides fast access to document contents for search
engine components
 e.g. result list generation
 Could use relational database system
 More typically, a simpler, more efficient storage system is used
due to huge numbers of documents
Text Transformation
 Parser
 Processing the sequence of text tokens in the document to
recognize structural elements
 e.g., titles, links, headings, etc.
 Tokenizer recognizes “words” in the text
 must consider issues like capitalization, hyphens, apostrophes,
non-alpha characters, separators
 Markup languages such as HTML, XML often used to specify
structure
 Tags used to specify document elements
 E.g., <h2> Overview </h2>
 Document parser uses syntax of markup language (or other
formatting) to identify structure
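A rough sketch of the two roles described above: a parser that uses HTML tags to pick out structural elements, and a tokenizer that splits text into "words". The class name, the set of tags treated as structural, and the tokenization rule (lowercase, keep apostrophes, split on everything else) are illustrative choices, not the only reasonable ones.

```python
import re
from html.parser import HTMLParser


class HeadingParser(HTMLParser):
    """Records the text found inside <title> and <h1>..<h6> elements."""

    STRUCTURAL = {"title", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.current = None   # structural tag we are currently inside, if any
        self.elements = []    # list of (tag, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in self.STRUCTURAL:
            self.current = tag

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        if self.current and data.strip():
            self.elements.append((self.current, data.strip()))


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


parser = HeadingParser()
parser.feed("<h2> Overview </h2><p>Search-engine basics, isn't it?</p>")

print(parser.elements)
# [('h2', 'Overview')]
print(tokenize("Search-engine basics, isn't it?"))
# ['search', 'engine', 'basics', "isn't", 'it']
```

Note how the hyphen splits "Search-engine" into two tokens while the apostrophe keeps "isn't" together. These are exactly the kinds of decisions the tokenizer has to make.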
Text Transformation
 Stopping
 Remove common words
 e.g., “and”, “or”, “the”, “in”
 Some impact on efficiency and effectiveness
 Can be a problem for some queries
 Stemming
 Group words derived from a common stem
 e.g., “computer”, “computers”, “computing”, “compute”
 Usually effective, but not for all queries
 Benefits vary for different languages
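A toy sketch of stopping and stemming, using the example words from the slide. The suffix-stripping rule is a deliberately crude stand-in and is not the Porter stemmer or any other published algorithm. The stopword list is just the four words given above.

```python
STOPWORDS = {"and", "or", "the", "in"}

# Checked in order, so longer suffixes are tried before shorter ones.
SUFFIXES = ("ing", "ers", "er", "es", "s", "e")


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print(remove_stopwords(["the", "bank", "in", "amherst"]))
# ['bank', 'amherst']
print([crude_stem(w) for w in ["computer", "computers", "computing", "compute"]])
# ['comput', 'comput', 'comput', 'comput']
```

All four words collapse to the same stem, so a query for one can match documents containing any of them. The same crude rule would also conflate unrelated words, which is one reason stemming does not help every query.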
Text Transformation
 Link Analysis
 Makes use of links and anchor text in web pages
 Link analysis identifies popularity and community
information
 e.g., PageRank
 Anchor text can significantly enhance the representation
of pages pointed to by links
 Significant impact on web search
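The slide names PageRank only as an example. The following is a minimal power-iteration sketch of the commonly described formulation, not the version used by any particular search engine. The damping value 0.85, the iteration count, and the three-page link graph are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank scores.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank equally among its out-links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no out-links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}   # hypothetical link graph
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C is linked to by both A and B and ends up with the highest score. B, which only receives half of A's rank, ends up lowest.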
Text Transformation
 Information Extraction
 Identify classes of index terms that are important for some
applications
 e.g., named entity recognizers identify classes such as
people, locations, companies, dates, etc.
 Classifier
 Identifies class-related metadata for documents
 i.e., assigns labels to documents
 e.g., topics, reading levels, sentiment, genre
 Use depends on application
Index Creation
 Document Statistics
 Gathers counts and positions of words and other features
 Used in ranking algorithm
 Weighting
 Computes weights for index terms
 Used in ranking algorithm
 e.g., tf.idf weight
 Combination of term frequency in document and inverse
document frequency in the collection
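One simple way to compute a tf.idf weight is sketched below: raw term count in the document multiplied by log(N / df), where N is the number of documents and df the number of documents containing the term. Many other variants exist and the slides do not commit to one. The three tiny documents are invented.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs maps doc_id -> list of tokens. Returns doc_id -> {term: weight}."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))   # count each term once per document
    weights = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)       # term frequency within this document
        weights[doc_id] = {
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights


docs = {
    "d1": ["bank", "director", "steals", "funds"],
    "d2": ["bank", "opens", "branch"],
    "d3": ["river", "bank", "floods"],
}
w = tf_idf(docs)
print(w["d1"]["bank"])      # 0.0
print(w["d1"]["director"])
```

"bank" occurs in every document, so its inverse document frequency is log(1) = 0 and it contributes nothing to ranking. "director" occurs in only one document and gets weight log(3).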
Index Creation
 Inversion
 Core of indexing process
 Converts document-term information to term-document
for indexing
 Difficult for very large numbers of documents
 Format of inverted file is designed for fast query
processing
 Must also handle updates
 Compression used for efficiency
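Inversion can be sketched in a few lines for a collection that fits in memory. This ignores the hard parts listed above (scale, updates, compression). The posting format, a sorted list of (document id, positions) pairs per term, is one illustrative choice.

```python
from collections import defaultdict


def invert(docs):
    """Turn doc_id -> tokens into term -> sorted [(doc_id, [positions])]."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(position)
    return {term: sorted(postings.items()) for term, postings in index.items()}


docs = {
    1: ["bank", "director", "steals", "funds"],
    2: ["bank", "funds", "new", "bank"],
}
index = invert(docs)
print(index["bank"])   # [(1, [0]), (2, [0, 3])]
print(index["funds"])  # [(1, [3]), (2, [1])]
```

Looking up a term now touches only its own posting list instead of scanning every document, which is what makes query processing fast.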
Index Creation
 Index Distribution
 Distributes indexes across multiple computers and/or
multiple sites
 Essential for fast query processing with large numbers of
documents
 Many variations
 Document distribution, term distribution, replication
 P2P and distributed IR involve search across multiple sites
User Interaction
 Query input
 Provides interface and parser for query language
 Most web queries are very simple, other applications may
use forms
 Query language used to describe more complex queries
and results of query transformation
 e.g., Boolean queries
 similar to SQL language used in database applications
 IR query languages also allow content and structure specifications,
but focus on content
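A minimal sketch of evaluating Boolean queries against an index, with posting lists represented as Python sets of document ids. The index contents and function names are invented for illustration. A real query language would also need a parser for expressions typed by the user.

```python
# Hypothetical index: term -> set of ids of documents containing the term.
INDEX = {
    "bank": {1, 2, 4},
    "scandal": {2, 3},
    "amherst": {2, 4},
}


def docs_for(index, term):
    """Posting set for a term, empty if the term is not indexed."""
    return index.get(term, set())


def boolean_and(index, terms):
    """Documents containing every term."""
    postings = [docs_for(index, t) for t in terms]
    return set.intersection(*postings) if postings else set()


def boolean_or(index, terms):
    """Documents containing at least one term."""
    return set().union(*(docs_for(index, t) for t in terms))


def boolean_and_not(index, term, excluded):
    """Documents containing term but not excluded."""
    return docs_for(index, term) - docs_for(index, excluded)


print(sorted(boolean_and(INDEX, ["bank", "amherst"])))    # [2, 4]
print(sorted(boolean_or(INDEX, ["scandal", "amherst"])))  # [2, 3, 4]
print(sorted(boolean_and_not(INDEX, "bank", "scandal")))  # [1, 4]
```

Boolean retrieval returns an unranked set: a document either satisfies the expression or it does not.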
User Interaction
 Query transformation
 Improves initial query, both before and after initial search
 Includes text transformation techniques used for
documents
 Spell checking and query suggestion provide alternatives
to original query
 Query expansion and relevance feedback modify the
original query with additional terms
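A small sketch of two of the query transformations mentioned above. Spelling suggestion here uses `difflib.get_close_matches` from the Python standard library against a tiny vocabulary, and query expansion uses a hand-written table of related terms. Both the vocabulary and the expansion table are hypothetical. Real systems typically derive them from the collection and from query logs.

```python
import difflib

VOCABULARY = ["retrieval", "relevance", "ranking", "crawler", "index"]
EXPANSIONS = {"scandal": ["fraud", "theft"]}


def suggest(term, vocabulary):
    """Return the closest vocabulary word, or the term itself if none is close."""
    if term in vocabulary:
        return term
    matches = difflib.get_close_matches(term, vocabulary, n=1, cutoff=0.6)
    return matches[0] if matches else term


def expand(terms, expansions):
    """Append related terms to the query, keeping the original terms first."""
    expanded = list(terms)
    for term in terms:
        for extra in expansions.get(term, []):
            if extra not in expanded:
                expanded.append(extra)
    return expanded


print(suggest("retreival", VOCABULARY))
print(expand(["bank", "scandal"], EXPANSIONS))
# ['bank', 'scandal', 'fraud', 'theft']
```

Suggestion offers an alternative to the original query, while expansion modifies the query by adding terms.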
User Interaction
 Results output
 Constructs the display of ranked documents for a query
 Generates snippets to show how queries match
documents
 Highlights important words and passages
 Retrieves appropriate advertising in many applications
 May provide clustering and other visualization tools
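Snippet generation with highlighting can be sketched as picking a window of words around the first query-term hit and marking every query term inside it. The window size, the asterisk markers, and the sample sentence are illustrative assumptions. Production systems usually score several candidate passages instead of taking the first hit.

```python
def make_snippet(text, query_terms, window=3):
    """Return a short excerpt around the first query term, with hits marked."""
    words = text.split()
    terms = {t.lower() for t in query_terms}

    def norm(word):
        # Compare words without surrounding punctuation or case.
        return word.strip(".,;:!?\"'").lower()

    hit = next((i for i, w in enumerate(words) if norm(w) in terms), None)
    if hit is None:
        # No query term found: fall back to the start of the text.
        return " ".join(words[: 2 * window + 1])
    start = max(0, hit - window)
    end = min(len(words), hit + window + 1)
    marked = [f"*{w}*" if norm(w) in terms else w for w in words[start:end]]
    prefix = "... " if start > 0 else ""
    suffix = " ..." if end < len(words) else ""
    return prefix + " ".join(marked) + suffix


text = "Police said the bank director in Amherst was arrested after funds went missing."
print(make_snippet(text, ["bank", "amherst"]))
# Police said the *bank* director in *Amherst* ...
```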
Ranking
 Scoring
 Calculates scores for documents using a ranking algorithm
 Core component of search engine
 Basic form of score is $\sum_i q_i d_i$
 $q_i$ and $d_i$ are query and document term weights for term $i$
 Many variations of ranking algorithms and retrieval
models
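The basic score, a sum over terms of query weight times document weight, can be written directly. The weights below are invented. In practice they would come from a weighting scheme such as tf.idf.

```python
def score(query_weights, doc_weights):
    """Sum of q_i * d_i over the query terms; missing terms contribute 0."""
    return sum(w * doc_weights.get(term, 0.0) for term, w in query_weights.items())


def rank(query_weights, docs):
    """Return (score, doc_id) pairs, best score first, ties broken by doc_id."""
    scored = [(score(query_weights, dw), doc_id) for doc_id, dw in docs.items()]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


query = {"bank": 1.0, "scandal": 2.0}
docs = {
    "d1": {"bank": 0.5, "scandal": 1.0},
    "d2": {"bank": 1.5},
    "d3": {"river": 2.0},
}
print(rank(query, docs))
# [(2.5, 'd1'), (1.5, 'd2'), (0.0, 'd3')]
```

Only terms that appear in both the query and the document contribute, so the sum can be computed from the posting lists of the query terms alone.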
Ranking
 Performance optimization
 Designing ranking algorithms for efficient processing
 Term-at-a-time vs. document-at-a-time processing
 Safe vs. unsafe optimizations
 Distribution
 Processing queries in a distributed environment
 Query broker distributes queries and assembles results
 Caching is a form of distributed searching
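A simplified contrast of the two processing orders named above, assuming an index that maps each term to a list of (document id, weight) postings. Term-at-a-time finishes one posting list before starting the next and keeps partial scores in accumulators. Document-at-a-time finishes one document's score before moving on. The document-at-a-time version here uses dictionary lookups for brevity, whereas a real implementation would walk the sorted posting lists in parallel.

```python
from collections import defaultdict

# Hypothetical index: term -> [(doc_id, weight), ...], sorted by doc_id.
INDEX = {
    "bank": [(1, 1.0), (2, 0.5)],
    "funds": [(2, 2.0), (3, 1.0)],
}


def by_score(items):
    """Sort (doc_id, score) pairs by descending score, then doc_id."""
    return sorted(items, key=lambda item: (-item[1], item[0]))


def term_at_a_time(query_terms, index):
    """Process each posting list in full, adding into per-document accumulators."""
    accumulators = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in index.get(term, []):
            accumulators[doc_id] += weight
    return by_score(accumulators.items())


def document_at_a_time(query_terms, index):
    """Compute each candidate document's complete score before the next one."""
    postings = {t: dict(index.get(t, [])) for t in query_terms}
    doc_ids = sorted({d for p in postings.values() for d in p})
    scored = [(d, sum(p.get(d, 0.0) for p in postings.values())) for d in doc_ids]
    return by_score(scored)


print(term_at_a_time(["bank", "funds"], INDEX))
# [(2, 2.5), (1, 1.0), (3, 1.0)]
print(document_at_a_time(["bank", "funds"], INDEX))
# [(2, 2.5), (1, 1.0), (3, 1.0)]
```

Both orders give the same ranking for distinct query terms. They differ in memory use and in which optimizations they allow, for example stopping early once the top results cannot change.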
Evaluation
 Logging
 Logging user queries and interaction is crucial for
improving search effectiveness and efficiency
 Query logs and clickthrough data used for query
suggestion, spell checking, query caching, ranking,
advertising search, and other components
 Ranking analysis
 Measuring and tuning ranking effectiveness
 Performance analysis
 Measuring and tuning system efficiency
How Does It Really Work?
 This course explains these components of a search
engine in more detail
 Often many possible approaches and techniques for a
given component
 Focus is on the most important alternatives
 i.e., explain a small number of approaches in detail rather than
many approaches
 “Importance” based on research results and use in actual search
engines
 Alternatives described in references
Thank you

Web Search and Mining

  • 1. Web Search Overview & CrawlingWeb Search Overview & Crawling By SATHISHKUMAR G (sathishsak111@gmail.com) Web Search and Mining
  • 2. Web Search Overview & CrawlingWeb Search Overview & Crawling Algorithmic results. Paid Search Ads
  • 3. Web Search Overview & CrawlingWeb Search Overview & Crawling Search and Information Retrieval  Search on the Web is a daily activity for many people throughout the world  Search and communication are most popular uses of the computer  Applications involving search are everywhere  The field of computer science that is most involved with R&D for search is information retrieval (IR) Search & IR
  • 4. Web Search Overview & CrawlingWeb Search Overview & Crawling Information Retrieval  “Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.” (Salton, 1968)  General definition that can be applied to many types of information and search applications  Primary focus of IR since the 50s has been on text and documents IR
  • 5. Web Search Overview & CrawlingWeb Search Overview & Crawling What is a Document?  Examples:  web pages, email, books, news stories, scholarly papers, text messages, Word™, Powerpoint™, PDF, forum postings, patents, etc.  Common properties  Significant text content  Some structure (e.g., title, author, date for papers; subject, sender, destination for email) IR
  • 6. Web Search Overview & CrawlingWeb Search Overview & Crawling Documents vs. Database Records  Database records (or tuples in relational databases) are typically made up of well-defined fields (or attributes)  e.g., bank records with account numbers, balances, names, addresses, social security numbers, dates of birth, etc.  Easy to compare fields with well-defined semantics to queries in order to find matches  Text is more difficult IR
  • 7. Web Search Overview & CrawlingWeb Search Overview & Crawling Documents vs. Records  Example bank database query  Find records with balance > $50,000 in branches located in Amherst, MA.  Matches easily found by comparison with field values of records  Example search engine query  bank scandals in western mass  This text must be compared to the text of entire news stories IR
  • 8. Web Search Overview & CrawlingWeb Search Overview & Crawling Comparing Text  Comparing the query text to the document text and determining what is a good match is the core issue of information retrieval  Exact matching of words is not enough  Many different ways to write the same thing in a “natural language” like English  e.g., does a news story containing the text “bank director in Amherst steals funds” match the query?  Some stories will be better matches than others IR
  • 9. Web Search Overview & CrawlingWeb Search Overview & Crawling Dimensions of IR  IR is more than just text, and more than just web search  although these are central  People doing IR work with different media, different types of search applications, and different tasks IR
  • 10. Web Search Overview & CrawlingWeb Search Overview & Crawling Other Media  New applications increasingly involve new media  e.g., video, photos, music, speech  Like text, content is difficult to describe and compare  text may be used to represent them (e.g. tags)  IR approaches to search and evaluation are appropriate IR
  • 11. Web Search Overview & CrawlingWeb Search Overview & Crawling Dimensions of IR Content Applications Tasks Text Web search Ad hoc search Images Vertical search Filtering Video Enterprise search Classification Scanned docs Desktop search Question answering Audio Forum search Music P2P search Literature search IR
  • 12. Web Search Overview & CrawlingWeb Search Overview & Crawling IR Tasks  Ad-hoc search  Find relevant documents for an arbitrary text query  Filtering  Identify relevant user profiles for a new document  Classification  Identify relevant labels for documents  Question answering  Give a specific answer to a question IR
  • 13. Web Search Overview & CrawlingWeb Search Overview & Crawling Big Issues in IR  Relevance  What is it?  Simple (and simplistic) definition: A relevant document contains the information that a person was looking for when they submitted a query to the search engine  Many factors influence a person’s decision about what is relevant: e.g., task, context, novelty, style  Topical relevance (same topic) vs. user relevance (everything else) IR
  • 14. Web Search Overview & CrawlingWeb Search Overview & Crawling Big Issues in IR  Relevance  Retrieval models define a view of relevance  Ranking algorithms used in search engines are based on retrieval models  Most models describe statistical properties of text rather than linguistic  i.e., counting simple text features such as words instead of parsing and analyzing the sentences  Statistical approach to text processing started with Luhn in the 50s  Linguistic features can be part of a statistical model IR
  • 15. Web Search Overview & CrawlingWeb Search Overview & Crawling Big Issues in IR  Evaluation  Experimental procedures and measures for comparing system output with user expectations  Originated in Cranfield experiments in the 60s  Typically use test collection of documents, queries, and relevance judgments  Most commonly used are TREC collections  Recall and precision are two examples of effectiveness measures IR
  • 16. Big Issues in IR: Users and Information Needs
      - Search evaluation is user-centered
      - Keyword queries are often poor descriptions of actual information needs
      - Interaction and context are important for understanding user intent
      - Query refinement techniques such as query expansion, query suggestion, and relevance feedback improve ranking
  • 17. IR and Search Engines
      - A search engine is the practical application of information retrieval techniques to large-scale text collections
      - Web search engines are the best-known examples, but there are many others
      - Open source search engines are important for research and development
        - e.g., Lucene, Lemur/Indri, Galago
      - Big issues include the main IR issues but also some others
  • 18. IR and Search Engines
      - Information Retrieval
        - Relevance: effective ranking
        - Evaluation: testing and measuring
        - Information needs: user interaction
      - Search Engines
        - Performance: efficient search and indexing
        - Incorporating new data: coverage and freshness
        - Scalability: growing with data and users
        - Adaptability: tuning for applications
        - Specific problems: e.g., spam
  • 19. Search Engine Issues (section: Search Engine)
      - Performance
        - Measuring and improving the efficiency of search
          - e.g., reducing response time, increasing query throughput, increasing indexing speed
        - Indexes are data structures designed to improve search efficiency
          - Designing and implementing them are major issues for search engines
  • 20. Search Engine Issues
      - Dynamic data (incorporating new data)
        - The “collection” for most real applications is constantly changing in terms of updates, additions, deletions
          - e.g., web pages
        - Acquiring or “crawling” the documents is a major task
          - Typical measures are coverage (how much has been indexed) and freshness (how recently it was indexed)
        - Updating the indexes while processing queries is also a design issue
  • 21. Search Engine Issues
      - Scalability
        - Making everything work with millions of users every day, and many terabytes of documents
        - Distributed processing is essential
      - Adaptability
        - Changing and tuning search engine components such as ranking algorithm, indexing strategy, and interface for different applications
  • 22. Search Engine Issues
      - Spam
        - For Web search, spam in all its forms is one of the major issues
        - Affects the efficiency of search engines and, more seriously, the effectiveness of the results
        - Many types of spam
          - e.g., spamdexing or term spam, link spam, “optimization”
        - New subfield called adversarial IR, since spammers are “adversaries” with different goals
  • 23. Architecture of a Search Engine (SE): How do search engines like Google work?
  • 24. [Figure labels: Algorithmic results; Paid Search Ads]
  • 25. Architecture
      [Figure: components labeled The Web, Web spider, Indexer, Indexes, Ad indexes, Search, and User. The inset is an example results page for the query “miele” (results 1 - 10 of about 7,310,000, 0.12 seconds) showing algorithmic results from www.miele.com, www.miele.co.uk, www.miele.de, and www.miele.at alongside sponsored links.]
  • 26. Indexing Process
  • 27. Indexing Process
      - Text acquisition: identifies and stores documents for indexing
      - Text transformation: transforms documents into index terms or features
      - Index creation: takes index terms and creates data structures (indexes) to support fast searching
  • 28. Query Process
  • 29. Query Process
      - User interaction: supports creation and refinement of query, display of results
      - Ranking: uses query and indexes to generate ranked list of documents
      - Evaluation: monitors and measures effectiveness and efficiency (primarily offline)
  • 30. Details: Text Acquisition (section: Indexing Process)
      - Crawler
        - Identifies and acquires documents for search engine
        - Many types: web, enterprise, desktop
        - Web crawlers follow links to find documents
          - Must efficiently find huge numbers of web pages (coverage) and keep them up-to-date (freshness)
          - Single site crawlers for site search
          - Topical or focused crawlers for vertical search
        - Document crawlers for enterprise and desktop search
          - Follow links and scan directories
  • 31. Web Crawler
      - Starts with a set of seeds, which are a set of URLs given to it as parameters
      - Seeds are added to a URL request queue
      - Crawler starts fetching pages from the request queue
      - Downloaded pages are parsed to find link tags that might contain other useful URLs to fetch
      - New URLs are added to the crawler’s request queue, or frontier
      - Continues until there are no more new URLs or the disk is full
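The crawl loop described above can be sketched in a few lines. This is a minimal illustration, not a production crawler: the `fetch` callable, the `TOY_WEB` dictionary, and the `max_pages` cap (standing in for "disk full") are assumptions made for the example, and real crawlers also need politeness delays, robots.txt handling, and URL normalization, all omitted here.

```python
from collections import deque
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collects href values from <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Crawl from the seeds; `fetch(url)` returns HTML text or None."""
    frontier = deque(seeds)   # the URL request queue
    seen = set(seeds)         # URLs already queued, to avoid refetching
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:      # page could not be downloaded
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages


# Hypothetical in-memory "web" standing in for real HTTP requests.
TOY_WEB = {
    "http://a.example/": '<a href="http://b.example/">b</a> <a href="http://c.example/">c</a>',
    "http://b.example/": '<a href="http://a.example/">a</a>',
    "http://c.example/": '<a href="http://d.example/">d</a>',
}

crawled = crawl(["http://a.example/"], TOY_WEB.get)
print(list(crawled))
```

Using a first-in, first-out queue gives a breadth-first crawl; swapping in a priority queue is one way to get a focused crawler.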
  • 32. Crawling picture
      [Figure labels: Web; Seed pages; URLs crawled and parsed; URLs frontier; Unseen Web]
  • 33. Crawling the Web
  • 34. Text Acquisition
      - Feeds
        - Real-time streams of documents
          - e.g., web feeds for news, blogs, video, radio, TV
        - RSS is a common standard
          - An RSS “reader” can provide new XML documents to the search engine
      - Conversion
        - Convert a variety of documents into a consistent text plus metadata format
          - e.g., HTML, XML, Word, PDF, etc. → XML
        - Convert text encoding for different languages
          - Using a Unicode encoding such as UTF-8
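To make the feed idea concrete, here is a rough sketch of turning RSS items into documents for indexing, using Python's standard `xml.etree.ElementTree`. The sample feed, its URLs, and the output field names (`title`, `url`, `text`) are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed; a real reader would download this periodically.
RSS_SAMPLE = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First story</title>
      <link>http://news.example/1</link>
      <description>Body of the first story.</description>
    </item>
    <item>
      <title>Second story</title>
      <link>http://news.example/2</link>
      <description>Body of the second story.</description>
    </item>
  </channel>
</rss>"""


def feed_to_documents(rss_text):
    """Convert each <item> into a consistent text-plus-metadata record."""
    root = ET.fromstring(rss_text)
    docs = []
    for item in root.iter("item"):
        docs.append({
            "title": item.findtext("title", default=""),
            "url": item.findtext("link", default=""),
            "text": item.findtext("description", default=""),
        })
    return docs


documents = feed_to_documents(RSS_SAMPLE)
print(documents[0]["title"])
```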
  • 35. Text Acquisition
      - Document data store
        - Stores text, metadata, and other related content for documents
          - Metadata is information about a document such as type and creation date
          - Other content includes links, anchor text
        - Provides fast access to document contents for search engine components
          - e.g., result list generation
        - Could use a relational database system
          - More typically, a simpler, more efficient storage system is used due to huge numbers of documents
  • 36. Text Transformation
      - Parser
        - Processing the sequence of text tokens in the document to recognize structural elements
          - e.g., titles, links, headings, etc.
        - Tokenizer recognizes “words” in the text
          - Must consider issues like capitalization, hyphens, apostrophes, non-alpha characters, separators
        - Markup languages such as HTML, XML often used to specify structure
          - Tags used to specify document elements
            - e.g., <h2> Overview </h2>
          - Document parser uses syntax of markup language (or other formatting) to identify structure
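A minimal sketch of the two roles described above, assuming HTML input: a parser that uses tags to label structural elements, and a tokenizer that makes one particular set of choices about capitalization, hyphens, apostrophes, and periods. The class name, the tracked tag set, and the tokenization rules are all illustrative choices, not a standard.

```python
import re
from html.parser import HTMLParser


class StructureParser(HTMLParser):
    """Labels each text run with the structural tag it appeared in."""

    def __init__(self, tags=("title", "h2", "a")):
        super().__init__()
        self.tags = set(tags)
        self.current = None
        self.fields = []  # (element, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.current = tag

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.fields.append((self.current or "body", text))


def tokenize(text):
    """Lowercase, keep internal apostrophes, split on everything else."""
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)?", text.lower())


parser = StructureParser()
parser.feed("<h2> Overview </h2><p>Crawlers follow links.</p>")
print(parser.fields)

tokens = tokenize("Don't over-tokenize U.S.A. names")
print(tokens)
```

Note how these rules split "U.S.A." into three one-letter tokens and break the hyphenated word in two, which is exactly the kind of issue a tokenizer design has to weigh.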
  • 37. Text Transformation
      - Stopping
        - Remove common words
          - e.g., “and”, “or”, “the”, “in”
        - Some impact on efficiency and effectiveness
        - Can be a problem for some queries
      - Stemming
        - Group words derived from a common stem
          - e.g., “computer”, “computers”, “computing”, “compute”
        - Usually effective, but not for all queries
        - Benefits vary for different languages
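Stopping and stemming can be illustrated with a toy example. The stopword set is just the four words listed above, and the suffix rules are a deliberately crude stand-in chosen to group the "computer" examples; they are not the Porter stemmer or any other published algorithm.

```python
# Stopword list limited to the examples given above (an assumption).
STOPWORDS = {"and", "or", "the", "in"}

# Toy suffix rules, tried in order; not a real stemming algorithm.
SUFFIXES = ("ing", "ers", "er", "es", "s", "e")


def stem(word):
    """Strip the first matching suffix if at least 4 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def transform(tokens):
    """Apply stopping, then stemming, to a list of lowercase tokens."""
    return [stem(t) for t in tokens if t not in STOPWORDS]


stems = {stem(w) for w in ["computer", "computers", "computing", "compute"]}
print(stems)
print(transform(["the", "computers", "and", "computing"]))
```

All four example words collapse to the same stem, so a query for any one of them can match documents containing the others. Rules this blunt will also conflate unrelated words, which is one reason stemming does not help every query.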
  • 38. Text Transformation
      - Link Analysis
        - Makes use of links and anchor text in web pages
        - Link analysis identifies popularity and community information
          - e.g., PageRank
        - Anchor text can significantly enhance the representation of pages pointed to by links
        - Significant impact on web search
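As an illustration of link-based popularity, here is a small power-iteration sketch of PageRank over a toy link graph. The graph, the damping value of 0.85 (a commonly quoted choice), the fixed iteration count, and the handling of pages without out-links are all assumptions of this example rather than details given above.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict {page: [pages it links to]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page with no out-links: spread its rank over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is linked to by a, b, and d.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page "c" ends up with the highest score because it receives links from three pages, while "d", which nothing links to, keeps only the baseline share.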
  • 39. Text Transformation
      - Information Extraction
        - Identify classes of index terms that are important for some applications
        - e.g., named entity recognizers identify classes such as people, locations, companies, dates, etc.
      - Classifier
        - Identifies class-related metadata for documents
          - i.e., assigns labels to documents
          - e.g., topics, reading levels, sentiment, genre
        - Use depends on application
  • 40. Index Creation
      - Document Statistics
        - Gathers counts and positions of words and other features
        - Used in ranking algorithm
      - Weighting
        - Computes weights for index terms
        - Used in ranking algorithm
        - e.g., tf.idf weight
          - Combination of term frequency in document and inverse document frequency in the collection
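One simple way to combine term frequency with inverse document frequency is sketched below. Many tf.idf variants exist; this one assumes raw term counts multiplied by log(N / df), and the three toy documents are invented for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: {doc_id: [tokens]} -> {doc_id: {term: weight}}."""
    n = len(docs)
    df = Counter()  # number of documents containing each term
    for tokens in docs.values():
        df.update(set(tokens))
    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)  # term counts within this document
        weights[doc_id] = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return weights


# Hypothetical, already tokenized collection.
toy_docs = {
    "d1": ["web", "search", "web", "crawler"],
    "d2": ["web", "index"],
    "d3": ["search", "ranking"],
}
w = tf_idf(toy_docs)
for term, weight in sorted(w["d1"].items()):
    print(term, round(weight, 3))
```

In "d1", the term "crawler" gets the largest weight even though it occurs once, because it appears in no other document; "web" outweighs "search" only because it occurs twice.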
  • 41. Index Creation
      - Inversion
        - Core of indexing process
        - Converts document-term information to term-document for indexing
          - Difficult for very large numbers of documents
        - Format of inverted file is designed for fast query processing
          - Must also handle updates
          - Compression used for efficiency
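Inversion, in its simplest in-memory form, looks roughly like the sketch below: document-to-term data goes in, and term-to-document posting lists (here with word positions) come out. The function name and toy documents are assumptions, and the hard parts noted above (scale, updates, compression) are not addressed.

```python
from collections import defaultdict


def invert(docs):
    """docs: {doc_id: [tokens]} -> {term: [(doc_id, [positions]), ...]}."""
    index = defaultdict(dict)
    for doc_id in sorted(docs):
        for position, term in enumerate(docs[doc_id]):
            index[term].setdefault(doc_id, []).append(position)
    # Posting lists are kept sorted by document id.
    return {term: sorted(postings.items()) for term, postings in index.items()}


# Hypothetical, already tokenized documents.
toy_docs = {
    "d1": ["web", "search", "web"],
    "d2": ["search", "engine"],
}
inverted = invert(toy_docs)
print(inverted["search"])
```

Looking up a query term is now a single dictionary access instead of a scan over every document.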
  • 42. Index Creation
      - Index Distribution
        - Distributes indexes across multiple computers and/or multiple sites
        - Essential for fast query processing with large numbers of documents
        - Many variations
          - Document distribution, term distribution, replication
        - P2P and distributed IR involve search across multiple sites
  • 43. User Interaction (section: Query Process)
      - Query input
        - Provides interface and parser for query language
        - Most web queries are very simple, other applications may use forms
        - Query language used to describe more complex queries and results of query transformation
          - e.g., Boolean queries
          - Similar to SQL language used in database applications
          - IR query languages also allow content and structure specifications, but focus on content
  • 44. User Interaction
      - Query transformation
        - Improves initial query, both before and after initial search
        - Includes text transformation techniques used for documents
        - Spell checking and query suggestion provide alternatives to original query
        - Query expansion and relevance feedback modify the original query with additional terms
  • 45. User Interaction
      - Results output
        - Constructs the display of ranked documents for a query
        - Generates snippets to show how queries match documents
        - Highlights important words and passages
        - Retrieves appropriate advertising in many applications
        - May provide clustering and other visualization tools
  • 46. Ranking
      - Scoring
        - Calculates scores for documents using a ranking algorithm
        - Core component of search engine
        - Basic form of score is $\sum_i q_i d_i$
          - $q_i$ and $d_i$ are query and document term weights for term $i$
        - Many variations of ranking algorithms and retrieval models
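The basic score above is a sum, over terms, of the query weight times the document weight. A minimal sketch, assuming term weights are stored in plain dictionaries and using made-up weights:

```python
def score(query_weights, doc_weights):
    """Sum of q_i * d_i over the query's terms (absent terms weigh 0)."""
    return sum(q * doc_weights.get(term, 0.0) for term, q in query_weights.items())


def rank(query_weights, collection):
    """Return (score, doc_id) pairs, best first; ties broken by doc_id."""
    scored = [(score(query_weights, d), doc_id) for doc_id, d in collection.items()]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical term weights, e.g. as produced by a tf.idf weighting step.
query = {"web": 1.0, "crawler": 1.0}
collection = {
    "d1": {"web": 0.8, "crawler": 1.1},
    "d2": {"web": 0.4, "index": 1.1},
    "d3": {"search": 0.4},
}
ranking = rank(query, collection)
for s, doc_id in ranking:
    print(doc_id, round(s, 2))
```

Only terms that occur in both the query and the document contribute, which is why "d3" scores zero here.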
  • 47. Ranking
      - Performance optimization
        - Designing ranking algorithms for efficient processing
        - Term-at-a-time vs. document-at-a-time processing
        - Safe vs. unsafe optimizations
      - Distribution
        - Processing queries in a distributed environment
        - Query broker distributes queries and assembles results
        - Caching is a form of distributed searching
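The difference between term-at-a-time and document-at-a-time processing is easiest to see side by side. The sketch below assumes an index of the form {term: [(doc_id, weight), ...]} with posting lists sorted by document ID and every query term weighted 1; both function names and the toy index are invented for the example.

```python
from collections import defaultdict


def term_at_a_time(query_terms, index):
    """Process one whole posting list before moving to the next term."""
    accumulators = defaultdict(float)  # partial scores per document
    for term in query_terms:
        for doc_id, weight in index.get(term, []):
            accumulators[doc_id] += weight
    return dict(accumulators)


def document_at_a_time(query_terms, index):
    """Walk all posting lists in parallel; finish one document at a time."""
    lists = [index.get(term, []) for term in query_terms]
    cursors = [0] * len(lists)
    scores = {}
    while True:
        heads = [lst[c][0] for lst, c in zip(lists, cursors) if c < len(lst)]
        if not heads:
            break
        doc_id = min(heads)  # next unscored document across all lists
        total = 0.0
        for i, lst in enumerate(lists):
            if cursors[i] < len(lst) and lst[cursors[i]][0] == doc_id:
                total += lst[cursors[i]][1]
                cursors[i] += 1
        scores[doc_id] = total  # this document's score is now final
    return scores


# Hypothetical index with posting lists sorted by document id.
toy_index = {
    "web": [(1, 0.5), (2, 0.3)],
    "crawler": [(1, 1.0), (3, 0.7)],
}
taat = term_at_a_time(["web", "crawler"], toy_index)
daat = document_at_a_time(["web", "crawler"], toy_index)
print(taat)
print(daat)
```

Both strategies produce the same scores. Term-at-a-time needs an accumulator for every candidate document, while document-at-a-time finishes each document before moving on, which makes it easier to keep only the current top results.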
  • 48. Evaluation
      - Logging
        - Logging user queries and interaction is crucial for improving search effectiveness and efficiency
        - Query logs and clickthrough data used for query suggestion, spell checking, query caching, ranking, advertising search, and other components
      - Ranking analysis
        - Measuring and tuning ranking effectiveness
      - Performance analysis
        - Measuring and tuning system efficiency
  • 49. How Does It Really Work?
      - This course explains these components of a search engine in more detail
      - There are often many possible approaches and techniques for a given component
        - Focus is on the most important alternatives
          - i.e., explain a small number of approaches in detail rather than many approaches
        - “Importance” is based on research results and use in actual search engines
        - Alternatives are described in references
  • 50. Thank you