How search engines work Anand Saini
Helping people find what they’re looking for
   Starts with an “information need”
   Convert to a query
   Gets results
In the materials available
   Web pages
   Other formats
   Deep Web
 Search can’t find what’s not there
    The content is hugely important
 Information Architecture is vital
 Usable sites have good navigation and structure
Index ahead of time
  • Find files or records
  • Open each one and read it
  • Store each word in a searchable index
Provide search forms
  • Match the query terms with words in the index
  • Sort documents by relevance
Display results
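
The index-then-match flow above can be sketched in a few lines. This is a minimal illustration, not a production design: the sample documents and the function names (`tokenize`, `build_index`, `search`) are assumptions, and relevance sorting is left out (results are simply sorted by ID).

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Index ahead of time: map each word to the IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return IDs of documents that contain every query word, sorted by ID."""
    words = tokenize(query)
    if not words:
        return []
    # .get avoids creating empty entries for words that were never indexed.
    postings = [index.get(word, set()) for word in words]
    return sorted(set.intersection(*postings))

# Hypothetical sample collection.
docs = {
    "a.html": "Cruise ships and cruise amenities",
    "b.html": "Bank accounts and bank loans",
    "c.html": "River bank cruise",
}
index = build_index(docs)
print(search(index, "bank cruise"))  # ['c.html']
```

The point of building the index first is that a query only touches the word lists it needs, never the documents themselves.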
Like an iceberg, 2/3 below water
[Figure: iceberg diagram. Labels, top to bottom: user interface; content and search functionality.]
•   Text search works for structured content
•   Keyword search vs. SQL queries
•   Approximate vs. exact match
•   Multiple sources of content
•   Response time and database resources
•   Relevance ranking, very important
•   Works in the real world (e.g. eBay)
Users blame the search engine
   Even when the content is unavailable
Understand the scope of site or intranet
     Kinds of information
     Divided sites: products / corporate info
     Dates
     Languages
     Sources and data silos: CMSs, databases...
     Update processes
Store text to search it later
Many ways to gather text
     Crawl (spider) via HTTP
     Read files on file servers
     Access databases (HTTP or API)
     Data silos via local APIs
     Applications, CMSs, via Web Services
Security and Access Control
 Basic information for document or record
   • File name / URL / record ID
   • Title or equivalent
   • Size, date, MIME type
 Full text of item
 More metadata
   • Product name, picture ID
   • Category, topic, or subject
   • Other attributes, for relevance ranking and display
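
One way to picture the stored record is as a small data class. This is a sketch only: the class name `IndexRecord`, its field names, and the sample values are assumptions chosen to mirror the bullets above.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class IndexRecord:
    # Basic information for the document or record
    url: str
    title: str
    size_bytes: int
    modified: date
    mime_type: str
    # Full text of the item
    text: str = ""
    # More metadata: category, product name, and other ranking/display attributes
    metadata: dict = field(default_factory=dict)

# Hypothetical example record.
record = IndexRecord(
    url="https://example.com/cruises/alaska",
    title="Alaska Cruises",
    size_bytes=18432,
    modified=date(2024, 1, 15),
    mime_type="text/html",
    text="Seven-night Alaska cruises with glacier viewing.",
    metadata={"category": "cruises", "product_name": "Alaska 7-night"},
)
print(record.title, record.metadata["category"])
```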
Stop words
Stemming
Metadata
   Explicit (tags)
   Implicit (context)
Semantics
   CMS and Database fields
   XML tags and attributes
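
Stop-word removal and stemming can be illustrated with a deliberately crude sketch. The stop-word list and the suffix-stripping rule below are assumptions for demonstration; real engines typically use a proper stemmer such as the Porter algorithm.

```python
import re

STOP_WORDS = {"a", "an", "and", "of", "the", "to", "in"}  # illustrative list only
SUFFIXES = ("ing", "es", "s")  # crude rule set, checked in this order

def naive_stem(word):
    """Strip one common suffix, keeping at least three letters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Lowercase, split into words, drop stop words, then stem what is left."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOP_WORDS]

print(index_terms("Searching the indexes of the banks"))
# ['search', 'index', 'bank']
```

The same function has to run on queries as well as documents, otherwise "banks" in a query would never match "bank" in the index.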
What happens after you click the search button and
 before retrieval starts.
Usually in this order
     Handle character set, maybe language
     Look for operators and organize the query
     Look for field names or metadata
     Extract words (just like the indexer)
     Deal with letter casing
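
The query-processing order above might look like this in code. It is a sketch under assumptions: the quoted-phrase and `field:value` syntax is a common convention rather than something the slides specify, and `parse_query` is a hypothetical name.

```python
import re
import unicodedata

def parse_query(raw):
    """Split a raw query into phrases, field filters, and plain words."""
    # Character handling: NFKC folds compatibility forms into a standard form.
    text = unicodedata.normalize("NFKC", raw)
    # Operators: pull out quoted phrases first.
    phrases = [p.lower() for p in re.findall(r'"([^"]+)"', text)]
    text = re.sub(r'"[^"]+"', " ", text)
    # Field names: tokens shaped like field:value.
    fields = {}
    for name, value in re.findall(r"(\w+):(\w+)", text):
        fields[name.lower()] = value.lower()
    text = re.sub(r"\w+:\w+", " ", text)
    # Extract the remaining words and fold letter case.
    words = re.findall(r"\w+", text.lower())
    return {"phrases": phrases, "fields": fields, "words": words}

print(parse_query('title:Cruise "river bank" Alaska'))
# {'phrases': ['river bank'], 'fields': {'title': 'cruise'}, 'words': ['alaska']}
```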
• Retrieval: find files with query terms
• Not the same as relevance ranking
  Recall: find all relevant items
  Precision: find only relevant items
  Increasing one tends to decrease the other
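
A small worked example makes the trade-off concrete. The document IDs and the two result sets are made up; only the definitions of precision and recall come from the slide.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved; recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {"d1", "d2", "d3", "d4"}                          # what the user would want
narrow = ["d1", "d2"]                                        # strict matching
broad = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]     # loose matching

print(precision_recall(narrow, relevant))  # (1.0, 0.5)
print(precision_recall(broad, relevant))   # (0.5, 1.0)
```

The narrow search returns only good items but misses half of them; the broad search finds everything but half of what it returns is noise.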
Single-word queries
   Find items containing that word
Multi-word queries: combine lists
   Any: every item with any query word
   All: only items with every word
   Phrases: find only items with all words in order
Boolean and complex queries
  – Use algorithm to combine lists
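
Combining the per-word lists is mostly set arithmetic. The sketch below assumes a positional index (word to document to list of positions) so that phrases can be checked; the sample documents and function names are illustrative.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map word -> {doc_id: [positions]} for lowercased, whitespace-split text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def match_any(index, words):
    """Any: union of the lists for each word."""
    return set().union(*(index.get(w, {}).keys() for w in words))

def match_all(index, words):
    """All: intersection of the lists for each word."""
    lists = [set(index.get(w, {})) for w in words]
    return set.intersection(*lists) if lists else set()

def match_phrase(index, words):
    """Phrase: documents where the words appear at consecutive positions."""
    results = set()
    for doc_id in match_all(index, words):
        starts = index[words[0]][doc_id]
        if any(all(start + i in index[w][doc_id] for i, w in enumerate(words))
               for start in starts):
            results.add(doc_id)
    return results

docs = {1: "river bank erosion", 2: "bank of the river", 3: "savings bank"}
idx = build_positional_index(docs)
query = ["river", "bank"]
print(match_any(idx, query))     # {1, 2, 3}
print(match_all(idx, query))     # {1, 2}
print(match_phrase(idx, query))  # {1}
```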
•   Empty search
•   Nothing on the site on that topic (scope)
•   Misspelling or typing mistakes
•   Vocabulary differences
•   Restrictive search defaults
•   Restrictive search choices
•   Software failure
Theory: sort the matching items, so the most
 relevant ones appear first
Can't really know what the user wants
Relevance is hard to define and situational
Short queries tend to be deeply ambiguous
  What do people mean when they type “bank”?
First 10 results are the most important
The more transparent, the better
 Sorting documents on various criteria
 Start with words matching query terms
 Citation and link analysis
   Like old library Citation Indexes
   Ted Nelson - not only hypertext, but the links
   Google PageRank
      Incoming links
      Authority of linkers
 Taxonomies and external metadata
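
The idea behind link analysis (incoming links, weighted by the authority of the linking page) can be shown with the simplified textbook form of PageRank. This is a sketch, not Google's production algorithm; the damping factor of 0.85, the iteration count, and the tiny link graph are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page site.
links = {
    "home": ["products", "about"],
    "products": ["home"],
    "about": ["home"],
    "blog": ["home"],
}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```

In this graph "home" ends up on top because every other page links to it, while "blog" stays at the baseline because nothing links to it.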
• Term frequency in the item
• Inverse document frequency of term
   Rare words are likely to be more important
   wij = weight of Term Tj in Document Di
   tfij = frequency of Term Tj in Document Di
   N = number of Documents in collection
   n = number of Documents where Term Tj occurs at least once

   From Salton 1989
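
The formula itself did not survive in the text. Assuming it is the common tf x idf weighting, wij = tfij * log(N / n), the symbols defined above can be computed as follows. The natural logarithm, the whitespace tokenizer, and the sample documents are assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return {doc_id: {term: weight}} using w = tf * log(N / n)."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    N = len(tokenized)  # number of documents in the collection
    # n[term] = number of documents where the term occurs at least once.
    n = Counter(term for words in tokenized.values() for term in set(words))
    weights = {}
    for doc_id, words in tokenized.items():
        tf = Counter(words)  # term frequency within this document
        weights[doc_id] = {t: tf[t] * math.log(N / n[t]) for t in tf}
    return weights

docs = {"d1": "bank bank loan", "d2": "river bank", "d3": "river cruise"}
w = tf_idf(docs)
print(round(w["d1"]["bank"], 3))  # 0.811
print(round(w["d1"]["loan"], 3))  # 1.099
```

"loan" appears once in d1 and "bank" twice, yet "loan" gets the higher weight because it is rarer across the collection.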
•   Vector space
•   Probabilistic (binary independence)
•   Fuzzy set theory
•   Bayesian statistical analysis
•   Latent semantic indexing
•   Neural networks
•   Machine learning
•   All require sophisticated queries
•   See MIR, chapter 2
Heuristics are rules of thumb
  • Not algorithms, not math
Search Relevance Ranking Heuristics
  •   Documents containing all search words
  •   Search words as a phrase
  •   Matches in title tag
  •   Matches in other metadata
Based on real-world user behavior
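
The heuristics listed above amount to adding points for each signal. In this sketch the weights, the document fields (`title`, `text`, `keywords`), and the sample documents are all assumptions; in practice the weights are tuned by hand against real queries.

```python
# Assumed weights; tune against real search logs.
WEIGHTS = {"all_words": 3.0, "phrase": 4.0, "title": 2.0, "metadata": 1.0}

def heuristic_score(query, doc):
    """Score a doc dict that has 'title', 'text', and 'keywords' strings."""
    words = query.lower().split()
    if not words:
        return 0.0
    title_words = doc["title"].lower().split()
    text = doc["text"].lower()
    text_words = set(text.split())
    keyword_words = doc["keywords"].lower().split()
    score = 0.0
    if all(w in text_words for w in words):
        score += WEIGHTS["all_words"]   # document contains all search words
    if " ".join(words) in text:
        score += WEIGHTS["phrase"]      # simple substring test for the phrase
    score += WEIGHTS["title"] * sum(w in title_words for w in words)
    score += WEIGHTS["metadata"] * sum(w in keyword_words for w in words)
    return score

doc_a = {"title": "River bank erosion", "text": "the river bank is eroding",
         "keywords": "geology river"}
doc_b = {"title": "Opening an account", "text": "visit a bank near the river",
         "keywords": "finance"}
print(heuristic_score("river bank", doc_a))  # 12.0
print(heuristic_score("river bank", doc_b))  # 3.0
```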
What users see after they click the Search button
The most visible part of search
Elements of the results page
     Page layout and navigation
     Results header
     List of results items
     Results footer
Human judgment beats algorithms
Great for frequent, ambiguous searches
   Use search log to identify best candidates
Recommend good starting pages
      Product information, FAQs, etc.
Requires human resources
   That means money and time
More static than algorithmic search
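
Hand-picked recommendations like these (often called best bets) are usually just a lookup table placed in front of the algorithmic results. The table contents, URLs, and function names below are hypothetical.

```python
# Hand-maintained table: normalized query -> recommended starting pages.
BEST_BETS = {
    "returns": ["https://example.com/help/returns"],
    "contact": ["https://example.com/contact"],
}

def search_with_best_bets(query, algorithmic_search):
    """Put editor-chosen pages first, then the engine's results without duplicates."""
    key = " ".join(query.lower().split())
    recommended = BEST_BETS.get(key, [])
    organic = [url for url in algorithmic_search(query) if url not in recommended]
    return recommended + organic

def fake_engine(query):
    """Stand-in for the real ranked search."""
    return ["https://example.com/blog/returns-story",
            "https://example.com/help/returns"]

results = search_with_best_bets("  Returns ", fake_engine)
print(results)
```

The search log shows which frequent queries deserve an entry; the cost is that someone has to keep the table current.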
 Leverage content structure
    database fields (e.g. cruise amenities)
    document metadata (news article bylines)
 Provide both search and browse
      Support information foraging
      Integrate navigation with results
      Not just subject taxonomies
      Display only fruitful paths, no dead ends
 Supported by academic research
    Marti Hearst, UCB SIMS, flamenco.berkeley.edu
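
Faceted navigation can be sketched as counting metadata values over the current result set, so that only values that actually occur are offered (no dead ends). The result records and field names here are made up.

```python
from collections import Counter

def facet_counts(results, facet_fields):
    """Count how many results carry each value of each facet field."""
    counts = {f: Counter() for f in facet_fields}
    for item in results:
        for f in facet_fields:
            value = item.get(f)
            if value is not None:
                counts[f][value] += 1
    return counts

def refine(results, field, value):
    """Narrow the result set to items with the chosen facet value."""
    return [item for item in results if item.get(field) == value]

# Hypothetical search results with structured fields.
results = [
    {"title": "Glacier Explorer", "region": "Alaska", "amenity": "spa"},
    {"title": "Inside Passage", "region": "Alaska", "amenity": "pool"},
    {"title": "Island Hopper", "region": "Caribbean", "amenity": "spa"},
]
counts = facet_counts(results, ["region", "amenity"])
print(dict(counts["region"]))   # {'Alaska': 2, 'Caribbean': 1}
print([r["title"] for r in refine(results, "amenity", "spa")])
```

Because the counts come from the results themselves, a region with zero matching cruises never shows up as a clickable option.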
Metrics
     Number of searches
     Number of no-match searches
     Traffic from search to high-value pages
     Relate search changes to other metrics
Search Log Analysis
   Top 5% searches: phrases and words
   Top no-match searches
        Use as market research
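
These metrics fall out of a simple pass over the search log. The sketch assumes each log entry is a (query, result count) pair; the log format, sample entries, and function name are illustrative.

```python
from collections import Counter

def analyze_log(entries, top=3):
    """entries: iterable of (query, result_count) pairs."""
    # Normalize case and spacing so "Bank" and "bank" count as one query.
    normalized = [(" ".join(q.lower().split()), hits) for q, hits in entries]
    all_queries = Counter(q for q, _ in normalized)
    no_match = Counter(q for q, hits in normalized if hits == 0)
    return {
        "searches": len(normalized),
        "no_match_searches": sum(no_match.values()),
        "top_queries": all_queries.most_common(top),
        "top_no_match": no_match.most_common(top),
    }

# Hypothetical log entries.
log = [
    ("bank", 12), ("Bank", 12), ("cruise deals", 5),
    ("gift card", 0), ("gift  card", 0), ("giftcard", 0),
]
report = analyze_log(log)
print(report["searches"], report["no_match_searches"])
print(report["top_no_match"])
```

Here the no-match list shows people looking for gift cards that the site does not describe, which is the market-research signal the slide refers to.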
Search engines can’t read minds
   User queries are short and ambiguous
Some things will help
     Design a usable interface
     Show match words in context
     Keep index current and complete
     Adjust heuristic weighting
     Maintain suggestions and synonyms
     Consider faceted metadata search
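
Showing match words in context is one of the cheaper improvements in the list above. A minimal sketch, assuming plain-text documents and bracket markers for the highlighted word (a real results page would use markup instead):

```python
import re

def snippet(text, query, width=30):
    """Return an excerpt around the first query-word match, with the match in brackets."""
    words = [re.escape(w) for w in query.split()]
    if not words:
        return text[: 2 * width]
    match = re.search(r"\b(" + "|".join(words) + r")\b", text, re.IGNORECASE)
    if match is None:
        # No match in the text: fall back to the opening of the document.
        return text[: 2 * width]
    start = max(0, match.start() - width)
    end = min(len(text), match.end() + width)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return (prefix + text[start:match.start()] + "[" + match.group(0) + "]"
            + text[match.end():end] + suffix)

print(snippet("The river bank flooded", "bank", width=6))
# ...river [bank] flood...
```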
Join us
Address: WZ-30-a, Bhagwan Das Nagar
East Punjabi Bagh, Delhi-110026
Tel.: 011 28316148, 3203571, 30538061
Mobile: +91-8010 298 388, 8010 198 388
E-mail: info@seocertification.org.in

