How Web Search Engines Work?
Apurva Jadhav, apurvajadhav[at]gmail[dot]com
Outline
- Text Search Indexing
- Query Processing
- Relevance Ranking
- Vector Space Model
- Performance Measures
- Link Analysis to Rank Web Pages
Text Search
- Given query q “honda car”, find documents that contain the terms honda and car
- Text search index structure: the inverted index
- Steps for constructing an inverted index:
  - Document preprocessing: tokenization, stemming, removal of stop words
  - Indexing
- [Pipeline diagram: documents → Tokenizer → Stemming → Indexer → Inverted Index]
Document Preprocessing
Includes the following steps:
- Removing all HTML tags
- Tokenization: break the document into its constituent words or terms
- Removal of common stop words such as the, a, an, at
- Stemming: replace words with their root (shirts → shirt)
- Assign a unique token ID to each token
- Assign a unique document ID to each document
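A minimal sketch of this preprocessing in Java (the slide's pipeline is language-agnostic; Java is used here only because the deck points to Lucene and Nutch). HTML stripping is omitted, the stop-word list contains just the slide's examples, and the one-line suffix rule stands in for a real stemmer such as Porter's:

```java
import java.util.*;

public class Preprocessor {
    // Illustrative stop-word list (the slide's examples); real systems use a much larger one.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "at"));

    /** Lowercases, tokenizes, drops stop words, and applies a toy stemming rule. */
    public static List<String> preprocess(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) continue;
            // Toy stemmer: strip a trailing "s" (shirts -> shirt).
            // A real system would use the Porter or Snowball stemmer instead.
            if (token.length() > 3 && token.endsWith("s")) {
                token = token.substring(0, token.length() - 1);
            }
            terms.add(token);
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(preprocess("The quick brown fox jumps over the lazy dog"));
        // [quick, brown, fox, jump, over, lazy, dog]
    }
}
```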
Inverted Index
- For each term t, the inverted index stores a list of the IDs of the documents that contain t
- Example documents:
  - Doc1: The quick brown fox jumps over the lazy dog
  - Doc2: Fox News is the number one cable news channel
- Example postings lists: fox → Doc1, Doc2; dog → Doc1
- Each postings list is sorted by document ID
- Supports advanced query operators such as AND, OR, NOT
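A minimal in-memory sketch of such an index, assuming documents have already been preprocessed into term lists and are added in increasing doc-ID order so every postings list stays sorted:

```java
import java.util.*;

public class InvertedIndex {
    // term -> sorted list of the IDs of documents containing the term
    private final Map<String, List<Integer>> postings = new HashMap<>();

    /** Adds a document; docIds must be added in increasing order to keep postings sorted. */
    public void add(int docId, List<String> terms) {
        for (String term : new TreeSet<>(terms)) {          // de-duplicate terms within the doc
            List<Integer> list = postings.computeIfAbsent(term, t -> new ArrayList<>());
            if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                list.add(docId);
            }
        }
    }

    public List<Integer> postingsFor(String term) {
        return postings.getOrDefault(term, Collections.emptyList());
    }

    public static void main(String[] args) {
        InvertedIndex index = new InvertedIndex();
        index.add(1, Arrays.asList("quick", "brown", "fox", "jumps", "lazy", "dog"));
        index.add(2, Arrays.asList("fox", "news", "number", "one", "cable", "channel"));
        System.out.println(index.postingsFor("fox")); // [1, 2]
        System.out.println(index.postingsFor("dog")); // [1]
    }
}
```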
Query Processing
- Consider the query honda car
- Retrieve the postings list for honda
- Retrieve the postings list for car
- Merge the two postings lists
- Both postings lists are sorted by doc ID, so if their lengths are m and n it takes O(m + n) time to merge them
- Example: merging the postings lists for honda and car yields the documents 8 and 16 that appear in both
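A sketch of the two-pointer merge for an AND query over two sorted postings lists, running in O(m + n); the lists in main are hypothetical, chosen to reproduce the slide's result of documents 8 and 16:

```java
import java.util.*;

public class PostingsMerge {
    /** Intersects two postings lists sorted in increasing doc-ID order. */
    public static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int da = a.get(i), db = b.get(j);
            if (da == db) { result.add(da); i++; j++; }   // document contains both terms
            else if (da < db) i++;                        // advance the pointer at the smaller ID
            else j++;
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> honda = Arrays.asList(1, 2, 4, 8, 16);  // hypothetical postings
        List<Integer> car   = Arrays.asList(3, 8, 9, 16);     // hypothetical postings
        System.out.println(intersect(honda, car));            // [8, 16]
    }
}
```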
Inverted Index Construction
Estimate the size of the index:
- Use a 32-bit integer to represent a document ID
- Assume the average number of unique terms in a document is 100, so each document ID occurs in 100 postings lists on average
- Index size ≈ 4 × 100 × (number of documents) bytes
- At Web scale this runs into hundreds of GB, so clearly one cannot hold the index structure in RAM
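For example, under these assumptions an index over 10 million documents takes roughly 4 × 100 × 10^7 bytes = 4 GB, and one over a billion documents roughly 400 GB, which is why the construction described next works from disk.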
Inverted Index Construction
- For each document, output (term, documentID) pairs to a file on disk; note that this file is about the same size as the final index
- Example documents:
  - Doc1: The quick brown fox jumps over the lazy dog
  - Doc2: Fox News is the number one cable news channel
- The pairs appear in document order, e.g. (quick, 1), (brown, 1), (fox, 1), (jumps, 1), (over, 1), ..., (fox, 2), (news, 2), (number, 2), (one, 2), ...
- Sort this file by term using a disk-based external sort
Inverted Index Construction
- After sorting, the pairs are grouped by term, e.g. (brown, 1), (fox, 1), (fox, 2), (jumps, 1), (over, 1), ..., (news, 2), (number, 2), (one, 2), ...
- The sorted result is split into a dictionary file and a postings file
- The postings file stores the document IDs; the dictionary file stores each distinct term together with its offset into the postings file
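A simplified sketch of this final split, assuming the (term, docID) pairs have already been sorted by term (in practice by a disk-based external sort). The file names and the text-based dictionary format are illustrative only; the code appends doc IDs to a postings file and records each term's starting offset in a dictionary file:

```java
import java.io.*;
import java.util.*;

public class IndexWriterSketch {
    /** pairs must be sorted by term; each entry is {term, docId}. */
    public static void write(List<String[]> pairs) throws IOException {
        try (PrintWriter dict = new PrintWriter(new FileWriter("dictionary.txt"));
             RandomAccessFile post = new RandomAccessFile("postings.dat", "rw")) {
            String currentTerm = null;
            for (String[] pair : pairs) {
                String term = pair[0];
                int docId = Integer.parseInt(pair[1]);
                if (!term.equals(currentTerm)) {
                    // New term: record its starting offset in the postings file.
                    dict.println(term + "\t" + post.getFilePointer());
                    currentTerm = term;
                }
                post.writeInt(docId);   // append the doc ID to the postings file
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Already sorted by term, as produced by the external sort.
        write(Arrays.asList(
                new String[]{"brown", "1"}, new String[]{"fox", "1"},
                new String[]{"fox", "2"}, new String[]{"jumps", "1"}));
    }
}
```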
Relevance Ranking
- The inverted index returns a list of documents which contain the query terms
- How do we rank these documents?
  - Use the frequency of query terms
  - Use the importance / rareness of query terms
  - Do query terms occur in the title of the document?
Vector Space Model
- Documents are represented as vectors in a multi-dimensional Euclidean space; each term/word of the vocabulary represents a dimension
- The weight (co-ordinate) of document d along the dimension represented by term t is the product of the following:
  - Term frequency TF(d, t): the number of times term t occurs in document d
  - Inverse document frequency IDF(t): all terms are not equally important; IDF captures the importance or rareness of a term
    IDF(t) = log(1 + |D| / |D_t|), where |D| is the total number of documents and |D_t| is the number of documents which contain term t
- [Diagram: document vector d and query vector q, separated by angle θ, in a space with axes Car and Computer]
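A minimal sketch of computing these TF × IDF weights for one document, assuming the corpus has already been preprocessed into term lists; the IDF formula is the slide's log(1 + |D| / |D_t|):

```java
import java.util.*;

public class TfIdf {
    /** Returns the TF-IDF weight vector of one document, given the whole corpus. */
    public static Map<String, Double> weights(List<String> doc, List<List<String>> corpus) {
        // Document frequency |D_t| for every term that occurs in the corpus.
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> d : corpus) {
            for (String t : new HashSet<>(d)) docFreq.merge(t, 1, Integer::sum);
        }
        // Term frequency TF(d, t) within this document.
        Map<String, Integer> tf = new HashMap<>();
        for (String t : doc) tf.merge(t, 1, Integer::sum);

        Map<String, Double> w = new HashMap<>();
        int D = corpus.size();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            double idf = Math.log(1.0 + (double) D / docFreq.get(e.getKey()));
            w.put(e.getKey(), e.getValue() * idf);   // weight = TF * IDF
        }
        return w;
    }
}
```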
Vector Space Model
- Queries are also represented as term vectors
- Documents are ranked by their proximity to the query vector
- The cosine of the angle between the document vector and the query vector is used to measure their proximity: cos(θ) = d·q / (|d| |q|)
- The smaller the angle between vectors d and q, the more relevant document d is for query q
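A sketch of the cosine measure over the sparse weight maps produced above; at query time each candidate document would be scored this way and the results sorted by decreasing score:

```java
import java.util.Map;

public class Cosine {
    /** Cosine of the angle between two sparse term-weight vectors. */
    public static double similarity(Map<String, Double> d, Map<String, Double> q) {
        double dot = 0, dNorm = 0, qNorm = 0;
        for (Map.Entry<String, Double> e : d.entrySet()) {
            dNorm += e.getValue() * e.getValue();
            Double qw = q.get(e.getKey());
            if (qw != null) dot += e.getValue() * qw;   // only shared terms contribute to d.q
        }
        for (double v : q.values()) qNorm += v * v;
        if (dNorm == 0 || qNorm == 0) return 0;
        return dot / (Math.sqrt(dNorm) * Math.sqrt(qNorm));
    }
}
```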
Performance Measures
- Search engines return a ranked list of result documents for a given query
- To measure accuracy, we use a set of queries Q and a manually identified set of relevant documents Dq for each query q
- Two measures assess the accuracy of a search engine. Let Rel(q, k) be the number of documents relevant to query q returned in the top k positions
  - Recall for query q at position k is the fraction of all relevant documents Dq that are returned in the top k positions: Recall(k) = Rel(q, k) / |Dq|
  - Precision for query q at position k is the fraction of the top k results that are relevant: Precision(k) = Rel(q, k) / k
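A small sketch of both measures, assuming the ranked result list and the manually judged relevant set Dq are given; the ranking and relevance judgments in main are hypothetical:

```java
import java.util.*;

public class Metrics {
    /** Number of relevant documents among the top k results: Rel(q, k). */
    private static int relAtK(List<Integer> ranked, Set<Integer> relevant, int k) {
        int rel = 0;
        for (int i = 0; i < Math.min(k, ranked.size()); i++) {
            if (relevant.contains(ranked.get(i))) rel++;
        }
        return rel;
    }

    public static double precisionAtK(List<Integer> ranked, Set<Integer> relevant, int k) {
        return (double) relAtK(ranked, relevant, k) / k;
    }

    public static double recallAtK(List<Integer> ranked, Set<Integer> relevant, int k) {
        return (double) relAtK(ranked, relevant, k) / relevant.size();
    }

    public static void main(String[] args) {
        List<Integer> ranked = Arrays.asList(7, 3, 9, 4, 1);      // hypothetical ranking
        Set<Integer> dq = new HashSet<>(Arrays.asList(3, 4, 8));  // hypothetical Dq
        System.out.println(precisionAtK(ranked, dq, 3));          // 1/3
        System.out.println(recallAtK(ranked, dq, 3));             // 1/3
    }
}
```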
Challenges in Ranking Web Pages
- Spamming: many Web page authors resort to spam, i.e. adding unrelated words, to rank higher in search engines for certain queries
- Finding authoritative sources: there are thousands of documents that contain the given query terms. Example: for the query ‘yahoo’, www.yahoo.com is the most relevant result
- Anchor text gives important information about a document; it is indexed as part of the document
PageRank
- A measure of the authority or prestige of a Web page, based on the link structure of the Web graph
- A Web page which is linked / cited by many other Web pages is popular and has a higher PageRank
- It is a query-independent, static ranking of Web pages
- Roughly, given two pages both of which contain the query terms, the page with the higher PageRank is more relevant
PageRank
- Web pages link to each other through hyperlinks (hrefs in HTML), so the Web can be visualized as a directed graph in which the Web pages constitute the set of nodes N and the hyperlinks constitute the set of edges E
- Each Web page (node) has a measure of authority or prestige called PageRank
- The PageRank of a page (node) v is proportional to the sum of the PageRanks of all Web pages that link to it:
  p[v] = Σ_{(u,v) ∈ E} p[u] / N_u, where N_u is the number of outlinks of node u
- [Example graph: u1 → v1; u2 → v1, v2; u3 → v2, giving p[v1] = p[u1] + p[u2]/2 and p[v2] = p[u2]/2 + p[u3]]
PageRank Computation
- Consider the N × N link matrix L and the PageRank vector p
- L(u, v) = E(u, v) / N_u, where E(u, v) = 1 iff there is an edge from u to v and N_u is the number of outlinks from node u
- p = L^T p, so the PageRank vector is the principal eigenvector of the link matrix L^T
- PageRank is computed by the power iteration method
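A sketch of the power iteration on a small adjacency-list graph; for brevity it omits the damping factor and the dangling-node handling that a full implementation would add:

```java
import java.util.*;

public class PageRank {
    /**
     * outlinks[u] lists the nodes that u links to.
     * Repeatedly applies p = L^T p, starting from a uniform vector.
     */
    public static double[] compute(int[][] outlinks, int iterations) {
        int n = outlinks.length;
        double[] p = new double[n];
        Arrays.fill(p, 1.0 / n);                          // uniform starting vector
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            for (int u = 0; u < n; u++) {
                if (outlinks[u].length == 0) continue;    // dangling node: ignored in this sketch
                double share = p[u] / outlinks[u].length; // p[u] / N_u
                for (int v : outlinks[u]) next[v] += share;
            }
            p = next;
        }
        return p;
    }

    public static void main(String[] args) {
        // Toy graph: node 0 links to 1 and 2, node 1 links to 2, node 2 links to 0.
        int[][] g = { {1, 2}, {2}, {0} };
        // Converges toward approximately [0.4, 0.2, 0.4].
        System.out.println(Arrays.toString(compute(g, 50)));
    }
}
```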
References
Books and Papers
- S. Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data.
- C. Manning and P. Raghavan. Introduction to Information Retrieval. http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html
- S. Brin and L. Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW7, 1998.
Software
- Nutch is an open source Java Web crawler. http://lucene.apache.org/nutch/about.html
- Lucene is an open source Java text search engine. http://lucene.apache.org/
Introduction
- Web search is the dominant means of online information retrieval; more than 200 million searches are performed each day in the US alone
- The aim of a search engine is to find documents relevant to a user query
- Most search engines try to find and rank documents which contain the query terms
Web Crawlers
- A crawler fetches Web pages. The basic idea is simple, as illustrated by the loop below and the sketch that follows:
  - Add a few seed URLs (e.g. www.yahoo.com) to a queue
  - While the queue is not empty:
    - URL u = queue.remove()
    - Fetch the Web page W(u)
    - Extract all hyperlinks from W(u) and add them to the queue
- To fetch all, or a significant percentage of, Web pages (millions of them) one needs to engineer a large-scale crawler, which has to be distributed and multi-threaded
- Nutch is an open source Web crawler: http://lucene.apache.org/nutch/about.html
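A minimal single-threaded sketch of this loop using only the Java standard library (Java 11+ HttpClient) and a crude regex for link extraction; a real crawler such as Nutch adds politeness delays, robots.txt handling, URL normalization, de-duplication, many threads, and distribution. The seed URL and page limit here are illustrative:

```java
import java.net.URI;
import java.net.http.*;
import java.util.*;
import java.util.regex.*;

public class SimpleCrawler {
    // Crude link extraction; a real crawler parses the HTML properly.
    private static final Pattern HREF = Pattern.compile("href=\"(http[^\"]+)\"");

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        Deque<String> queue = new ArrayDeque<>(List.of("http://www.yahoo.com"));
        Set<String> seen = new HashSet<>(queue);
        int fetched = 0;

        while (!queue.isEmpty() && fetched < 10) {          // small limit for this sketch
            String url = queue.remove();
            try {
                HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
                String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
                fetched++;
                // Extract hyperlinks and enqueue the ones we have not seen yet.
                Matcher m = HREF.matcher(body);
                while (m.find()) {
                    String link = m.group(1);
                    if (seen.add(link)) queue.add(link);
                }
                System.out.println("Fetched " + url + " (" + body.length() + " bytes)");
            } catch (Exception e) {
                System.err.println("Skipping " + url + ": " + e.getMessage());
            }
        }
    }
}
```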
