Phrase Based Indexing By Bala Abirami
Introduction of Phrase Based Indexing
What is Phrase Based Indexing?
Background of Invention
Summary of Invention
Spam Detection
Introduction
An information retrieval system uses phrases to index, retrieve, organize and describe documents. The approach was described in a US patent application submitted by Google engineer Anna Lynn Patterson.
Application filed: July 2004
Published: January 2006
Background of Invention
Information retrieval systems, generally called search engines, are now an essential tool for finding information in large-scale, diverse, and growing corpora such as the Internet. A document is retrieved in response to a query containing a number of query terms, typically because some number of those terms are present in the document. The retrieved documents are then ranked according to other statistical measures, such as frequency of occurrence of the query terms, host domain, link analysis, and the like.
Background of Invention (cont.)
Concepts are often expressed in phrases, such as "Australian Shepherd," "President of the United States," or "Sundance Film Festival." Accordingly, there is a need for an information retrieval system and methodology that can identify phrases, index documents according to phrases, and search and rank documents in accordance with their phrases.
Summary
An information retrieval system and methodology uses phrases to index, search, rank, and describe documents in the document collection.
1. Identifying phrases and related phrases
2. Indexing documents with respect to phrases
3. Ranking documents with respect to phrases
4. Creating a description for the document
5. Eliminating duplicate documents
Identifying Phrases and Related Phrases
Phrases are identified based on their ability to predict the presence of other phrases in a document. The system looks for phrases that have frequent and/or distinguished (unique) usage.
A prediction measure is used to identify related phrases. It relates the actual co-occurrence rate of two phrases to their expected co-occurrence rate:
Information gain = actual co-occurrence rate / expected co-occurrence rate
Identifying Phrases and Related Phrases (cont.)
Two phrases are related to each other when the prediction measure exceeds the prediction threshold.
Example: the phrase “President of the United States” predicts related phrases such as “White House” and “George Bush”.
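The prediction measure can be illustrated with a minimal sketch. It assumes each document is already reduced to a set of phrases, that the expected co-occurrence rate is the product of the two phrases' individual document rates, and that the threshold is 1.5. The function name `related_phrases`, the independence assumption, and the threshold value are all illustrative choices, not details taken from the slides.

```python
from itertools import combinations


def related_phrases(docs, threshold=1.5):
    """Return {(phrase_a, phrase_b): information_gain} for pairs above threshold.

    docs is a list of sets of phrases, one set per document (assumed input shape).
    """
    n = len(docs)
    phrase_count = {}
    pair_count = {}
    for phrases in docs:
        for p in phrases:
            phrase_count[p] = phrase_count.get(p, 0) + 1
        # Count each unordered pair once per document.
        for a, b in combinations(sorted(phrases), 2):
            pair_count[(a, b)] = pair_count.get((a, b), 0) + 1

    related = {}
    for (a, b), both in pair_count.items():
        actual = both / n
        # Assumption: expected rate under independence of the two phrases.
        expected = (phrase_count[a] / n) * (phrase_count[b] / n)
        gain = actual / expected
        if gain > threshold:
            related[(a, b)] = gain
    return related


docs = [
    {"president of the united states", "white house"},
    {"president of the united states", "white house"},
    {"australian shepherd"},
    {"australian shepherd", "white house"},
    {"australian shepherd"},
]
print(related_phrases(docs))
```

In this toy collection only the pair "president of the united states" and "white house" co-occurs more often than independence would suggest, so it is the only pair reported as related.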
Indexing Documents Based on Related Phrases
The information retrieval system indexes documents in the document collection by their valid, or good, phrases. For each phrase, the index holds two lists:
Posting list = the documents that contain the phrase
Second list = data indicating which of the phrase's related phrases are also present in each document that contains the phrase
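A minimal sketch of such an index follows, assuming documents arrive as a mapping from document ID to a set of phrases and that related phrases are already known. The names `build_index`, `postings`, and `related_present` are hypothetical, and the dictionary layout is only one possible way to hold the two lists.

```python
def build_index(docs, related):
    """Build a phrase index with a posting list and a related-phrase list.

    docs: dict of doc_id -> set of phrases found in that document.
    related: dict of phrase -> ordered list of its related phrases.
    """
    index = {}
    for doc_id in sorted(docs):
        phrases = docs[doc_id]
        for phrase in phrases:
            entry = index.setdefault(
                phrase, {"postings": [], "related_present": {}}
            )
            # Posting list: documents that contain the phrase.
            entry["postings"].append(doc_id)
            # Second list: which related phrases also occur in this document.
            entry["related_present"][doc_id] = [
                r for r in related.get(phrase, []) if r in phrases
            ]
    return index


docs = {
    1: {"president of the united states", "white house"},
    2: {"president of the united states"},
}
related = {"president of the united states": ["white house", "george bush"]}
index = build_index(docs, related)
print(index["president of the united states"])
```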
Ranking
Ranking documents is based on two factors:
1. Ranking documents based on contained phrases
2. Ranking documents based on anchor phrases
Document Score = Body Hit Score + Anchor Hit Score
For example: Body Hit Score = 0.30, Anchor Hit Score = 0.70
Document Score = 0.30 + 0.70 = 1.00
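As a small illustration of the score formula, the sketch below sums the two hit scores per document and sorts by the total. The function name `rank_documents` and the second document's scores are made up for the example; how the individual hit scores are computed is outside what the slide describes.

```python
def rank_documents(hit_scores):
    """Rank documents by body hit score plus anchor hit score.

    hit_scores: dict of doc_id -> (body_hit_score, anchor_hit_score).
    Returns a list of (doc_id, document_score), highest score first.
    """
    totals = {
        doc_id: body + anchor for doc_id, (body, anchor) in hit_scores.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranked = rank_documents({"doc_a": (0.30, 0.70), "doc_b": (0.50, 0.10)})
for doc_id, score in ranked:
    print(doc_id, round(score, 2))
```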
Phrase Extension
The information retrieval system is also adapted to use phrases when searching for documents in response to a query. A user may enter an incomplete phrase in a search query, such as "President of the". Incomplete phrases such as these may be identified and replaced by a phrase extension, such as "President of the United States."
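One way to picture phrase extension is a prefix lookup against the known phrases. The sketch below assumes a table of phrase frequencies and picks the most frequent phrase that starts with the incomplete one. Both the selection rule and the name `extend_phrase` are assumptions for illustration, and the counts are invented.

```python
def extend_phrase(partial, phrase_counts):
    """Replace an incomplete phrase with the most frequent known extension.

    phrase_counts: dict of known phrase -> frequency (hypothetical data).
    Returns the partial phrase unchanged when no extension is known.
    """
    prefix = partial.strip().lower() + " "
    candidates = [p for p in phrase_counts if p.lower().startswith(prefix)]
    if not candidates:
        return partial
    # Assumption: prefer the extension seen most often in the collection.
    return max(candidates, key=lambda p: phrase_counts[p])


phrase_counts = {
    "President of the United States": 120,
    "President of the Senate": 15,
    "White House": 300,
}
print(extend_phrase("President of the", phrase_counts))
```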
Descriptions for Documents
Phrase information is used to create a description of a document. The system identifies the query phrases, related phrases, and phrase extensions present in each sentence and keeps a count for each sentence. It ranks the sentences by that count, selects some number of top-ranking sentences as the description, and includes it in the search results.
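The sentence-counting steps above can be sketched as follows. The sketch assumes a simple punctuation-based sentence splitter, case-insensitive substring matching, and a single combined list holding the query phrases, related phrases, and phrase extensions. The function name `describe` and the choice to output the selected sentences in their original order are illustrative.

```python
import re


def describe(text, phrases, top_n=2):
    """Build a description from the sentences that mention the most phrases.

    phrases: query phrases, related phrases, and extensions, as one list.
    """
    sentences = [
        s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()
    ]
    scored = []
    for position, sentence in enumerate(sentences):
        lowered = sentence.lower()
        count = sum(lowered.count(p.lower()) for p in phrases)
        scored.append((count, position, sentence))
    # Rank by count (highest first); ties keep document order.
    top = sorted(scored, key=lambda item: (-item[0], item[1]))[:top_n]
    # Present the selected sentences in their original order.
    return " ".join(s for _, _, s in sorted(top, key=lambda item: item[1]))


text = (
    "The President of the United States lives in the White House. "
    "Dogs are popular pets. "
    "The White House is in Washington."
)
print(describe(text, ["president of the united states", "white house"]))
```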
Eliminating Duplicate Documents
Duplicate documents are identified and eliminated while crawling or when processing a search query. A description is stored in association with every document in a hash table. For a newly crawled page, the system compares the page's description with the values stored in the hash table. A match indicates that the current document is a duplicate. The system keeps the document with the higher page rank or greater document significance and removes the other, which will not appear in future search results for any query.
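A minimal sketch of this duplicate check, assuming each page comes with its description and a page-rank value, and that descriptions are compared through a SHA-256 digest. The hash function, the tuple layout, and the name `deduplicate` are assumptions made for the example.

```python
import hashlib


def deduplicate(pages):
    """Keep one page per distinct description, preferring higher page rank.

    pages: iterable of (url, description, page_rank) tuples (assumed shape).
    Returns the sorted URLs of the pages that are kept.
    """
    kept = {}  # description hash -> (url, page_rank)
    for url, description, page_rank in pages:
        key = hashlib.sha256(description.encode("utf-8")).hexdigest()
        current = kept.get(key)
        # A matching hash marks a duplicate; keep the higher-ranked page.
        if current is None or page_rank > current[1]:
            kept[key] = (url, page_rank)
    return sorted(url for url, _ in kept.values())


pages = [
    ("a.example", "The White House is in Washington.", 0.4),
    ("b.example", "The White House is in Washington.", 0.9),
    ("c.example", "Dogs are popular pets.", 0.1),
]
print(deduplicate(pages))
```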
 
Functions of Indexing System
Identifies phrases in documents
Indexes documents according to their phrases by accessing various websites
Functions of Front End Server
Receives queries from a user
Provides those queries to the search system
Functions of Searching System
Searches for documents relevant to the search query
Identifies the phrases in the search query
Ranks the documents
Functions of Presentation System
Modifies the search results, including removing duplicate content
Generates topical descriptions of documents and provides the modified search results
Spam Detection
“Spam” pages have little meaningful content, but may instead be made up of large collections of popular words and phrases. These are sometimes referred to as “keyword stuffed pages”. Pages containing specific words and phrases that advertisers might be interested in are often called “honeypots,” and are created for search engines to display along with paid advertisements.
Spam Detection (cont.)
A phrase-based indexing system knows the number of related phrases in a document. A normal, non-spam document will generally have a relatively limited number of related phrases, typically on the order of 8 to 20, depending on the document collection. A spam document will have an excessive number of related phrases, for example on the order of 100 to 1000.
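The related-phrase count can be turned into a simple check, sketched below. It assumes the document's phrases and a related-phrase table are available, and it flags a document when the number of distinct related phrases present exceeds a cutoff. The cutoff of 100 is an assumed value chosen from the range quoted above, and `related_phrase_count` and `looks_like_spam` are hypothetical names.

```python
def related_phrase_count(doc_phrases, related):
    """Count distinct phrases in the document that are related to another
    phrase also present in the same document."""
    present = set(doc_phrases)
    found = set()
    for phrase in present:
        found.update(r for r in related.get(phrase, ()) if r in present)
    return len(found)


def looks_like_spam(doc_phrases, related, max_related=100):
    """Flag a document whose related-phrase count exceeds max_related.

    max_related=100 is an assumed cutoff, not a value fixed by the source.
    """
    return related_phrase_count(doc_phrases, related) > max_related


related = {"a": ["b", "c"], "b": ["a"]}
print(related_phrase_count(["a", "b", "c"], related))
print(looks_like_spam(["a", "b", "c"], related, max_related=2))
```

The tiny `max_related=2` in the usage lines only keeps the example small; a real collection would need a cutoff tuned to its own distribution of related-phrase counts.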
Advantages of Phrase Based Indexing
Detecting duplicate pages
Spam detection
Saves time
Other Patent Applications
Phrase identification in an information retrieval system
Phrase-based searching in an information retrieval system
Phrase-based generation of document descriptions
Detecting spam documents in a phrase based information retrieval system
Efficient Phrase Based Document Indexing for Document Clustering
According to data collected from users of European Web analytics provider OneStat, most people use 2- or 3-word queries in search engines:
Two-word phrases -- 28.38 percent
Three-word phrases -- 27.15 percent
Four-word phrases -- 16.42 percent
One-word phrases -- 13.48 percent
Five-word phrases -- 8.03 percent
Six-word phrases -- 3.67 percent
Seven-word phrases -- 1.63 percent
Eight-word phrases -- 0.73 percent
Nine-word phrases -- 0.34 percent
Ten-word phrases -- 0.16 percent
Thank you

