Boolean retrieval
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). The field also covers supporting users in browsing or filtering document collections and further processing a set of retrieved documents. For example, given a set of documents, clustering is the task of coming up with a good grouping of the documents based on their contents, much like arranging books on a bookshelf by topic. Given a set of topics, standing information needs, or other categories, classification is the task of deciding which class, if any, each document belongs to.
Information retrieval systems can also be distinguished by the scale at which they operate. Three scales are prominent:
1. Web search: the system has to provide search over billions of documents stored on millions of computers.
2. Personal information retrieval.
3. Enterprise, institutional, and domain-specific search: for collections such as a corporation's internal documents, a database of patents, or research articles on biochemistry.
Retrieval by linear scan
The simplest form of retrieval is a linear scan through the documents, often called grepping after the Unix command grep. With modern computers, for simple querying of modest collections, you really need nothing more. But for many purposes you do need more:
1. To process large document collections quickly
2. To allow more flexible matching operations
3. To allow ranked retrieval
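As a rough illustration of grepping, the sketch below scans every document for a query term each time it is called. The function name `linear_scan` and the three sample texts are made up for this example and do not come from the slides.

```python
def linear_scan(documents, term):
    """Return the names of documents whose text contains `term` (case-insensitive).

    Every call reads every document in full, so the cost grows with the
    total size of the collection.
    """
    needle = term.lower()
    return [name for name, text in documents.items() if needle in text.lower()]


# Hypothetical toy collection, for illustration only.
documents = {
    "doc1": "Brutus and Caesar met in the Capitol.",
    "doc2": "Calpurnia warned Caesar about the Ides of March.",
    "doc3": "The tempest raged all night.",
}

print(linear_scan(documents, "caesar"))
```

This is fine for a handful of files, but each query repeats the full scan, which is what the three requirements above are reacting to.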
The way to avoid linearly scanning the texts for each query is to index the documents in advance.
Term-document incidence matrix
To answer the query Brutus AND Caesar AND NOT Calpurnia, we take the vectors for Brutus, Caesar, and Calpurnia, complement the last, and then do a bitwise AND:
110100 AND 110111 AND 101111 = 100100
The answers for this query are thus Antony and Cleopatra and Hamlet.
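The bitwise computation above can be sketched with Python integers used as bit vectors. The three vectors are the ones given in the text (the Calpurnia vector is the complement of 101111). The column order in `PLAYS` is an assumption: it follows the usual layout of this Shakespeare example and is consistent with the stated answer, but the matrix itself is not reproduced here.

```python
# Assumed column order of the incidence matrix (leftmost bit = first play).
PLAYS = [
    "Antony and Cleopatra",
    "Julius Caesar",
    "The Tempest",
    "Hamlet",
    "Othello",
    "Macbeth",
]

# One bit per play: 1 means the term occurs in that play.
INCIDENCE = {
    "Brutus": 0b110100,
    "Caesar": 0b110111,
    "Calpurnia": 0b010000,
}

N = len(PLAYS)
MASK = (1 << N) - 1  # keeps the complement within N bits


def complement(vector):
    """Flip all N bits of an incidence vector."""
    return ~vector & MASK


def matching_plays(vector):
    """Translate a result vector back into play titles."""
    return [play for i, play in enumerate(PLAYS) if vector & (1 << (N - 1 - i))]


result = INCIDENCE["Brutus"] & INCIDENCE["Caesar"] & complement(INCIDENCE["Calpurnia"])
print(format(result, "06b"))   # 100100
print(matching_plays(result))  # ['Antony and Cleopatra', 'Hamlet']
```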
Two key statistics describe how effective a retrieval system is:
Precision: What fraction of the returned results are relevant to the information need?
Recall: What fraction of the relevant documents in the collection were returned by the system?
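Both statistics reduce to set arithmetic over document IDs. The sketch below is one minimal way to compute them; the function name `precision_recall` and the sample ID lists are hypothetical, and returning 0.0 for an empty set is a convention chosen here, not something the slides specify.

```python
def precision_recall(returned, relevant):
    """Compute (precision, recall) from two collections of document IDs."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)  # relevant documents that were returned
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# The system returned 4 documents; 5 documents in the collection are relevant.
precision, recall = precision_recall(returned=[1, 2, 3, 4], relevant=[2, 4, 6, 8, 10])
print(precision, recall)  # 0.5 0.4
```

Two of the four returned documents are relevant (precision 0.5), and two of the five relevant documents were found (recall 0.4).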
For a larger collection, we cannot build a term-document matrix in the naive way. A matrix of 500K terms by 1M documents has half a trillion 0's and 1's, too many to fit in a computer's memory. A much better representation is to record only the things that do occur, that is, the 1 positions. This idea is the basis of the inverted index.
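The size argument is quick to check. The matrix dimensions come from the text; the figure of about 1,000 tokens per document is an assumption added here only to show how sparse such a matrix would be.

```python
terms = 500_000
documents = 1_000_000

cells = terms * documents            # one 0/1 entry per (term, document) pair
dense_gigabytes = cells / 8 / 1e9    # even at one bit per cell

# Assumption (not from the slides): roughly 1,000 tokens per document.
tokens_per_document = 1_000
max_ones = documents * tokens_per_document  # upper bound on the number of 1 entries
max_fill = max_ones / cells                 # largest possible fraction of 1s

print(cells)            # 500000000000
print(dense_gigabytes)  # 62.5
print(max_fill)         # 0.002
```

Under that assumption at least 99.8% of the cells are 0, which is why recording only the 1 positions pays off.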
Dictionary: we use dictionary for the data structure and vocabulary for the set of terms. The dictionary is commonly kept in memory.
Postings: for each term, we have a list that records which documents the term occurs in. Each item in the list, which records that a term appeared in a document (and, later, often the positions in the document), is called a posting. The postings lists are stored on disk.
To gain the speed benefits of indexing at retrieval time, we have to build the index in advance:
1. Collect the documents to be indexed.
2. Tokenize the text, turning each document into a list of tokens.
3. Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms.
4. Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.
・ The sequence of (term, documentID) pairs is sorted alphabetically by term
・ Instances of the same term are then grouped by term and then by documentID
・ The terms and documentIDs are then separated out
・ The dictionary stores the terms and has a pointer to the postings list for each term
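The index construction steps can be sketched in a few lines. This is a simplified in-memory version under several assumptions: tokenization is just lowercasing plus a regular expression, a Python dict stands in for the dictionary, and plain lists stand in for on-disk postings. The names `tokenize` and `build_inverted_index` and the two sample documents are illustrative.

```python
import re


def tokenize(text):
    """Very rough tokenization and normalization: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to a sorted list of the document IDs it occurs in."""
    # Produce (term, docID) pairs for every token in every document.
    pairs = [(term, doc_id) for doc_id, text in docs.items() for term in tokenize(text)]
    # Sort by term, then by docID.
    pairs.sort()
    # Group by term, merging repeated occurrences within the same document.
    index = {}
    for term, doc_id in pairs:
        postings = index.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    return index


docs = {
    1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.",
}
index = build_inverted_index(docs)
print(index["brutus"])  # [1, 2]
print(index["killed"])  # [1]
```

The length of each postings list is the term's document frequency, which is cheap to keep alongside the term in the dictionary.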
Processing Boolean queries
Consider the simple conjunctive query Brutus AND Calpurnia:
1. Locate Brutus in the dictionary
2. Retrieve its postings
3. Locate Calpurnia in the dictionary
4. Retrieve its postings
5. Intersect the two postings lists
The intersection is the crucial step: we need to intersect postings lists efficiently so that we can quickly find the documents that contain both terms.
Algorithm for the intersection of two postings lists p1 and p2.
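Since the slide's figure is not reproduced here, the following is a sketch of the standard merge-style intersection, assuming both postings lists are sorted by document ID. The parameter names p1 and p2 follow the caption; the two sample lists are illustrative.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by document ID.

    Walks both lists once, so the work is proportional to
    len(p1) + len(p2) rather than len(p1) * len(p2).
    """
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # document contains both terms
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                # advance the pointer at the smaller docID
        else:
            j += 1
    return answer


brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # [2, 31]
```

The algorithm depends on the postings being kept in a single global order, which is why the lists are sorted by document ID when the index is built.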
For complex queries
Four additional things we would like to be able to do:
1. We would like to better determine the set of terms in the dictionary and to provide retrieval that is tolerant to spelling mistakes and inconsistent choice of words.
2. It is often useful to search for phrases such as "operating system" or to run proximity queries such as Gates NEAR Microsoft. To answer such queries, the index has to be augmented to capture the proximities of terms in documents.
3. To be able to accumulate evidence, we need term frequency information in postings lists.
4. Boolean queries just retrieve a set of matching documents, but commonly we wish to have an effective method to order (or "rank") the returned results. This requires a mechanism for determining a document score that encapsulates how good a match a document is for a query.
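To make the last two points concrete, here is a deliberately naive sketch in which each posting carries a term frequency and a document's score is the sum of the frequencies of the query terms it contains. This scoring rule is an assumption chosen for brevity, not a method taken from the slides, and the names `build_tf_index` and `rank` and the sample documents are hypothetical.

```python
from collections import Counter


def build_tf_index(docs):
    """Map each term to {docID: term frequency in that document}."""
    index = {}
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index.setdefault(term, {})[doc_id] = tf
    return index


def rank(index, query):
    """Score documents by summed term frequency; highest score first."""
    scores = Counter()
    for term in query.lower().split():
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    # Ties are broken by ascending document ID to keep the order deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


docs = {
    1: "caesar praised brutus",
    2: "caesar caesar caesar",
    3: "brutus stabbed caesar and brutus fled",
}
tf_index = build_tf_index(docs)
print(rank(tf_index, "brutus caesar"))  # [(2, 3), (3, 3), (1, 2)]
```

Unlike a Boolean AND, document 2 is returned even though it never mentions Brutus; it simply accumulates evidence from one term only.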
