Inverted index
 Definition: an inverted file is a word-oriented
mechanism for indexing a text collection in
order to speed up the searching task.
 Structure of inverted file:
◦ Vocabulary: is the set of all distinct words in the
text
◦ Occurrences: lists containing all information
necessary for each word of the vocabulary (text
position, frequency, documents where the word
appears, etc.)
 An inverted file index is a list of the terms that appear in the
document collection (called a lexicon or vocabulary); for each
term in the lexicon, it stores a list of pointers to all
occurrences of that term in the document collection. This
list is called an inverted list.
 The granularity of an index determines how precisely the
location of each word is represented
◦ Coarse-grained index requires less storage and more
query processing to eliminate false matches
◦ Word-level index enables queries involving adjacency
and proximity, but has higher space requirements
Figure: the vocabulary (indexed terms, number of occurrences) and the posting file (occurrence lists). Note on the figure: this could be a tree-like structure.
 Text (the numbers give the position at which each word starts):
1 6 12 16 18 25 29 36 40 45 54 58 66 70
That house has a garden. The garden has many flowers. The flowers are beautiful
 Inverted file:
Vocabulary    Occurrences
beautiful     70
flowers       45, 58
garden        18, 29
house         6
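The word-level index above could be built along the following lines. This is a minimal sketch, not the method used to produce the slide: the function name `build_word_index` and the stopword list are assumptions made for illustration, and the offsets are 1-based character offsets computed by the code, which may differ slightly from the figure depending on how punctuation is counted.

```python
import re
from collections import defaultdict


def build_word_index(text, stopwords=frozenset()):
    """Map each non-stopword to the 1-based character offsets where it starts."""
    index = defaultdict(list)
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group().lower()
        if word not in stopwords:
            index[word].append(match.start() + 1)  # 1-based offsets
    return dict(sorted(index.items()))  # vocabulary in lexicographical order


text = ("That house has a garden. The garden has many flowers. "
        "The flowers are beautiful")
# Hypothetical stopword list, chosen so the vocabulary matches the example.
stopwords = frozenset({"that", "has", "a", "the", "many", "are"})

for word, positions in build_word_index(text, stopwords).items():
    print(word, positions)
```

Because every occurrence keeps its own position, adjacency and proximity queries can be answered from the index alone.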
 The prior example allows for Boolean queries.
 Ranked retrieval additionally needs the document
frequency and the term frequency.
Vocabulary entry    Posting file entry
k  d_k              doc_1 f_1k  doc_2 f_2k  …
d_k : document frequency of term k (number of documents that contain term k)
doc_i : i-th document that contains term k
f_ik : term frequency of term k in document i
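The entry layout above can be sketched as a small data structure. The three-document collection and the name `build_postings` are hypothetical; each term maps to its document frequency d_k followed by (doc_i, f_ik) pairs.

```python
from collections import Counter, defaultdict


def build_postings(docs):
    """Return {term: (d_k, [(doc_id, f_ik), ...])} for a dict of doc_id -> text."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        counts = Counter(docs[doc_id].lower().split())
        for term, freq in counts.items():
            postings[term].append((doc_id, freq))
    # d_k is the number of documents containing the term, i.e. the list length.
    return {term: (len(plist), plist) for term, plist in sorted(postings.items())}


# Hypothetical three-document collection.
docs = {
    1: "the garden has flowers",
    2: "the flowers are beautiful flowers",
    3: "that house has a garden",
}
index = build_postings(docs)
print(index["flowers"])
print(index["garden"])
```

Storing d_k next to the term means it can be read without scanning the posting list.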
 The space required for the vocabulary is rather
small. According to Heaps’ law the vocabulary
grows as O(n^β), where β is a constant between
0.4 and 0.6 in practice
◦ TREC-2: 1 GB text, 5 MB lexicon
 On the other hand, the occurrences demand
much more space. Since each word appearing
in the text is referenced once in that structure,
the extra space is O(n)
 To reduce space requirements, a technique
called block addressing is used
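To see why the vocabulary is the small part, the two growth rates on this slide can be compared numerically. The sketch below assumes the common form V = k · n^β of Heaps' law; the constant `k` and the chosen `beta` are illustrative values, not measurements.

```python
def heaps_vocabulary(n, k, beta):
    """Estimate the number of distinct words in a text of n words: V = k * n**beta."""
    return k * n ** beta


# k and beta are illustrative assumptions; beta is taken from the 0.4-0.6 range.
K, BETA = 10.0, 0.5

for n in (10**6, 10**8, 10**10):
    v = heaps_vocabulary(n, K, BETA)
    # The occurrences grow as O(n), so V/n shows how small the vocabulary becomes.
    print(f"n = {n:>14,}  vocabulary ~ {v:>12,.0f}  V/n = {v / n:.6f}")
```

With these values the ratio V/n shrinks as the text grows, while the occurrence lists keep growing linearly.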
 The text is divided into blocks
 The occurrences point to the blocks where the
word appears
 Advantages:
◦ there are fewer blocks than positions, so the pointers are smaller
◦ all the occurrences of a word inside a single block
are collapsed to one reference
 Disadvantages:
◦ an online search over the qualifying blocks is needed if exact
positions are required
 Text (divided into four blocks):
Block 1 Block 2 Block 3 Block 4
That house has a garden. The garden has many flowers. The flowers are beautiful
 Inverted file:
Vocabulary    Occurrences
beautiful     4
flowers       3
garden        2
house         1
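Block addressing can be sketched as follows. The block boundaries are an assumption, chosen so that the resulting occurrence lists agree with the example; the function names are hypothetical, and no stopwords are removed here. The second function shows the price paid: exact positions require scanning the qualifying blocks.

```python
import re
from collections import defaultdict


def build_block_index(blocks):
    """Map each word to the sorted list of 1-based block numbers containing it."""
    index = defaultdict(set)
    for number, block in enumerate(blocks, start=1):
        for word in re.findall(r"[a-z]+", block.lower()):
            index[word].add(number)  # repeated words in a block collapse to one entry
    return {word: sorted(nums) for word, nums in sorted(index.items())}


def find_positions(blocks, index, word):
    """Resolve exact positions by scanning only the qualifying blocks."""
    hits = []
    for number in index.get(word, []):
        block = blocks[number - 1].lower()
        for match in re.finditer(r"\b%s\b" % re.escape(word), block):
            hits.append((number, match.start()))  # (block, offset inside the block)
    return hits


# Assumed block boundaries, chosen to agree with the occurrence lists above.
blocks = ["That house has a ", "garden. The garden has ",
          "many flowers. The flowers ", "are beautiful"]
index = build_block_index(blocks)
print(index["garden"])
print(find_positions(blocks, index, "garden"))
```

Both occurrences of "garden" fall in the same block, so the index stores a single reference for them.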
 How big are inverted files?
◦ Index size as a percentage of the original collection size
 In each pair, the left value is for an index without stopwords and the right value for an index that includes stopwords
 Block addressing requires the text to be available in order to
locate terms within blocks.

Index                    Small collection   Medium collection   Large collection
                         (1 MB)             (200 MB)            (2 GB)
Addressing words         45%    73%         36%    64%          35%    63%
Addressing 64K blocks    27%    41%         18%    32%          5%     9%
Addressing 256 blocks    18%    25%         1.7%   2.4%         0.5%   0.7%
 The search algorithm on an inverted
index follows three steps:
1. Vocabulary search: the words present in
the query are located in the vocabulary
2. Retrieval of occurrences: the lists of the
occurrences of all query words found are
retrieved
3. Manipulation of occurrences: the
occurrences are processed to solve the
query
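The three steps can be sketched for a conjunctive (AND) query. This assumes a hypothetical document-level index that maps each word to a sorted list of document ids; other query types would change only the third step.

```python
def intersect(a, b):
    """Intersect two sorted lists in O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def search_and(index, query):
    """Answer an AND query over {word: sorted list of document ids}."""
    # Step 1: vocabulary search.
    words = query.lower().split()
    if not words or any(word not in index for word in words):
        return []
    # Step 2: retrieval of occurrences, shortest list first to keep the work small.
    lists = sorted((index[word] for word in words), key=len)
    # Step 3: manipulation of occurrences, here a merge-style intersection.
    result = lists[0]
    for other in lists[1:]:
        result = intersect(result, other)
    return result


# Hypothetical document-level index.
index = {"flowers": [1, 2, 5], "garden": [1, 3, 5, 8], "house": [3]}
print(search_and(index, "garden flowers"))
```

Starting from the shortest list is a common heuristic, since the intermediate result can never grow beyond it.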
 Searching inverted files starts with the vocabulary
◦ store the vocabulary in a separate file
 Structures used to store the vocabulary
include
◦ Hashing: O(1) lookup, does not support range
queries
◦ Tries: O(c) lookup, c = length of the word
◦ B-trees: O(log v) lookup, v = size of the vocabulary
 An alternative is simply storing the words in
lexicographical order
◦ cheaper in space and, with binary search, very competitive at O(log
v) cost
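The sorted-array alternative can be sketched with Python's standard `bisect` module. The function names are hypothetical. Unlike a hash table, the sorted order also supports prefix and range queries, because matching words are contiguous.

```python
from bisect import bisect_left


def lookup(vocabulary, word):
    """Binary search in a sorted word list: O(log v). Returns the index or -1."""
    i = bisect_left(vocabulary, word)
    return i if i < len(vocabulary) and vocabulary[i] == word else -1


def prefix_range(vocabulary, prefix):
    """All words starting with prefix; they form one contiguous run."""
    lo = bisect_left(vocabulary, prefix)
    hi = lo
    while hi < len(vocabulary) and vocabulary[hi].startswith(prefix):
        hi += 1
    return vocabulary[lo:hi]


vocabulary = ["beautiful", "flowers", "garden", "house"]  # already sorted
print(lookup(vocabulary, "garden"))
print(lookup(vocabulary, "tree"))
print(prefix_range(vocabulary, "f"))
```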
 The whole vocabulary is kept in a suitable data
structure that stores, for each word, a list of
its occurrences
 Each word of each text in the corpus is
read and searched for in the vocabulary
 If it is not found, it is added to the
vocabulary with an empty list of occurrences
 In either case, the new position is added to the end of
the word's list of occurrences
 Once the text is exhausted, the vocabulary is
written to disk with the lists of occurrences.
 Two files are created:
◦ in the first file, each list of word occurrences is
stored contiguously
◦ in the second file, the vocabulary is stored in
lexicographical order and, for each word, a pointer
to its list in the first file is also included. This allows
the vocabulary to be kept in memory at search time
 The overall process is O(n) worst-case time
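The construction and the two-file layout can be sketched as below. To stay self-contained, the two "files" are returned as in-memory values; the 4-byte little-endian encoding, the word-number positions, and the function names are assumptions made for illustration.

```python
import struct
from collections import defaultdict


def build_in_memory(words):
    """Single pass: append each word's position to its occurrence list."""
    index = defaultdict(list)  # a word not seen before starts with an empty list
    for position, word in enumerate(words, start=1):
        index[word].append(position)
    return index


def serialize(index):
    """Return (occurrences_bytes, vocabulary) mimicking the two files.

    Each vocabulary entry is (word, byte_offset, count) and points at a
    contiguous run of 4-byte little-endian integers in occurrences_bytes.
    """
    occurrences = bytearray()
    vocabulary = []
    for word in sorted(index):  # lexicographical order
        positions = index[word]
        vocabulary.append((word, len(occurrences), len(positions)))
        occurrences += struct.pack(f"<{len(positions)}I", *positions)
    return bytes(occurrences), vocabulary


def read_occurrences(occurrences, entry):
    """Follow a vocabulary pointer back into the occurrence file."""
    _, offset, count = entry
    return list(struct.unpack_from(f"<{count}I", occurrences, offset))


words = "that house has a garden the garden has many flowers".split()
occ_file, vocab = serialize(build_in_memory(words))
print(vocab)
print(read_occurrences(occ_file, vocab[2]))
```

Only the small vocabulary needs to stay in memory at search time; each entry says where to seek in the occurrence file and how much to read.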
 If the index does not fit in main memory, an option is
to use the previous algorithm until
the main memory is exhausted. When no
more memory is available, the partial index I_i
obtained up to now is written to disk and
erased from main memory before continuing
with the rest of the text
 Once the text is exhausted, a number of
partial indices I_i exist on disk
 The partial indices are merged to obtain the
final index
Figure: merging the partial indices in a binary fashion (the numbers 1 to 7 give a possible merging order)
initial dumps:          I 1   I 2   I 3   I 4   I 5   I 6   I 7   I 8
level 1:                I 1...2 (merge 1)   I 3...4 (merge 2)   I 5...6 (merge 4)   I 7...8 (merge 5)
level 2:                I 1...4 (merge 3)   I 5...8 (merge 6)
level 3 (final index):  I 1...8 (merge 7)
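The dump-and-merge scheme can be sketched as follows. Everything stays in memory here, so `memory_limit` (a count of postings) is only a stand-in for real memory exhaustion, and the function names are hypothetical. A real implementation would stream the partial indices from disk while merging.

```python
from collections import defaultdict


def partial_indices(words, memory_limit):
    """Yield partial indices, dumping whenever memory_limit postings are held."""
    index, held = defaultdict(list), 0
    for position, word in enumerate(words, start=1):
        index[word].append(position)
        held += 1
        if held == memory_limit:  # stand-in for "main memory is exhausted"
            yield dict(index)
            index, held = defaultdict(list), 0
    if held:
        yield dict(index)


def merge_two(left, right):
    """Merge two partial indices; left covers earlier text than right."""
    merged = {word: list(positions) for word, positions in left.items()}
    for word, positions in right.items():
        merged.setdefault(word, []).extend(positions)  # lists stay sorted
    return merged


def merge_all(indices):
    """Merge level by level, pairing neighbours, until one index remains."""
    level, levels = list(indices), 0
    while len(level) > 1:
        merged = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                merged.append(merge_two(level[i], level[i + 1]))
            else:
                merged.append(level[i])  # odd one out moves up unchanged
        level = merged
        levels += 1
    return (level[0] if level else {}), levels


words = ("that house has a garden the garden has many flowers "
         "the flowers are beautiful").split()
final, levels = merge_all(partial_indices(words, memory_limit=2))
print(levels, final["garden"], final["flowers"])
```

This sketch merges one whole level at a time, which is a different order from the one numbered in the figure but produces the same final index.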
 The total time to generate the partial indices is
O(n), where n is the size of the text
 The number of partial indices is O(n/M), where M is the size of main memory
 Merging the O(n/M) partial indices requires
log2(n/M) merging levels, each taking O(n) time
 The total cost of this algorithm is O(n log(n/M))
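As a worked example of the cost formula, the number of merging levels can be computed directly. The sizes below are hypothetical, and `merge_levels` is an illustrative helper that assumes binary merging.

```python
def merge_levels(n, m):
    """Number of binary merging levels for text size n and memory size m."""
    partials = -(-n // m)                # ceil(n / m) partial indices
    return (partials - 1).bit_length()   # ceil(log2(partials)), 0 if only one


# Hypothetical sizes: 8 GB of text and 1 GB of main memory.
n, m = 8 * 2**30, 2**30
levels = merge_levels(n, m)
print(levels)        # number of merging levels
print(n * levels)    # work proportional to n at each level
```

With these sizes there are eight partial indices, the same count as in the merge figure, so three merging levels are needed.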
 Inverted files are used to index text
 The indices are appropriate when the
text collection is large and semi-static
 If the text collection is volatile, online
searching is the only option
 Some techniques combine online and
indexed searching
 Vocabulary List
◦ Text preprocessing modules
 lexical analysis, stemming, stopwords
 Occurrences of Vocabulary Terms
◦ Inverted index creation
 term frequency in documents, document frequency
 Retrieval and Ranking Algorithm
 Query and Ranking Interfaces
 Browsing/Visualization Interface
