Chapter 3
12/10/2024
Indexing structure
Outline
◼ Major Steps in Index Construction
◼ Index file Evaluation Metrics
◼ Building Index file
◼ Sequential File
◼ Inverted file
◼ Suffix tree
◼ Suffix Trie
◼ Suffix Tree Applications

Indexing: Basic Concepts
◼ Indexing is an arrangement of index terms that permits fast searching and reduces memory space requirements.
◼ It is used to speed up access to desired information in a document collection, as per the user's query, such that:
  ◼ It enhances efficiency in terms of retrieval time.
  ◼ Relevant documents are searched and retrieved quickly.
◼ An index file usually has its index terms in sorted order.
◼ Which list is easier to search?
fox pig zebra hen ant cat dog lion ox
ant cat dog fox hen lion ox pig zebra
◼ An index file consists of records, called index entries.
◼ Index files are much smaller than the original file.
◼ Remember Heaps' Law: in a 1 GB text collection the vocabulary has a size of only about 5 MB. This size may be further reduced by linguistic pre-processing (or text operations).
◼ The usual unit for indexing is the word.
◼ Index terms are used to look up records in a file.

Major Steps in Index Construction
◼ Source file: a collection of text documents.
◼ A document can be described by a set of representative keywords called index terms.
◼ Index term selection: apply text operations (preprocessing).
◼ Tokenize: identify the words in a document, so that each document is represented by a list of keywords or attributes.
◼ Stop word removal: very high-frequency words are non-content-bearing and need to be removed from the text collection.

Major Steps in Index Construction …
◼ Word stemming: reduce words with similar meaning (morphological variants) to their stem/root word.
◼ Term relevance weight: different index terms have varying relevance when used to describe document contents. This effect is captured by assigning a numerical weight to each index term of a document. There are different term weighting methods, including TF, IDF, TF*IDF, …
◼ Indexing structure: the set of index terms (vocabulary) is organized in an index file so that the documents in which each term occurs can be identified easily.

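The preprocessing steps above (tokenize, remove stop words, stem, count term frequency) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop list and the suffix rules are hypothetical and tuned to one example sentence from this chapter; a real system would use a full stop list and a proper stemmer such as Porter's.

```python
import re
from collections import Counter

# Assumed stop list, chosen only to fit the example sentence below.
STOP_WORDS = {"can", "it", "to", "even", "so"}

# Toy suffix rules (suffix, replacement); a stand-in for a real stemmer.
SUFFIX_RULES = [("ier", "y"), ("er", ""), ("s", "")]

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def toy_stem(word):
    """Apply the first matching suffix rule that leaves a stem of 3+ letters."""
    for suffix, replacement in SUFFIX_RULES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            return stem + replacement
    return word

def index_terms(text):
    """Tokenize, drop stop words, then stem what is left."""
    return [toy_stem(t) for t in tokenize(text) if t not in STOP_WORDS]

doc = "Negative affect can make it harder to do even easy tasks. so make it easy"
terms = index_terms(doc)
term_frequency = Counter(terms)  # raw TF, the simplest term weight
print(terms)
print(term_frequency)
```

The raw counts in `term_frequency` are the TF values; IDF and TF*IDF weights would additionally need document frequencies from the whole collection.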
Basic Indexing Process
[Figure: Documents to be indexed ("Friends, Romans, countrymen.") → Tokenizer → token stream (Friends, Romans, countrymen) → Linguistic preprocessor → modified tokens (friend, roman, countryman) → Indexer → inverted file: friend → 2, 4; roman → 1, 2; countryman → 13, 16]

Index file Evaluation Metrics
◼ Running time of the main operations
  ◼ Access/search time
    ◼ How long does it take to find the required search key in the list?
  ◼ Update time (insertion time, deletion time)
    ◼ How long does it take to update existing records when adding new terms or deleting unnecessary ones?
    ◼ Does the indexing structure allow incremental updates, or does it require re-indexing?
◼ Space overhead
  ◼ Computer storage space consumed for keeping the list.

Building Index file
◼ An index file for a document collection is a file consisting of a list of index terms, each with a link to the one or more documents that contain the term.
◼ An index file is a list of search terms organized for associative look-up, i.e., to answer the user's queries:
  ◼ In which documents does a specified search term appear?
  ◼ Where within each document does each term appear? (There may be several occurrences.)
◼ There are various options for organizing the index file for a collection of documents:
  ◼ Decide what data structure and/or file structure to use: a sequential file, an inverted file, a suffix tree, etc.

Sequential File
◼ The sequential file is the most primitive file structure.
◼ It has no vocabulary and no linking pointers.
◼ The records are generally arranged serially, one after another, but in lexicographic order on the value of some key field, i.e.:
  ◼ a particular attribute is chosen as the primary key, whose value determines the order of the records.
  ◼ when the first key fails to discriminate among records, a second key is chosen to give an order.

Example:
◼ Given a collection of documents, they are parsed to extract words, and these are saved with the document ID.
Doc 1: Negative affect can make it harder to do even easy tasks. so make it easy
Doc 2: positive affect can make it easier to do difficult tasks
◼ After all documents have been tokenized, stop words are removed, and normalization and stemming are applied to generate index terms.
◼ These index terms are sorted in alphabetical order in the sequential file.
[Figure: Sorting the vocabulary; the resulting sequential file]

Sequential File
◼ To access records, search serially:
  ◼ starting at the first record, read and inspect each succeeding record until the required record is found or the end of the file is reached.
◼ Update options: does the index need to be rebuilt, or is incremental update supported?

Sequential File …
◼ Its main advantages:
  ◼ easy to implement;
  ◼ provides fast access to the next record using lexicographic order;
  ◼ can be searched quickly using binary search, O(log n), when the sorted records can be addressed by position.
◼ Its disadvantages:
  ◼ No weights are attached to terms.
  ◼ Random access is slow: since similar terms are indexed individually, we need to find all terms that match the query.

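As a rough illustration of the two access methods, the sketch below keeps the sequential file as a sorted Python list of (term, document ID) records built from the Doc 1 / Doc 2 example. The record layout and the function names are assumptions made for this sketch, not part of the slides.

```python
from bisect import bisect_left

# Sequential file: one record per (term, doc_id), sorted by term, then doc_id.
records = sorted([
    ("negative", 1), ("affect", 1), ("make", 1), ("hard", 1), ("do", 1),
    ("easy", 1), ("task", 1),
    ("positive", 2), ("affect", 2), ("make", 2), ("easy", 2), ("do", 2),
    ("difficult", 2), ("task", 2),
])

def serial_search(term):
    """Read records from the start; stop once the sorted keys pass the term."""
    hits = []
    for key, doc_id in records:
        if key == term:
            hits.append(doc_id)
        elif key > term:
            break
    return hits

def binary_search(term):
    """Locate the first matching record in O(log n), then read its neighbours."""
    i = bisect_left(records, (term,))  # (term,) sorts before any (term, doc_id)
    hits = []
    while i < len(records) and records[i][0] == term:
        hits.append(records[i][1])
        i += 1
    return hits

print(serial_search("make"), binary_search("make"))
```

Both functions return the same document IDs; the difference is only in how many records are inspected on the way.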
Inverted file
◼ A word-oriented indexing mechanism based on a sorted list of keywords, with each keyword having links to the documents containing it.
◼ Building and maintaining an inverted index is a relatively low-cost task. On a text of n words, an inverted index can be built in O(n) time.
◼ The list is inverted from a list of terms in location order to a list of terms in alphabetical order.
[Figure: Original documents (document IDs) → word extraction (word IDs) → inverted files]
• W1: d1, d2, d3
• W2: d2, d4, d7, d9
• …
• Wn: di, …, dn

Inverted file
Data to be held in the inverted file includes:
◼ The vocabulary (list of terms):
  ◼ the set of all distinct words (index terms) in the text collection.
  ◼ having information about the vocabulary speeds up searching for relevant documents.
◼ For each term, the inverted file contains information related to:
  ◼ Location: all the text locations/positions where the word occurs.
  ◼ Frequency: the number of occurrences of the term in the document collection.

Enhancements to Inverted Files: Concept
Location: each posting holds information about the location of each term within the document.
Uses:
◼ user interface design: highlight the location of the search term
◼ adjacency and near operators (in Boolean searching)
Frequency: each inverted list includes the number of postings for each term.
Uses:
◼ term weighting
◼ query processing optimization

Inverted file
◼ Having information about the location of each term within the document helps with:
  ◼ user interface design: highlighting the location of the search term
  ◼ proximity-based ranking: adjacency and near operators (in Boolean searching)
◼ Having information about frequency is used for:
  ◼ calculating term weights (such as TF, TF*IDF, …)
  ◼ optimizing query processing

Inverted File
Term    CF  Doc ID  TF  Location
term 1  3   2       1   66
            19      1   213
            29      1   45
term 2  4   3       1   94
            19      2   7, 212
            22      1   56
term 3  1   5       1   43
term 4  3   11      2   3, 70
            34      1   40

◼ Documents are organized by the terms/words they contain; this is called an index file.
◼ Text operations are performed before building the index.
◼ CF: the total frequency of term tj in the corpus.
◼ Is it possible to keep all this information during searching?

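Stored locations are what make adjacency and near operators possible. The sketch below is a minimal illustration; the function names and the window parameter `k` are assumptions, and the positions are those of term 1 and term 2 in document 19 of the sample table.

```python
def near(positions_a, positions_b, k=1):
    """True if some occurrence of term A lies within k positions of term B."""
    return any(abs(a - b) <= k for a in positions_a for b in positions_b)

def adjacent(positions_a, positions_b):
    """True if some occurrence of term A is immediately followed by term B."""
    following = set(positions_b)
    return any(a + 1 in following for a in positions_a)

# Positions inside document 19, taken from the table above.
term1_doc19 = [213]
term2_doc19 = [7, 212]

print(adjacent(term2_doc19, term1_doc19))  # term 2 directly before term 1?
print(near(term1_doc19, term2_doc19, k=1))
```

Without the location column, only document-level Boolean queries (term present or absent) could be answered.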
Construction of Inverted file
An inverted index consists of two files: the vocabulary file and the posting file.
◼ A vocabulary file (word list):
  ◼ stores all of the distinct terms (keywords) that appear in any of the documents, in lexicographical order (i.e., like a dictionary), and
  ◼ for each word, a pointer to the posting file.
◼ The record kept for each term j in the vocabulary (word list) contains the following:
  ◼ term j
  ◼ number of documents in which term j occurs (DFj)
  ◼ collection frequency of term j (CFj)
  ◼ pointer to the inverted (postings) list for term j

Postings File (Inverted List)
◼ For each distinct term in the vocabulary, the posting file stores a list of pointers to the documents that contain that term.
◼ Each element in an inverted list is called a posting, i.e., the occurrence of a term in a document.
◼ Each list consists of one or many individual postings.
Advantage of dividing the inverted file into vocabulary and postings:
◼ Keeping a pointer in the vocabulary to the list in the posting file allows:
  ◼ the vocabulary to be kept in memory at search time even for a large text collection, while the posting file is kept on disk for accessing the pointers to documents.

General structure of Inverted File
◼ The following figure shows the general structure of an inverted index file.

Organization of Index File
Vocabulary (word list):
Term    DF  CF  Pointer to posting
term 1  3   3   →
term 2  3   4   →
term 3  1   1   →
term 4  2   3   →
[Figure: each vocabulary entry points to its inverted list in the postings file, and each posting points to a document]

Example:
◼ Given a collection of documents, they are parsed to extract words, and these are saved with the document ID.
Doc 1: Negative affect can make it harder to do even easy tasks. so make it easy
Doc 2: positive affect can make it easier to do difficult tasks
◼ After all documents have been tokenized, the inverted file is sorted by terms.
◼ Steps:
  ◼ Extract the terms in each document
  ◼ Sort the terms
  ◼ Compile the terms, i.e., collect the frequencies for each term
[Figure: Sorting the vocabulary]
◼ Multiple term entries in a single document are merged and frequency information is added.
[Figure: Remove stop words and compute frequency]
[Figure: Stemming and compute frequency]

Vocabulary and postings file
The file is commonly split into a dictionary (vocabulary) file and a posting file; each vocabulary entry holds a pointer to its list in the posting file.

Term       DF  CF  → Postings (Doc #: TF)
affect     2   2   → 1: 1, 2: 1
difficult  1   1   → 2: 1
do         2   2   → 1: 1, 2: 1
easy       2   3   → 1: 2, 2: 1
hard       1   1   → 1: 1
make       2   3   → 1: 2, 2: 1
negative   1   1   → 1: 1
positive   1   1   → 2: 1
task       2   2   → 1: 1, 2: 1

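The construction steps (extract the terms, sort them, merge duplicates and collect frequencies) can be sketched as follows. The term lists are typed in by hand from Doc 1 and Doc 2 after stop-word removal and stemming, and the dictionary layout is an assumption made for illustration; the dictionary key stands in for the pointer from vocabulary to postings.

```python
from collections import Counter

# Index terms per document after tokenizing, stop-word removal and stemming.
docs = {
    1: ["negative", "affect", "make", "hard", "do", "easy", "task", "make", "easy"],
    2: ["positive", "affect", "make", "easy", "do", "difficult", "task"],
}

def build_inverted_file(docs):
    """Return (vocabulary, postings) for a {doc_id: [terms]} collection."""
    postings = {}  # term -> list of (doc_id, tf), in doc_id order
    for doc_id in sorted(docs):
        # Counter merges repeated terms of one document into a single posting.
        for term, tf in Counter(docs[doc_id]).items():
            postings.setdefault(term, []).append((doc_id, tf))
    vocabulary = {}  # term -> DF and CF, inserted in alphabetical order
    for term in sorted(postings):
        plist = postings[term]
        vocabulary[term] = {"df": len(plist), "cf": sum(tf for _, tf in plist)}
    return vocabulary, postings

vocabulary, postings = build_inverted_file(docs)
for term, entry in vocabulary.items():
    print(term.ljust(10), entry["df"], entry["cf"], postings[term])
```

Here DF is the length of a term's posting list and CF is the sum of its TF values, which is why DF can never exceed CF.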
Searching on Inverted File
◼ Since the index file is divided into two, searching can be done faster by loading only the vocabulary list, which takes little memory even for a large document collection.
◼ Using binary search, searching takes logarithmic time.
  ◼ The search is done in the vocabulary list.
◼ Updating an inverted file is complex.
  ◼ We need to update both the vocabulary and the posting files.

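A minimal sketch of this two-step search, assuming the vocabulary is a sorted in-memory list and each entry stores an (offset, DF) pointer into one flat postings array that stands in for the file on disk. The layout and names are illustrative assumptions; the data is the Doc 1 / Doc 2 example.

```python
from bisect import bisect_left

# Vocabulary in memory: sorted terms, each with (offset, df) into the postings.
terms = ["affect", "difficult", "do", "easy", "hard",
         "make", "negative", "positive", "task"]
pointers = [(0, 2), (2, 1), (3, 2), (5, 2), (7, 1),
            (8, 2), (10, 1), (11, 1), (12, 2)]

# Postings "file": a flat sequence of (doc_id, tf) entries.
postings = [
    (1, 1), (2, 1),  # affect
    (2, 1),          # difficult
    (1, 1), (2, 1),  # do
    (1, 2), (2, 1),  # easy
    (1, 1),          # hard
    (1, 2), (2, 1),  # make
    (1, 1),          # negative
    (2, 1),          # positive
    (1, 1), (2, 1),  # task
]

def lookup(term):
    """Binary-search the vocabulary, then follow the pointer to the postings."""
    i = bisect_left(terms, term)
    if i == len(terms) or terms[i] != term:
        return []  # term is not in the vocabulary
    offset, df = pointers[i]
    return postings[offset:offset + df]

print(lookup("easy"))
```

Adding a new term to this layout means inserting into the sorted vocabulary and shifting every later offset, which illustrates why updates touch both files.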
Example: Create Inverted file
◼ Create an inverted file (both the vocabulary list and the posting file) for the following document collection:
D1: The Department of Computer Science was established in 1984.
D2: The Department launched its first BSc in Computer Studies in 1987.
D3: Followed by the MSc in Computer Science which was started in 1991.
D4: The Department also produced its first PhD graduate in 1994.
D5: Our staff have contributed intellectually and professionally to the advancements in these fields.

Example: Create Inverted file …
◼ After text operations are performed, the following index terms remain for each document:
  ◼ D1 = department comput science establish
  ◼ D2 = department launch bsc comput study
  ◼ D3 = follow msc comput science start
  ◼ D4 = department produce phd graduat
  ◼ D5 = staff contribut intellect profession advance field

Vocabulary (each entry holds a pointer to its list in the posting file):
word        DF  CF  WID
advance     1   1   W1
bsc         1   1   W2
comput      3   3   W3
contribut   1   1   W4
department  3   3   W5
establish   1   1   W6
field       1   1   W7
follow      1   1
graduat     1   1
intellect   1   1
launch      1   1
msc         1   1
phd         1   1
produce     1   1
profession  1   1
science     2   2
staff       1   1
start       1   1
study       1   1

Posting file (loc is the position of the term in the preprocessed document):
WID  Doc#  TF  mTF  loc
W1   5     1   1    5
W2   2     1   1    3
W3   1     1   1    2
W3   2     1   1    4
W3   3     1   1    3
W4   5     1   1    2
W5   1     1   1    1
W5   2     1   1    1
W5   4     1   1    1
…    (continue for the remaining terms)

◼ Inverted lists: W1: d5; W2: d2; W3: d1, d2, d3; …; Wn: di, …, dn
◼ All term-specific information (max TF, TF, TF-IDF, location, etc.) is stored in the posting file.

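The exercise can be checked mechanically. The sketch below builds a positional inverted index from the preprocessed terms D1 to D5; it assumes that a location is the 1-based position of the term in the preprocessed document, and the nested-dictionary layout is chosen only for illustration.

```python
docs = {
    1: "department comput science establish",
    2: "department launch bsc comput study",
    3: "follow msc comput science start",
    4: "department produce phd graduat",
    5: "staff contribut intellect profession advance field",
}

def build_positional_index(docs):
    """Return term -> {doc_id: [positions]}, with positions counted from 1."""
    index = {}
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split(), start=1):
            index.setdefault(term, {}).setdefault(doc_id, []).append(position)
    return index

index = build_positional_index(docs)
for term in sorted(index):
    plist = index[term]
    df = len(plist)                                     # documents containing the term
    cf = sum(len(locs) for locs in plist.values())      # occurrences in the collection
    print(term.ljust(11), df, cf, plist)
```

The TF of a term in a document is the length of its position list, so no separate TF field is needed in this layout.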
Suffix trie
• A suffix trie is an ordinary trie in which the input strings are all possible suffixes of the text.
  – Principle: the idea behind a suffix trie is to assign to each symbol in a text an index corresponding to its position in the text (i.e., the first symbol has index 1 and the last symbol has index n, the number of symbols in the text).
  • To build the suffix trie we use these indices instead of the actual objects.
• The structure has several advantages:
  • It requires less storage space.
  • We do not have to worry about how the text is represented (binary, ASCII, etc.).
  • We do not have to store the same object twice (no duplicates).

Suffix Trie
• Construct a suffix trie for the following string: GOOGOL
• We begin by giving a position to every suffix in the text, from left to right, as per the characters' occurrence in the string.
  TEXT:     G O O G O L $
  POSITION: 1 2 3 4 5 6 7
• Build a suffix trie for all n suffixes of the text.
• Note: the resulting tree has n leaves and height n.
• This structure is particularly useful for any application requiring prefix-based ("starts with") pattern matching.

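A suffix trie for GOOGOL can be sketched with nested dictionaries. The representation is an assumption made for this example: each node maps a character to a child node, and the key `None` marks a leaf and stores the 1-based starting position of its suffix.

```python
LEAF = None  # assumed marker key; it can never clash with a character key

def build_suffix_trie(text):
    """Insert every suffix of text + '$' into a trie of nested dicts."""
    text += "$"  # terminator, so that no suffix is a prefix of another
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[LEAF] = start + 1  # position where this suffix begins
    return root

def count_leaves(node):
    """Count the leaf markers below (and including) this node."""
    return sum(1 if key is LEAF else count_leaves(child)
               for key, child in node.items())

def height(node):
    """Number of edges on the longest path from this node down to a leaf."""
    children = [child for key, child in node.items() if key is not LEAF]
    return 1 + max(height(c) for c in children) if children else 0

trie = build_suffix_trie("GOOGOL")
print(sorted(trie), count_leaves(trie), height(trie))
```

Inserting all suffixes character by character takes time proportional to the sum of their lengths, which is quadratic in the length of the text.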
Suffix tree
◼ A suffix tree is an extension of the suffix trie: it is built over all the proper suffixes of S.
◼ The suffix tree is created by compacting the unary nodes of the suffix trie.
◼ We store pointers rather than words in the leaves.
◼ It is also possible to replace the string on every edge by a pair (a, b), where a and b are the beginning and end indices of the string, e.g., for GOOGOL$:
  (3,7) for OGOL$
  (1,2) for GO
  (7,7) for $

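Compacting unary nodes can be sketched as a recursive pass over a suffix trie. The nested-dictionary representation and the `None` leaf marker are assumptions made for this illustration, and the edges here carry their substrings directly; storing an index pair (a, b) instead would be a space optimization on top of this.

```python
LEAF = None  # assumed marker key holding the suffix start position

def build_suffix_trie(text):
    """Insert every suffix of text + '$' into a trie of nested dicts."""
    text += "$"
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[LEAF] = start + 1
    return root

def compact(node):
    """Merge each chain of single-child nodes into one multi-character edge."""
    tree = {}
    for key, child in node.items():
        if key is LEAF:
            tree[LEAF] = child
            continue
        label = key
        # Follow the chain while the child is unary and not a leaf.
        while len(child) == 1 and LEAF not in child:
            (ch, grandchild), = child.items()
            label += ch
            child = grandchild
        tree[label] = compact(child)
    return tree

suffix_tree = compact(build_suffix_trie("GOOGOL"))
for label, subtree in suffix_tree.items():
    print(repr(label), subtree)
```

For GOOGOL$ the root keeps four edges (GO, O, L$ and $), and the edge labels GO and OGOL$ are the substrings that the pairs (1,2) and (3,7) refer to.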
Example: Suffix tree
• Let s = abab. A suffix tree of s is a compressed trie of all suffixes of abab$:
  { $, b$, ab$, bab$, abab$ }
• We label each leaf with the starting point of the corresponding suffix.
  root
    ab
      ab$ → 1
      $ → 3
    b
      ab$ → 2
      $ → 4
    $ → 5

Generalized suffix tree
• Given a set of strings S, a generalized suffix tree of S is a compressed trie of all suffixes of every s ∈ S.
• To make the suffixes prefix-free, we add a special character, $, at the end of s.
• To associate each suffix with a unique string in S, we add a different special symbol to each s.
• Build a suffix tree for the string s1$s2#, where '$' and '#' are special terminators for s1 and s2.
• Ex.: Let s1 = abab and s2 = aab. The generalized suffix tree for s1 and s2 is built from these suffixes:
  s1$ = abab$: 1. abab$  2. bab$  3. ab$  4. b$  5. $
  s2# = aab#:  1. aab#  2. ab#  3. b#  4. #
[Figure: generalized suffix tree for abab$ and aab#; each leaf is labelled with the starting position of its suffix]

Search in suffix tree
◼ Searching for all instances of a substring S in a suffix tree is easy, since any substring of the text is a prefix of some suffix.
◼ Pseudo-code for searching in a suffix tree:
  ◼ Start at the root.
  ◼ Go down the tree, each time taking the path that matches the next characters of S.
  ◼ If S corresponds to a node x, then return all leaves in the subtree rooted at x:
    ◼ the places where S can be found are given by the pointers in all the leaves of that subtree.
  ◼ If a NIL pointer is encountered before reaching the end of S, then S is not in the tree.

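The search procedure can be sketched on an uncompressed suffix trie, which keeps the code short; a compacted suffix tree would be walked the same way but would compare whole edge labels. The nested-dictionary layout and the `None` leaf marker are assumptions made for this example.

```python
LEAF = None  # assumed marker key holding the suffix start position

def build_suffix_trie(text):
    """Insert every suffix of text + '$' into a trie of nested dicts."""
    text += "$"
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[LEAF] = start + 1
    return root

def collect_positions(node):
    """Gather the start positions stored in all leaves below this node."""
    positions = []
    for key, child in node.items():
        if key is LEAF:
            positions.append(child)
        else:
            positions.extend(collect_positions(child))
    return sorted(positions)

def find(root, pattern):
    """Return every 1-based position at which pattern occurs in the text."""
    node = root
    for ch in pattern:
        if ch not in node:  # the NIL-pointer case: pattern is absent
            return []
        node = node[ch]
    return collect_positions(node)

trie = build_suffix_trie("GOOGOL")
print(find(trie, "GO"), find(trie, "O"), find(trie, "LOG"))
```

Walking down costs one step per pattern character; the extra work is only in listing the leaves of the matched subtree.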
Suffix Tree Applications
◼ Suffix trees can be used to solve a large number of string problems that occur in:
  ◼ text editing,
  ◼ free-text search, etc.
◼ Some examples of string problems are given below:
  ◼ String matching
  ◼ Longest common substring
  ◼ Longest repeated substring
  ◼ Palindromes

Complexity Analysis
◼ The suffix tree for a string of n symbols can be built in O(n²) time by inserting every suffix in turn (linear-time construction algorithms also exist).
◼ Searching is very fast: the search time is linear in the length of the search string S.
◼ The number of leaves is n+1 for a text of n symbols plus the terminator $ (one leaf per suffix).
◼ Furthermore, in the leaves we may store either the strings themselves or pointers to the strings (that is, integers).
◼ Searching for a substring[1..m] in string[1..n] can be solved in O(m) time.

Exercise
Given the following index terms: worker, word and world, construct an index file using a suffix tree.
End of Chapter 3

Information storage and Retrieval-chapter 3.pdf

  • 2. Outline ◼ Major Steps in Index Construction ◼ Index file Evaluation Metrics ◼ Building Index file ◼ Sequential File ◼ Inverted file ◼ Suffix tree ◼ Suffix Trie ◼ Suffix TreeApplications 2 12/10/2024
  • 3. Indexing: Basic Concepts 3 ◼ Indexing is an arrangement of index terms to permit fast searching and reducing memory space requirement ◼ It used to speed up access to desired information from document collection asper users query such that ◼ It enhances efficiency in terms of time for retrieval. ◼ Relevantdocuments are searched and retrieved quick ◼ Index file usually has index terms in asorted order. ◼ Whichlist is easier to search? fox pig zebra hen ant cat dog lion ox ant cat dog fox hen lion ox pig zebra 12/10/2024
  • 4. ◼ Anindex file consists of records, called index entries. ◼ Index files are much smaller than the original file. ◼ Remember Heaps Law: in 1 GBof text collection the vocabulary has asize of only 5 MB. This size may be further reduced by Linguistic pre-processing (or text operations). ◼ The usual unit for indexing is the word ◼ Index terms - are used to look up records in afile. 4 Indexing: Basic Concepts 12/10/2024
  • 5. Major Steps in Index Construction ◼Source file: Collection of text document ◼Adocument can be described by a set of representative keywordscalled index terms. ◼Index Terms Selection: apply text operations or preprocessing. ◼Tokenize: identify words in a document, so that each document is represented byalist of keywords or attributes. ◼Stop words removal: words with high frequency are non- content bearing and needs to be removed from text collection. 12/10/2024
  • 6. Major Steps in Index Construction … 6 ◼Word stem: reduce words with similar meaning into their stem/root word. ◼Term relevance weight: Different index terms have varying relevance when used to describe document contents. This effect is captured through the assignment of numerical weights to each index term of a document. There are different index terms weighting methods: including TF, IDF, TF*IDF… ◼Indexing structure: a set of index terms (vocabulary) are organized in Index File to easily identify documents in which each term occurs in. 12/10/2024
  • 7. Basic Indexing Process Tokenizer Token stream. Friends Romans countrymen Linguistic preprocessor Modified tokens. friend roman countryman Indexer Index File Documents to be indexed. Friends, Romans, countrymen. friend roman countryman 2 4 2 13 16 1 Inverted file 12/10/2024
  • 8. Index file Evaluation Metrics 8 ◼Running time of the main operations ◼ Access/searchtime ◼ How much is the running time to find the required search key from the list? ◼Update time (Insertion time, Deletion time) ◼ How much time does it take to update existing records in an attempt to add new terms or delete existing unnecessary terms? ◼ Does the indexing structure allows incremental update or re- indexing? ◼Space overhead ◼ Computer storage space consumed for keeping the list. 12/10/2024
  • 9. Building Index file 9 ◼An index file of a document is a file consisting of a list of index terms and a link to one or more documents that has the index term ◼An index file is alist of search terms that are organized for associative look-up, i.e., to answer user’s query: ◼In which documents doesaspecified search term appear? ◼Where within each document does each term appear? (There may be several occurrences.) ◼For organizing index file for acollection of documents, there are various optionsavailable: ◼Decide what data structure and/or file structure to use. Isit sequential file, inverted file, suffixtree, etc. ? 12/10/2024
  • 10. Sequential File 10 ◼ Sequential file is the most primitive file structures. ◼ It hasno vocabulary aswell aslinking pointers. ◼ The records are generally arranged serially, one after another, but in lexicographic order on the value of some key field. i.e ◼ a particular attribute is chosen as primary key whose value will determine the order ofthe records. ◼ when the first key fails to discriminate among records, a second key ischosen to give an order. 12/10/2024
  • 11. Example: 11 ◼Given a collection of documents, they are parsed to extract words and these are saved with the Document ID. Negative affect can make it harder to do even easy tasks. so make it easy Doc 1 positive affect can make it easier to do difficult tasks Doc 2 12/10/2024
  • 12. ◼ After all documents have been tokenized, stop words are removed, and normalizationand stemmingare applied, to generate index terms ◼ These index terms in sequential file are sorted in alphabetical order Sorting the Vocabulary Sequential file 12/10/2024
  • 13. Sequential File ◼To access records search serially. ◼starting at the first record read and investigate all the succeeding records until the required record is found or end of the file is reached. ◼Update options: Is the index needs to be rebuilt or incremental update is supported? 12/10/2024
  • 14. Sequential File … ◼Its main advantages: ◼easy to implement; ◼provides fast access to the next record using lexicographic order. ◼ Can be searched quickly, using binary search, O(log n) ◼ Its disadvantages: ◼ No weights attached to terms. ◼ Random access is slow: since similar terms are indexed individually, we need to find all terms that match with the query. 12/10/2024
  • 15. Inverted file ◼Aword oriented indexing mechanism based on sorted list of keywords, with each keyword having links to the documents containing it ◼ Building and maintaining an inverted index is arelatively low cost risk. On atext of n words an inverted index can be built in O(n) time ◼ This list is inverted from alist of terms in location order to a list of terms in alphabetical order. Original Documents Document IDs Word Extraction Word IDs •W1:d1,d2,d3 •W2:d2,d4,d7,d9 •… •Wn :di,…dn •Inverted Files March 8, 2020 15 12/10/2024
  • 16. Inverted file 17 Datato be held in the inverted file includes ◼ The vocabulary (List of terms): ◼ is the set of all distinct words (index terms) in the text collection. ◼having information about vocabulary (list of terms) speeds searching for relevant documents ◼For each term: the inverted file contains information related to ◼Location: all the text locations/positions where the word occurs ◼frequency of occurrence of terms in adocument collection 12/10/2024
  • 17. Enhancements to Inverted Files -- Concept 18 Location: Each posting holds information about the location of each term within the document. Uses user interface design -- highlight location of search term adjacency and near operators (in Boolean searching) Frequency: Each inverted list includes the number of postings for each term. Uses term weighting query processing optimization 12/10/2024
  • 18. Inverted file ◼Having information about the location of each term within the document helps for: ◼user interface design: highlight location of search term ◼proximity based ranking: adjacency and near operators (in Boolean searching) ◼Having information about frequency is used for: ◼calculating term weighting (like TF, TF*IDF, …) ◼optimizing query processing 19 12/10/2024
  • 19. Inverted File 20 Term CF Doc ID TF Location term 1 3 2 19 29 1 1 1 66 213 45 term 2 4 3 19 22 1 2 1 94 7, 212 56 term 3 1 5 1 43 term 4 3 11 34 2 1 3, 70 40 Documents are organized bythe terms/words they contain Thisis calledan index file. T ext operations are performed before building the index. CF , total frequencyof tj in the corpusn Is it possible to keep all these information during searching? 12/10/2024
  • 20. Construction of Inverted file Aninvertedindexconsists of two files:vocabulary and posting files ◼ Avocabulary file (Word list): ◼ stores all of the distinct terms (keywords) that appear in anyofthe documents (in lexicographicalorder, i.e like that of adictionary) and ◼ For eachword apointer to aposting file ◼ Recordskept for eachterm j in the vocabulary (word list) contains the following: ◼ term j ◼ number of documents in whichterm j occurs (DFj) ◼ Collection frequency of term j (Cf) ◼ pointer to inverted (postings) 21list for term j 12/10/2024
  • 21. Postings File (Inverted List) ◼For each distinct term in the vocabulary, the posting file stores a list of pointers to the documents that contain that term. ◼Eachelement in an inverted list is called aposting, i.e., the occurrence of aterm in adocument ◼Eachlist consists of one or many individual postings Advantage of dividing inverted file into vocabulary and posting: ◼Keeping apointer in the vocabulary to the list in the posting file allows: ◼ the vocabulary to be kept in memory at search time even for large text collection, while the Posting file is kept on disk for accessing the pointers to documents 12/10/2024
  • 22. General structure of Inverted File ◼ The following figure shows the general structure of inverted index file. 12/10/2024
  • 23. Organization of Index File Term DF CF Pointer To posting term 1 3 3 term 2 3 4 term 3 1 1 term 4 2 3 Inverted lists Vocabulary (word list) Postings (inverted list) Documents 24 12/10/2024
  • 24. Example: 25 ◼Given acollection of documents, they are parsed to extract words and these are saved with the Document ID. Doc 1 Doc 2 Negative affect can make it harder to do even easy tasks. so make it easy positive affect can make it easier to do difficult tasks 12/10/2024
  • 25. Sorting the Vocabulary
    ◼ After all documents have been tokenized, the inverted file is sorted by terms.
    ◼ Steps:
      ◼ Extract the terms in each document
      ◼ Sort the terms
      ◼ Compile the terms, i.e. collect the frequencies for each term
  • 26. Remove stop words and compute frequency
    ◼ Multiple entries of a term in a single document are merged, and frequency information is added.
  • 27. Stemming and compute frequency
    ◼ Multiple entries of a term in a single document are merged, and frequency information is added.
  • 28. Vocabulary and postings file
    The file is commonly split into a dictionary (vocabulary) file and a posting file; each vocabulary entry holds a pointer to its posting list.

    Vocabulary                  Postings
    Term        DF   CF         (Doc #, TF)
    affect      2    2     →    (1, 1), (2, 1)
    difficult   1    1     →    (2, 1)
    do          2    2     →    (1, 1), (2, 1)
    easy        2    3     →    (1, 2), (2, 1)
    hard        1    1     →    (1, 1)
    make        2    3     →    (1, 2), (2, 1)
    negative    1    1     →    (1, 1)
    positive    1    1     →    (2, 1)
    task        2    2     →    (1, 1), (2, 1)
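The construction steps for the two-document example can be sketched in a few lines. This is a minimal sketch, assuming stop-word removal and stemming have already been applied; the term lists in `docs` and the function name `build_inverted_file` are illustrative choices, not part of the slides.

```python
from collections import Counter, defaultdict

# Assumed output of the text operations (stop words removed, words stemmed).
docs = {
    1: ["negative", "affect", "make", "hard", "do", "easy", "task", "make", "easy"],
    2: ["positive", "affect", "make", "easy", "do", "difficult", "task"],
}

def build_inverted_file(docs):
    """Return (vocabulary, postings) for a dict of doc_id -> list of terms."""
    postings = defaultdict(list)                     # term -> [(doc_id, tf), ...]
    for doc_id in sorted(docs):
        # Counter merges repeated terms within one document into a single tf.
        for term, tf in Counter(docs[doc_id]).items():
            postings[term].append((doc_id, tf))
    vocabulary = {
        term: {"df": len(plist), "cf": sum(tf for _, tf in plist)}
        for term, plist in sorted(postings.items())  # vocabulary kept in sorted order
    }
    return vocabulary, dict(postings)

vocabulary, postings = build_inverted_file(docs)
for term, stats in vocabulary.items():
    print(term, stats["df"], stats["cf"], postings[term])
```

DF is the length of a term's posting list and CF is the sum of its TF values, so CF can never be smaller than DF.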
  • 29. Searching on Inverted File
    ◼ Since the index file is divided into two, searching can be done faster by loading only the vocabulary list, which takes little memory even for a large document collection.
    ◼ The search is done in the vocabulary list.
    ◼ Using binary search on the sorted vocabulary, a lookup takes time logarithmic in the vocabulary size.
    ◼ Updating an inverted file is complex: both the vocabulary and the posting files need to be updated.
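A binary-search lookup over a sorted vocabulary might look like the sketch below. The term list, the `posting_offsets` values (counted in postings, assuming lists are stored back to back), and the function name `lookup` are hypothetical, chosen only to illustrate the idea.

```python
from bisect import bisect_left

# Sorted in-memory vocabulary and, for each term, an assumed offset
# (index of its first posting) into the posting file.
vocabulary = ["affect", "difficult", "do", "easy", "hard",
              "make", "negative", "positive", "task"]
posting_offsets = [0, 2, 3, 5, 7, 8, 10, 11, 12]

def lookup(term):
    """Binary search the vocabulary; return the posting offset or None."""
    i = bisect_left(vocabulary, term)                # O(log V) comparisons
    if i < len(vocabulary) and vocabulary[i] == term:
        return posting_offsets[i]
    return None                                      # term is not indexed

print(lookup("easy"))
print(lookup("zebra"))
```

The explicit equality check matters: `bisect_left` returns an insertion point even when the term is absent.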
  • 30. Example: Create Inverted file
    ◼ Create an inverted file (both the vocabulary list and the posting file) for the following document collection:
    D1: The Department of Computer Science was established in 1984.
    D2: The Department launched its first BSc in Computer Studies in 1987.
    D3: Followed by the MSc in Computer Science which was started in 1991.
    D4: The Department also produced its first PhD graduate in 1994.
    D5: Our staff have contributed intellectually and professionally to the advancements in these fields.
  • 31. Example: Create Inverted file
    ◼ After text operations, only the content-bearing terms (shown in red on the original slide) remain as index terms:
    D1: The Department of Computer Science was established in 1984.
    D2: The Department launched its first BSc in Computer Studies in 1987.
    D3: Followed by the MSc in Computer Science which was started in 1991.
    D4: The Department also produced its first PhD graduate in 1994.
    D5: Our staff have contributed intellectually and professionally to the advancements in these fields.
  • 32. … Example: Create Inverted file
    After the text operations are performed:
    ◼ D1 = department comput science establish
    ◼ D2 = department launch bsc comput study
    ◼ D3 = follow msc comput science start
    ◼ D4 = department produce phd graduat
    ◼ D5 = staff contribut intellect profession advance field
  • 33. Vocabulary, pointers and posting
    Vocabulary (each entry points to its posting list):

    word         DF   CF   WID
    advance      1    1    W1
    bsc          1    1    W2
    comput       3    3    W3
    contribut    1    1    W4
    department   3    3    W5
    establish    1    1    W6
    field        1    1    W7
    follow       1    1    (WIDs continue)
    graduat      1    1
    intellect    1    1
    launch       1    1
    msc          1    1
    phd          1    1
    produce      1    1
    profession   1    1
    science      2    2
    staff        1    1
    start        1    1
    study        1    1

    Posting (first rows, in vocabulary order):

    WID   Doc#   TF   mTF   loc
    W1    5      1    1     5
    W2    2      1    1     3
    W3    1      1    1     2
    W3    2      1    1     4
    W3    3      1    1     2
    W4    5      1    1     4
    W5    1      1    1     1
    W5    2      1    1     1
    W5    4      1    1     1
    …

    Pointers from posting to the document file:
    • W1: d5
    • W2: d2
    • W3: d1, d2, d3
    • Wn: di, …, dn

    All term-specific information (max tf, tf, tf-idf, location, etc.) is stored in the posting file.
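A positional inverted file for D1 to D5 can be derived mechanically from the processed term lists. The sketch below is an illustration only: it assumes a term's location is its 1-based position in the processed list, which the slide does not state, so its locations may differ from the `loc` values shown there. The name `build_positional_index` is hypothetical.

```python
from collections import defaultdict

# Processed documents as listed after the text operations.
docs = {
    "D1": ["department", "comput", "science", "establish"],
    "D2": ["department", "launch", "bsc", "comput", "study"],
    "D3": ["follow", "msc", "comput", "science", "start"],
    "D4": ["department", "produce", "phd", "graduat"],
    "D5": ["staff", "contribut", "intellect", "profession", "advance", "field"],
}

def build_positional_index(docs):
    """Map each term to {doc_id: [positions]} (positions are an assumption)."""
    index = defaultdict(dict)
    for doc_id, terms in docs.items():
        for pos, term in enumerate(terms, start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index

index = build_positional_index(docs)
for term in sorted(index):                           # vocabulary in sorted order
    plist = index[term]
    df = len(plist)                                  # number of documents
    cf = sum(len(p) for p in plist.values())         # total occurrences
    print(f"{term:<11} DF={df} CF={cf} postings={plist}")
```

TF for a document is simply the length of its position list, so the same structure yields TF, DF and CF.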
  • 34. Suffix trie
    • A suffix trie is an ordinary trie in which the input strings are all possible suffixes of a text.
    – Principle: the idea behind a suffix trie is to assign to each symbol in a text an index corresponding to its position in the text (i.e. the first symbol has index 1, the last symbol has index n, the number of symbols in the text).
    • To build the suffix trie we use these indices instead of the actual objects.
    • The structure has several advantages:
      • It requires less storage space.
      • We do not have to worry about how the text is represented (binary, ASCII, etc.).
      • We do not have to store the same object twice (no duplicates).
  • 35. Suffix Trie
    • Construct a suffix trie for the following string: GOOGOL
    • We begin by giving a position to every suffix in the text, from left to right, according to where its first character occurs in the string.
      TEXT:     G O O G O L $
      POSITION: 1 2 3 4 5 6 7
    • Build a suffix trie for all n suffixes of the text.
    • Note: the resulting tree has n leaves and height n. This structure is particularly useful for any application requiring prefix-based ("starts with") pattern matching.
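The GOOGOL construction can be sketched with nested dictionaries. This is a minimal illustration, not an optimized structure; the reserved `"leaf"` key and the helper names are assumptions made for the sketch (single characters can never collide with the four-letter key).

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a trie of nested dicts."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})           # reuse shared prefixes
        node["leaf"] = start + 1                     # 1-based start position of the suffix
    return root

def count_leaves(node):
    """Count the suffixes stored below a node."""
    return sum(1 if key == "leaf" else count_leaves(child)
               for key, child in node.items())

trie = build_suffix_trie("GOOGOL$")
print(sorted(trie))            # first characters of all suffixes
print(count_leaves(trie))      # one leaf per suffix
```

The terminator `$` guarantees that no suffix is a prefix of another, so every suffix ends at its own leaf.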
  • 36. Suffix tree
    ◼ A suffix tree is a compact form of the suffix trie: it represents all the suffixes of a string S.
    ◼ The suffix tree is created by compacting the unary nodes of the suffix trie (chains of single-child nodes are merged into one edge).
    ◼ We store pointers rather than words in the leaves.
    ◼ It is also possible to replace the string on every edge by a pair (a, b), where a and b are the beginning and end index of the string in the text, e.g. for GOOGOL$:
      (3,7) for OGOL$
      (1,2) for GO
      (7,7) for $
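The (a, b) edge encoding can be checked with a one-line helper. This is only a sketch of the index convention (1-based, inclusive at both ends); the function name `edge_label` is hypothetical.

```python
def edge_label(text, a, b):
    """Recover the edge string from 1-based, inclusive indices (a, b)."""
    return text[a - 1:b]

text = "GOOGOL$"
for a, b in [(3, 7), (1, 2), (7, 7)]:
    print((a, b), edge_label(text, a, b))
```

Storing two integers per edge instead of the substring itself keeps the tree's size linear in the length of the text.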
  • 37. Example: Suffix tree
    • Let s = abab. A suffix tree of s is a compressed trie of all suffixes of abab$:
      { $, b$, ab$, bab$, abab$ }
    • We label each leaf with the starting point of the corresponding suffix.

      root
        ab
          ab$  → leaf 1
          $    → leaf 3
        b
          ab$  → leaf 2
          $    → leaf 4
        $      → leaf 5
  • 38. Generalized suffix tree
    • Given a set of strings S, a generalized suffix tree of S is a compressed trie of all suffixes of every s ∈ S.
    • To make the suffixes prefix-free, we add a special character, $, at the end of s.
    • To associate each suffix with a unique string in S, add a different special symbol to each s.
    • Build a suffix tree for the string s1$s2#, where `$' and `#' are special terminators for s1 and s2.
    • Ex.: let s1 = abab and s2 = aab.
      Suffixes of s1$: 1. abab$  2. bab$  3. ab$  4. b$  5. $
      Suffixes of s2#: 1. aab#   2. ab#   3. b#   4. #
      A generalized suffix tree for s1 and s2 is:

      root
        a
          ab#    → s2, leaf 1
          b
            ab$  → s1, leaf 1
            $    → s1, leaf 3
            #    → s2, leaf 2
        b
          ab$    → s1, leaf 2
          $      → s1, leaf 4
          #      → s2, leaf 3
        $        → s1, leaf 5
        #        → s2, leaf 4
  • 39. Search in suffix tree
    ◼ Searching for all instances of a substring S in a suffix tree is easy, since any substring of the text is a prefix of some suffix.
    ◼ Pseudo-code for searching in a suffix tree:
      ◼ Start at the root.
      ◼ Go down the tree, each time taking the path that matches the next characters of S.
      ◼ If S ends at a node x (or inside the edge leading to x), return all leaves in the subtree rooted at x; the places where S can be found are given by the pointers in those leaves.
      ◼ If a NIL pointer is encountered before reaching the end of S, then S is not in the tree.
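The search procedure can be sketched on an uncompressed suffix trie, which follows the same steps as the compressed tree but one character per edge. This is an assumption-laden illustration: nested dicts, a reserved `"leaf"` key, and the function names are all choices made for the sketch.

```python
def build_suffix_trie(text):
    """Suffix trie as nested dicts; 'leaf' holds the 1-based suffix start."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1
    return root

def leaf_positions(node):
    """Collect the start positions stored in all leaves below node."""
    found = []
    for key, child in node.items():
        if key == "leaf":
            found.append(child)
        else:
            found.extend(leaf_positions(child))
    return found

def find_all(root, pattern):
    """Walk down along pattern; a missing branch means 'not in the text'."""
    node = root
    for ch in pattern:
        if ch not in node:                           # the NIL-pointer case
            return []
        node = node[ch]
    return sorted(leaf_positions(node))              # every leaf below is a match

trie = build_suffix_trie("GOOGOL$")
print(find_all(trie, "GO"))
print(find_all(trie, "GG"))
```

The walk itself costs one step per pattern character; collecting the leaves adds time proportional to the size of the subtree reached.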
  • 40. Suffix Tree Applications
    ◼ A suffix tree can be used to solve a large number of string problems that occur in:
      ◼ text editing,
      ◼ free-text search, etc.
    ◼ Some examples of string problems are given below:
      ◼ String matching
      ◼ Longest common substring
      ◼ Longest repeated substring
      ◼ Palindromes
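As one example of these applications, the longest repeated substring corresponds to the deepest branching node, because a path shared by two or more suffixes occurs at least twice in the text. The sketch below uses a naive suffix trie rather than a compressed suffix tree, assumes the terminator does not occur in the text, and returns one answer when several substrings tie; the function name is hypothetical.

```python
def longest_repeated_substring(text, terminator="$"):
    """Return a longest substring occurring at least twice (naive suffix trie)."""
    s = text + terminator                            # terminator assumed absent from text
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})

    best = ""
    stack = [(root, "")]                             # iterative depth-first traversal
    while stack:
        node, path = stack.pop()
        # Two or more children: at least two suffixes share the prefix `path`.
        if len(node) >= 2 and len(path) > len(best):
            best = path
        for ch, child in node.items():
            stack.append((child, path + ch))
    return best

print(longest_repeated_substring("GOOGOL"))
print(longest_repeated_substring("abab"))
```

The naive trie uses quadratic space, so this is suitable only for short strings.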
  • 41. Complexity Analysis
    ◼ The suffix tree for a string of length n can be built in O(n^2) time by inserting the suffixes one at a time.
    ◼ Searching is very fast: the search time is linear in the length of the query string S.
    ◼ The number of leaves is n+1, where n is the length of the input string (one leaf per suffix, including the terminator on its own).
    ◼ Furthermore, in the leaves we may store either the strings themselves or pointers to the strings (that is, integers).
    ◼ Searching for a substring[1..m] in string[1..n] can be solved in O(m) time.
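The quadratic bound follows from counting the characters processed when the suffixes are inserted one at a time. As a sketch, assuming the text has n symbols plus one terminator, so that the suffixes have lengths 1, 2, …, n+1:

```latex
\sum_{i=1}^{n+1} i \;=\; \frac{(n+1)(n+2)}{2} \;=\; O(n^2)
```

For GOOGOL$ (n = 6) this gives 7 · 8 / 2 = 28 character steps in the worst case.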
  • 42. Exercise
    Given the index terms worker, word and world, construct an index file using a suffix tree.