Information storage and retrieval system unit two

A.M.T COLLEGE
DEPARTMENT OF INFORMATION
TECHNOLOGY
Information Storage and Retrieval

CHAPTER TWO
TEXT/DOCUMENT OPERATION AND
AUTOMATIC INDEXING

The main contents of this chapter are the following:
•Index term selection (Zipf’s law and Luhn’s selection)
•Document pre-processing (lexical analysis, stop word elimination, stemming)
•Term extraction (term weighting and similarity measures)

2.1 Index term selection
•An index language is the language used to describe documents and requests.
•The elements of an index language are index terms.

•Some words are not good for representing documents. Using all the words in a collection to index documents has a computational cost: it increases search time and storage requirements, and it generates too much noise for the retrieval task. Therefore, term selection is very important.

The main objectives of term selection are:
•Represent textual documents by a
set of keywords called index terms
or simply terms.
•Increase efficiency by extracting
from the resulting document a
selected set of terms to be used for
indexing the document.

•If full-text representation is adopted, then all words are used for indexing.
•An index term, also called a keyword, is either a word (a single word) or a phrase (multiword).

Indexing
•Is the art of organizing information.
•Is an association of descriptors (keywords, concepts) with documents in view of retrieval.
•Is the act of assigning index terms to a document.

•Is the process of storing data in a particular way in order to locate and retrieve the data.
•Is a way of identifying important information and representing it in a useful way.

Why indexing?
•Need some representation of content
•Cannot use the full document for search

Indexing is used to:
•Find documents by topic
•Define topic areas and relate documents to each other
•Predict relevance between documents and an information need
•Allow easy identification of documents

There are two ways of indexing
1. Manual indexing
Human indexers decide which keywords (index terms) to assign to documents, based on a controlled vocabulary.

The indexers analyse and represent the content of a document through keywords, based on their own intellectual judgment and semantic interpretation of its concepts and themes.

The following are important in manual indexing:
•Terms that will be used by the user
•Indexing vocabulary
•Collection characteristics

•Indexers are normally provided with guidelines (input sheets, manuals and a printed thesaurus) to determine the contents of a given document. Manual indexing is usually done in a library environment.

Advantages of manual indexing
•Ability to perform abstraction (conclude what the subject is) and determine additional related terms
•Ability to judge the value of concepts

Disadvantages of manual indexing
•Slow and expensive (significant cost): professional indexers are very expensive
•High probability of inconsistency among indexers (maintaining consistency is difficult)
•Labor intensive

2. Automatic indexing
Automatic indexing is the
assignment of content identifiers,
with the help of modern computing
technology.

A computer system is used to record the descriptors generated by the human, and the system extracts “typical”/“significant” terms.
The original texts of information items are used as the basis of indexing.

Automatic indexing is necessary for the following reasons:
•Information overload: an enormous amount of information is generated by day-to-day activity.
•Explosion of machine-readable text: massive amounts of information are available in electronic format and on the internet.
•Cost effectiveness: human indexing is expensive and labor intensive.

Procedures for automatic indexing
Generating document representatives through automatic indexing involves:
o Lexical analysis
o Use of a stop list
o Noun identification (optional)
o Phrase formation (optional)
o Use of conflation procedures (stemming, optional)
o Selection of index terms
o Weighting the resulting terms (optional)
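
A minimal sketch of how the steps listed above might chain together. It assumes a tiny hand-picked stop list, a toy suffix stripper standing in for a real conflation procedure, and raw term frequency as the weight. The names STOP_WORDS, conflate and index_document are illustrative only, and noun identification and phrase formation are left out.

```python
import re
from collections import Counter

# Hypothetical stop list for illustration; real stop lists are much longer.
STOP_WORDS = {"the", "of", "to", "and", "in", "is", "for", "that", "a", "an"}

def lexical_analysis(text):
    """Split text into lower-case word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]

def conflate(token):
    """Toy suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def index_document(text, max_terms=5):
    """Return the max_terms most frequent stems with their frequencies."""
    tokens = remove_stop_words(lexical_analysis(text))
    stems = [conflate(t) for t in tokens]
    weights = Counter(stems)  # raw term frequency used as the weight
    return weights.most_common(max_terms)

doc = "The indexer indexed the documents, and indexing the documents produced an index."
print(index_document(doc, max_terms=2))  # [('index', 3), ('document', 2)]
```

Here "indexed", "indexing" and "index" are conflated to one stem, so the document is represented mainly by "index" and "document".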

Advantages of automatic indexing
•Reduced processing time (fast)
•Reduced cost (inexpensive)
•Easy to maintain
•Improved consistency
•Better retrieval can be achieved

Disadvantage of automatic indexing
•Mechanical execution of an algorithm, with no intelligent interpretation (of aboutness/relevance)

2.1.1 Zipf’s law in IR and Luhn’s selection
2.1.1.1 Zipf’s law
Zipf’s law states that, given a corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table.
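
Written as a formula, with the symbols introduced here only for illustration: $f(r)$ is the frequency of the word at rank $r$, and $C$ is a constant that depends on the corpus.

```latex
f(r) \propto \frac{1}{r}
\quad\Longleftrightarrow\quad
f(r) \cdot r \approx C
```

In practice the relation holds only approximately, so the product of rank and frequency is roughly, not exactly, constant.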

•The rank-frequency distribution is an inverse relation.
•The two most frequent words (e.g. “the”, “of”) can account for about 10% of word occurrences in documents.
•E.g. the word “the” is the most frequently occurring word.

Zipf’s law example
The table shows the most frequently occurring words in a collection of 336,310 documents containing 125,720,891 total words, of which 508,209 are unique.

Frequent word    Number of occurrences
the              7,398,934
of               3,893,790
to               3,364,653
and              3,320,687
in               2,311,785
is               1,559,147
for              1,313,561
The              1,144,860
that             1,066,503
Said             1,027,713
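
The inverse relation can be checked against the counts in the table. The sketch below multiplies each word's rank by its frequency; if Zipf's law held exactly, every product would be the same. The function name rank_times_frequency is illustrative only, and the counts are copied from the table as given.

```python
# Counts copied from the table above; "the" and "The" are separate entries there.
counts = [
    ("the", 7_398_934),
    ("of", 3_893_790),
    ("to", 3_364_653),
    ("and", 3_320_687),
    ("in", 2_311_785),
    ("is", 1_559_147),
    ("for", 1_313_561),
    ("The", 1_144_860),
    ("that", 1_066_503),
    ("Said", 1_027_713),
]
TOTAL_WORDS = 125_720_891  # total words in the collection

def rank_times_frequency(table):
    """Return (rank, word, rank * frequency) for a frequency-sorted table."""
    return [(rank, word, rank * freq)
            for rank, (word, freq) in enumerate(table, start=1)]

for rank, word, product in rank_times_frequency(counts):
    print(f"{rank:>2}  {word:<5} {product:>12,}")

# Share of all word occurrences covered by the two most frequent words.
top_two_share = (counts[0][1] + counts[1][1]) / TOTAL_WORDS
print(f"Share of the two most frequent words: {top_two_share:.1%}")
```

The products stay within the same order of magnitude (roughly 7 to 13 million), and the two most frequent words alone cover about 9% of all word occurrences.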


2.1.1.2 Luhn’s analysis
•Luhn’s idea (1958): the frequency of word occurrence in a text provides a useful measurement of word significance.

•He suggested that both extremely
common and extremely
uncommon words were not very
useful for document
representation and indexing.

•Therefore, the most important
words for indexing are those
which occur with intermediate
frequencies.
•Thus, according to Luhn, medium-frequency terms are better candidates for indexing.

•He proposed that the frequency of word occurrence in an article furnishes a useful measurement of word significance.
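
Luhn's selection can be sketched as a frequency filter with a lower and an upper cut-off. The cut-off values below are arbitrary assumptions chosen for this toy example, not values given by Luhn, and luhn_select is an illustrative name.

```python
from collections import Counter

def luhn_select(tokens, lower_cutoff, upper_cutoff):
    """Keep terms whose frequency lies between the two cut-offs (inclusive)."""
    freq = Counter(tokens)
    return {t: f for t, f in freq.items() if lower_cutoff <= f <= upper_cutoff}

# Toy token stream: one very common word, two medium ones, one rare one.
tokens = ("the " * 8 + "retrieval " * 4 + "index " * 3 + "zipf " * 1).split()

selected = luhn_select(tokens, lower_cutoff=2, upper_cutoff=5)
print(selected)  # {'retrieval': 4, 'index': 3}
```

The extremely common word ("the") and the extremely rare word ("zipf") are both discarded, leaving the medium-frequency terms as index term candidates.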

2.2 Document Pre-processing
Preprocessing is the process of controlling the size of the vocabulary, that is, the number of distinct words used as index terms.

Text operation is the process of transforming text into a logical representation.
There are 5 main operations/transformations for selecting index terms:

A. Lexical analysis of the text: generates a set of words from the text collection, with the objective of treating digits, hyphens, punctuation marks, and the case of letters.

•DIGITS
•E.g. 1999
•CASE
•E.g. Republican vs. republican
•HYPHEN
•E.g. MS-DOS, B-49
•PUNCTUATION
•E.g. WWW.WSU.EDU.ET
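
A minimal sketch of a lexical analyser that makes each of these decisions explicit. The pattern and the option names (keep_digits, split_hyphens, fold_case) are assumptions made for illustration; which setting is right depends on the collection.

```python
import re

# A token is a run of letters/digits, optionally joined by internal hyphens or dots.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:[-.][A-Za-z0-9]+)*")

def tokenize(text, keep_digits=True, split_hyphens=False, fold_case=True):
    """Convert text into tokens, with switches for digits, hyphens and case."""
    tokens = TOKEN_PATTERN.findall(text)
    if split_hyphens:
        tokens = [part for tok in tokens for part in tok.split("-")]
    if not keep_digits:
        tokens = [tok for tok in tokens if not tok.isdigit()]
    if fold_case:
        tokens = [tok.lower() for tok in tokens]
    return tokens

text = "In 1999, a Republican ran MS-DOS on a B-49; see WWW.WSU.EDU.ET."
print(tokenize(text))
print(tokenize(text, keep_digits=False, split_hyphens=True))
```

With the default settings "MS-DOS", "B-49" and the web address each survive as one token. Splitting hyphens and dropping digits turns "B-49" into just "b", which shows how much information these choices can discard.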

B. Elimination of stop-words.
Filter out words which are
not useful in the retrieval
process.

C. Stemming of the remaining words, with the objective of removing affixes (i.e. suffixes and prefixes) and allowing the retrieval of documents containing syntactic variations of query terms (e.g. connect, connected, connecting).

D. Selection of index terms:
To determine which words/stems (or groups of words) will be used as indexing elements.

E. Construction of term categorization structures
•Structures such as a thesaurus capture term relationships, allowing the expansion of the original query with related terms.
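
A minimal sketch of thesaurus-based query expansion. The thesaurus entries are invented for illustration, and expand_query is a hypothetical helper, not part of any library.

```python
# Hypothetical hand-made thesaurus; the entries are illustrative only.
THESAURUS = {
    "car": ["automobile", "vehicle"],
    "retrieval": ["search"],
}

def expand_query(terms, thesaurus=THESAURUS):
    """Append related terms from the thesaurus, keeping order and avoiding duplicates."""
    expanded = list(terms)
    for term in terms:
        for related in thesaurus.get(term, []):
            if related not in expanded:
                expanded.append(related)
    return expanded

print(expand_query(["car", "retrieval"]))
# ['car', 'retrieval', 'automobile', 'vehicle', 'search']
```

Terms with no thesaurus entry pass through unchanged, so the expanded query always contains the original query.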

Text processing system
Tokenization
Is one of the steps used to convert the text of the documents into a sequence of words.

Elimination of stop words
Stop words are extremely common
words across document collections
that have no discriminatory power.
E.g. articles, pronouns, prepositions, conjunctions/connectors
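
One way to find such words is to look for those that occur in almost every document of a collection. The sketch below builds a stop list from document frequency; the 0.8 threshold and the names build_stop_list and remove_stop_words are assumptions for illustration, and many systems simply use a fixed, hand-made list instead.

```python
from collections import Counter

def build_stop_list(documents, min_doc_fraction=0.8):
    """Treat words occurring in at least min_doc_fraction of the documents as stop words."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.lower().split()))  # count each word once per document
    threshold = min_doc_fraction * len(documents)
    return {word for word, df in doc_freq.items() if df >= threshold}

def remove_stop_words(tokens, stop_list):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in stop_list]

docs = [
    "the index of the library",
    "the cost of manual indexing",
    "the law of word frequency",
]
stop_list = build_stop_list(docs)
print(sorted(stop_list))  # ['of', 'the']
print(remove_stop_words("the cost of indexing".split(), stop_list))  # ['cost', 'indexing']
```

A word that appears in every document cannot help tell one document from another, which is what "no discriminatory power" means here.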

Normalization
It is a way of standardizing text.
E.g. U.S.A vs. USA

Case folding
It is often best to lower-case everything.
E.g. Fasil vs. fasil vs. FASIL
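
Normalization and case folding can be sketched as a single function that maps surface variants to one canonical form. Deleting every period is an assumption that suits abbreviations such as U.S.A but would also merge the parts of a web address, so a real system would need a more careful rule.

```python
def normalize(token):
    """Map surface variants of a token to one canonical form."""
    token = token.replace(".", "")  # U.S.A -> USA
    return token.casefold()         # Fasil, FASIL -> fasil

variants = ["U.S.A", "USA", "Fasil", "fasil", "FASIL"]
print({v: normalize(v) for v in variants})
```

After normalization, "U.S.A" and "USA" match each other, and so do the three spellings of "Fasil".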
Stemming
The process involves the removal of affixes.
E.g. boy-boys, cut-cutting, creation-create
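
A toy suffix stripper that handles the examples above. It is a sketch for illustration only, not the Porter algorithm or any other published stemmer: the suffix list and the minimum stem length of 3 are assumptions, and the stems it produces (such as "creat") need not be real words.

```python
def toy_stem(word):
    """Strip one common suffix so that related word forms share a stem."""
    word = word.lower()
    for suffix in ("ing", "ion", "ed", "s", "e"):
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            # Undouble a consonant exposed by the removal: "cutt" -> "cut".
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

for pair in [("boy", "boys"), ("cut", "cutting"), ("creation", "create")]:
    print(pair, "->", [toy_stem(w) for w in pair])
```

Both members of each pair map to the same stem, so a query containing one form retrieves documents containing the other.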
