A.M.T COLLEGE
DEPARTMENT OF INFORMATION
TECHNOLOGY
Information Storage and Retrieval
CHAPTER TWO
TEXT/DOCUMENT OPERATION AND
AUTOMATIC INDEXING
The main contents of this chapter are the following:
• Index term selection (Zipf’s law and Luhn’s selection)
• Document pre-processing (lexical analysis, stop word elimination, stemming)
• Term extraction (term weighting and similarity measures)
2.1 Index term selection
• An index language is the language used to describe documents and requests.
• The elements of an index language are index terms.
• Some words are not good for representing documents. Using all words has a computational cost and increases search time and storage requirements, and using the set of all words in a collection to index documents generates too much noise for the retrieval task. Therefore, term selection is very important.
The main objectives of term selection are:
• Represent textual documents by a set of keywords called index terms, or simply terms.
• Increase efficiency by extracting from the resulting document a selected set of terms to be used for indexing the document.
• If full-text representation is adopted, then all words are used for indexing.
An index term, also called a keyword, is either a single word or a multiword phrase.
Indexing
• Is the art of organizing information.
• Is an association of descriptors (keywords, concepts) to documents in view of …
• Is the act of assigning index terms to a document.
• Is the process of storing data in a particular way in order to locate and retrieve the data.
• Is a way of identifying important information and representing it in a useful way.
Why indexing?
• Some representation of the content is needed.
• The full document cannot be used for search.
Indexing is used to:
• Find documents by topic
• Define topic areas and relate documents to each other
• Predict relevance between documents and an information need
• Allow easy identification of documents
There are two ways of indexing:
1. Manual indexing
Human indexers decide which keywords to assign to documents, based on a controlled vocabulary.
The indexers analyse and represent the content of a document through keywords, based on their intellectual judgment and semantic interpretation of its concepts and themes.
The following are important in manual indexing:
• Terms that will be used by the user
• Indexing vocabulary
• Collection characteristics
Indexers are normally provided with guidelines (input sheets, manuals and a printed thesaurus) to determine the contents of a given document. Manual indexing is usually done in a library environment.
Advantages of manual indexing
• Ability to perform abstraction (conclude what the subject is) and determine additional related terms
• Ability to judge the value of concepts
Disadvantages of manual indexing
• Slow and expensive: professional indexers come at a significant cost
• Low consistency among indexers (maintaining consistency is difficult)
• Labor intensive
2. Automatic indexing
Automatic indexing is the assignment of content identifiers with the help of modern computing technology.
A computer system is used to record the descriptors generated by humans, and the system extracts “typical” or “significant” terms.
The original texts of information items are used as the basis of indexing.
Automatic indexing is necessary for the following reasons:
• Information overload: an enormous amount of information is generated by day-to-day activity.
• Explosion of machine-readable text: massive amounts of information are available in electronic format and on the internet.
• Cost effectiveness: human indexing is expensive and labor intensive.
Procedures for automatic indexing
Generating document representatives through automatic indexing involves:
o Lexical analysis
o Use of a stop list
o Noun identification (optional)
o Phrase formation (optional)
o Use of conflation procedures (stemming, optional)
o Selection of index terms
o Weighting the resulting terms (optional)
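The procedure listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the function name `index_document`, the tiny `STOP_WORDS` list and the letters-only token pattern are hypothetical choices made for this example. Noun identification and phrase formation are left out, conflation is an optional callable that defaults to doing nothing, and raw term frequency stands in for the weighting step.

```python
import re
from collections import Counter
from typing import Callable, Dict, Iterable

# Hypothetical, deliberately tiny stop list; real systems use much longer ones.
STOP_WORDS = frozenset({"a", "an", "and", "in", "is", "of", "the", "to"})


def index_document(
    text: str,
    stop_words: Iterable[str] = STOP_WORDS,
    stem: Callable[[str], str] = lambda word: word,
) -> Dict[str, int]:
    """Return {index term: raw term frequency} for one document."""
    stop = set(stop_words)
    # Lexical analysis: lower-case the text and keep runs of letters as tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Use of stop list: drop very common words.
    content_words = [t for t in tokens if t not in stop]
    # Conflation (optional): map each word to a stem; the default is a no-op.
    stems = [stem(t) for t in content_words]
    # Selection and weighting: keep every remaining term, weighted by raw frequency.
    return dict(Counter(stems))


doc = "The index is the set of index terms assigned to the document."
print(index_document(doc))
```

Passing a different `stem` function, for example a suffix stripper, changes only the conflation step and leaves the rest of the pipeline untouched.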
Advantages of automatic indexing
• Reduced processing time (fast)
• Reduced cost (inexpensive)
• Easy to maintain
• Improved consistency
• Better retrieval
Disadvantage of automatic indexing
• Mechanical execution of an algorithm, with no intelligent interpretation of aboutness or relevance
2.1.1 Zipf’s law in IR and Luhn’s selection
2.1.1.1 Zipf’s law
Zipf’s law states that, given a corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table.
• The rank-frequency distribution is an inverse relation.
• The two most frequent words (e.g. “the”, “of”) can account for about 10% of the word occurrences in documents.
• E.g. the word “the” is the most frequently occurring word.
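In symbols, the inverse relation is commonly written as below. The notation is an assumption introduced here for illustration and does not come from the slides: r is the rank of a word in the frequency table, f(r) is its frequency, and C is a constant that depends on the collection.

```latex
f(r) \approx \frac{C}{r}
\qquad\Longleftrightarrow\qquad
r \cdot f(r) \approx C
```

Under the idealized law, the word at rank 2 occurs about half as often as the word at rank 1, and the word at rank 10 about one tenth as often.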
Zipf’s law example
The table shows the most frequently occurring words in a collection of 336,310 documents containing 125,720,891 total words, of which 508,209 are unique.

Frequent word    Number of occurrences
the              7,398,934
of               3,893,790
to               3,364,653
and              3,320,687
in               2,311,785
is               1,559,147
for              1,313,561
The              1,144,860
that             1,066,503
Said             1,027,713
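The table can be checked against the law with a short script. The sketch below copies the counts and the collection size from the example, while the function names are hypothetical. It computes rank × frequency for each row, which an ideal Zipf distribution would keep constant, and the share of all word occurrences covered by the top-ranked words.

```python
# Counts copied from the table above; TOTAL_WORDS is the collection size it quotes.
TOTAL_WORDS = 125_720_891
COUNTS = [
    ("the", 7_398_934),
    ("of", 3_893_790),
    ("to", 3_364_653),
    ("and", 3_320_687),
    ("in", 2_311_785),
    ("is", 1_559_147),
    ("for", 1_313_561),
    ("The", 1_144_860),
    ("that", 1_066_503),
    ("Said", 1_027_713),
]


def rank_frequency_products(counts):
    """Return (rank, word, frequency, rank * frequency) per row, with rank starting at 1."""
    return [
        (rank, word, freq, rank * freq)
        for rank, (word, freq) in enumerate(counts, start=1)
    ]


def share_of_top(counts, total, k):
    """Fraction of all word occurrences covered by the k most frequent words."""
    return sum(freq for _, freq in counts[:k]) / total


for rank, word, freq, product in rank_frequency_products(COUNTS):
    print(f"{rank:>2}  {word:<5} {freq:>10,} {product:>12,}")
print(f"top 2 share:  {share_of_top(COUNTS, TOTAL_WORDS, 2):.1%}")
print(f"top 10 share: {share_of_top(COUNTS, TOTAL_WORDS, 10):.1%}")
```

On these counts the two top-ranked words cover roughly 9% of all occurrences and the ten listed words roughly 21%. The rank × frequency products stay within the same order of magnitude without being exactly constant, so the law holds only approximately for this collection.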
2.1.1.2 Luhn’s analysis
• Luhn’s idea (1958): the frequency of word occurrence in an article furnishes a useful measurement of word significance.
• He suggested that both extremely common and extremely uncommon words are not very useful for document representation and indexing.
• Therefore, the most important words for indexing are those that occur with intermediate frequencies.
• Thus, according to Luhn, medium-frequency terms are better candidates for indexing.
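Luhn’s selection can be illustrated as a filter with a lower and an upper frequency cut-off. This is a rough sketch under stated assumptions: the function name `luhn_candidates`, the term counts and both cut-off values are hypothetical and chosen only to make the idea concrete, since the slides do not prescribe specific thresholds.

```python
from typing import Dict, List

# Hypothetical term counts for one small collection (illustrative only).
TERM_COUNTS: Dict[str, int] = {
    "the": 60,
    "of": 35,
    "retrieval": 9,
    "index": 7,
    "term": 6,
    "zipf": 1,
    "luhn": 1,
}


def luhn_candidates(term_counts: Dict[str, int], lower: int, upper: int) -> List[str]:
    """Return terms whose frequency f satisfies lower <= f <= upper, most frequent first."""
    kept = [(term, freq) for term, freq in term_counts.items() if lower <= freq <= upper]
    # Sort by descending frequency, breaking ties alphabetically.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return [term for term, _ in kept]


# The upper cut-off removes very common words, the lower cut-off removes rare ones.
print(luhn_candidates(TERM_COUNTS, lower=2, upper=10))
```

With these assumed cut-offs, “the” and “of” fall above the upper limit, “zipf” and “luhn” fall below the lower limit, and the medium-frequency terms remain as index term candidates.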
2.2 Document Pre-processing
Preprocessing is the process of controlling the size of the vocabulary, i.e. the number of distinct words used as index terms.
A text operation is the process of transforming text into a logical representation.
There are five main operations (transformations) for selecting index terms:
A. Lexical analysis of the text
Generates a set of words from the text collection, with the objective of treating digits, hyphens, punctuation marks, and the case of letters.
• Digits, e.g. 1999
• Case, e.g. Republican vs. republican
• Hyphens, e.g. MS-DOS, B-49
• Punctuation, e.g. WWW.WSU.EDU.ET
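The four decisions above (digits, case, hyphens, punctuation) can be made explicit as options of a small tokenizer. The sketch below is one possible policy, not the method prescribed by the slides: the token pattern, the function name `lexical_analysis` and its keyword arguments are assumptions made for illustration.

```python
import re
from typing import List

# A token is a run of letters or digits, optionally joined by internal hyphens or
# periods, so "MS-DOS", "B-49" and "WWW.WSU.EDU.ET" each stay in one piece.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:[-.][A-Za-z0-9]+)*")


def lexical_analysis(
    text: str,
    fold_case: bool = True,
    keep_digits: bool = True,
    split_hyphens: bool = False,
) -> List[str]:
    """Turn raw text into a list of tokens according to the chosen policy."""
    tokens = TOKEN.findall(text)
    if split_hyphens:
        # Treat each part of a hyphenated word as a separate token.
        tokens = [part for tok in tokens for part in tok.split("-")]
    if not keep_digits:
        # Drop tokens made up only of digits, such as years.
        tokens = [tok for tok in tokens if not tok.isdigit()]
    if fold_case:
        tokens = [tok.lower() for tok in tokens]
    return tokens


sample = "In 1999, the Republican used MS-DOS on a B-49 (see WWW.WSU.EDU.ET)."
print(lexical_analysis(sample))
print(lexical_analysis(sample, fold_case=False, keep_digits=False, split_hyphens=True))
```

Each option trades recall against precision. Splitting “MS-DOS” lets a query for “DOS” match, but it also breaks up a term that users may search for as a whole. This pattern also joins words around a period that has no following space, so a stricter rule may be needed for noisy text.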
B. Elimination of stop-words.
Filter out words which are
not useful in the retrieval
process.
C. Stemming of the remaining words, with the objective of removing affixes (i.e. suffixes and prefixes) and allowing the retrieval of documents containing syntactic variations of query terms (e.g. connect, connected, connecting).
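As a rough illustration of suffix removal, the sketch below strips one suffix from a word. It is a toy rule set under stated assumptions, not a published stemming algorithm: the suffix list, the function name `strip_suffix` and the minimum stem length are hypothetical.

```python
# Hypothetical suffix list, ordered so that "ions" is tried before "ion" and "s".
SUFFIXES = ("ions", "ing", "ion", "ed", "s")


def strip_suffix(word: str, min_stem: int = 3) -> str:
    """Remove the first matching suffix if at least min_stem letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for variant in ["connect", "connected", "connecting", "connection", "connections"]:
    print(variant, "->", strip_suffix(variant))
```

All five variants reduce to “connect”, so a query for any of them can match documents containing the others. The rules are far too crude for general use: “cutting” becomes “cutt”, for instance, which is why practical systems rely on carefully designed stemmers such as Porter’s algorithm.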
D. Selection of index terms, to determine which words, stems, or groups of words will be used as indexing elements.
E. Construction of term categorization structures, such as a thesaurus, to capture relationships that allow the expansion of the original query with related terms.
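Query expansion with a thesaurus can be sketched as a dictionary lookup. Everything in this example is hypothetical: the `THESAURUS` entries, the function name `expand_query` and the choice to append related terms right after the query term they belong to.

```python
from typing import Dict, List

# Hypothetical thesaurus: each term maps to a list of related terms.
THESAURUS: Dict[str, List[str]] = {
    "car": ["automobile", "vehicle"],
    "index": ["catalogue"],
}


def expand_query(terms: List[str], thesaurus: Dict[str, List[str]] = THESAURUS) -> List[str]:
    """Return each query term followed by its related terms, skipping duplicates."""
    expanded: List[str] = []
    for term in terms:
        for candidate in [term, *thesaurus.get(term, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query(["car", "index"]))
```

Terms without a thesaurus entry pass through unchanged, so the expanded query always contains the original one.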
Text processing system
Tokenization
One of the steps used to convert the text of the documents into a sequence of words.
Elimination of stop words
Stop words are extremely common words across document collections that have no discriminatory power.
E.g. articles, pronouns, prepositions, conjunctions/connectors
Normalization
A form of text standardization.
E.g. U.S.A vs. USA
Case folding
It is often best to lower-case everything.
E.g. Fasil vs. fasil vs. FASIL
Stemming
The process involves the removal of affixes.
E.g. boy-boys, cut-cutting, creation-create
