Text Operations
CHAPTER TWO
Statistical Properties of Text
• How is the frequency of different words distributed?
• How fast does vocabulary size grow with the size of a
corpus?
• Three well-known researchers defined statistical properties of words in text:
– Zipf’s Law: models word distribution in a text corpus
– Luhn’s idea: measures word significance
– Heaps’ Law: shows how vocabulary size grows with corpus size
• Such properties of a text collection greatly affect the
performance of an IR system & can be used to select
suitable term weights & other aspects of the system.
Word Distribution
• A few words are very common.
The two most frequent words (e.g. “the”, “of”) can
account for about 10% of word occurrences.
• Most words are very rare.
About half the words in a corpus appear only once;
these are words “read only once” (hapax legomena).
Word distribution: Zipf's Law
• Zipf's Law, named after the Harvard linguistics professor
George Kingsley Zipf (1902-1950), attempts to capture the
distribution of the frequencies (i.e. the number of occurrences)
of the words within a text.
• For all the words in a collection of documents, for each word w
f : the frequency with which w appears
r : the rank of w in order of frequency (the most frequently
occurring word has rank 1, etc.)
[Figure: distribution of sorted word frequencies according to Zipf’s law, i.e. the rank-frequency distribution; a word w appears at rank r with frequency f.]
Word distribution: Zipf's Law
• If the words, w, in a
collection are ranked, r,
by their frequency, f,
they roughly fit the
relation:
r * f = c
– Different collections
have different constants
c.
• Zipf's Law states that when the distinct words in a text are
arranged in decreasing order of their frequency of occurrence (most
frequent words first), the occurrence characteristics of the vocabulary
can be characterized by the constant rank-frequency law of Zipf.
• The table shows the most frequently occurring words from a 336,310-document corpus
containing 125,720,891 total words, of which 508,209 are unique.
More Example: Zipf’s Law
• Illustration of the rank-frequency law. Let the total number of word
occurrences in the sample be N = 1,000,000

Rank (R)   Term   Frequency (F)   R·(F/N)
    1      the       69,971        0.070
    2      of        36,411        0.073
    3      and       28,852        0.086
    4      to        26,149        0.104
    5      a         23,237        0.116
    6      in        21,341        0.128
    7      that      10,595        0.074
    8      is        10,099        0.081
    9      was        9,816        0.088
   10      he         9,543        0.095
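As a quick check of the rank-frequency law on the table above, a minimal Python sketch (the counts are the illustrative figures from the slide, not a real corpus):

N = 1_000_000  # total word occurrences in the sample
counts = [("the", 69_971), ("of", 36_411), ("and", 28_852), ("to", 26_149),
          ("a", 23_237), ("in", 21_341), ("that", 10_595), ("is", 10_099),
          ("was", 9_816), ("he", 9_543)]

for rank, (term, f) in enumerate(counts, start=1):
    # Zipf: rank * frequency is roughly constant, so rank * (f / N)
    # should hover around the same value (here roughly 0.07-0.13)
    print(f"{rank:2d}  {term:<4}  {f:7,d}  {rank * f / N:.3f}")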
Methods that Build on Zipf's Law
• Stop lists:
• Ignore the most frequent words (upper cut-off).
Used by almost all systems.
• Significant words:
• Take words in between the most frequent (upper
cut-off) and least frequent words (lower cut-off).
• Term weighting:
• Give differing weights to terms based on their
frequency, with the most frequent words weighted
less. Used by almost all ranking methods.
Word significance: Luhn’s Ideas
• Luhn's idea (1958): the frequency of word occurrence in a
text furnishes a useful measurement of word significance.
• Luhn suggested that both extremely common and extremely
uncommon words were not very useful for indexing.
• For this, Luhn specified two cutoff points: an upper and a
lower cutoff, based on which non-significant words are
excluded
–The words above the upper cutoff were considered too common
–The words below the lower cutoff were considered too rare
–Hence neither contributes significantly to the content of the text
–The ability of words to discriminate content reaches a peak at a
rank-order position halfway between the two cutoffs
• Let f be the frequency of occurrence of words in a text, and r their
rank in decreasing order of word frequency, then a plot relating f & r
yields the following curve
Luhn’s Ideas
Luhn (1958) suggested that both extremely common and
extremely uncommon words were not very useful for document
representation & indexing.
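A minimal sketch of Luhn's idea: keep only the words whose collection frequency falls between the two cutoffs. The cutoff values used here (lower = 3, upper = 100) are arbitrary assumptions for illustration; in practice they are tuned per collection.

from collections import Counter

def significant_words(tokens, lower_cutoff=3, upper_cutoff=100):
    # Words above the upper cutoff are too common, words below the lower
    # cutoff are too rare; both groups are excluded as non-significant.
    counts = Counter(tokens)
    return {w for w, f in counts.items() if lower_cutoff <= f <= upper_cutoff}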
Heaps’ Distribution
• Distribution of size of the vocabulary vs. total number
of terms extracted from text corpus
Example: from 1,000,000,000 documents, there
may be 1,000,000 distinct words. Can you agree?
Example: Heaps’ Law
• We want to estimate the size of the vocabulary for a
corpus of 1,000,000 words.
• Assume, based on statistical analysis of smaller
corpora, that:
– A corpus with 100,000 words contains 50,000 unique words;
and
– A corpus with 500,000 words contains 150,000 unique
words
• Estimate the vocabulary size for the 1,000,000 words
corpus?
– What about for a corpus of 1,000,000,000 words?
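A worked sketch, assuming the usual form of Heaps' law, V = k·N^β; the two data points come from the slide, and the fitted constants only apply to this made-up example:

import math

# Heaps' law: V = k * N**beta, where V is vocabulary size and N is corpus size
N1, V1 = 100_000, 50_000
N2, V2 = 500_000, 150_000

beta = math.log(V2 / V1) / math.log(N2 / N1)   # about 0.68
k = V1 / N1 ** beta                            # about 19

for N in (1_000_000, 1_000_000_000):
    print(f"N = {N:>13,d}  estimated vocabulary ~ {k * N ** beta:,.0f}")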
Text Operations
• Not all words in a document are equally significant for
representing the contents/meaning of the document
– Some words carry more meaning than others
– Nouns tend to be the most representative of a document’s content
• Therefore, we need to preprocess the text of the documents in a
collection before using it to produce index terms
• Using the set of all words in a collection to index documents
creates too much noise for the retrieval task
– Reducing noise means reducing the number of words that can be
used to refer to a document
• Text operation is the task of preprocessing text documents to
control the size of the vocabulary, i.e. the number of distinct
words used as index terms
– Preprocessing should lead to an improvement in information
retrieval performance
• However, some search engines on the Web omit preprocessing
– Every word in the document is an index term
Text Operations
• Text operations are the process of transforming text into its
logical representation
• There are 5 main operations for selecting index terms, i.e. for
choosing the words/stems (or groups of words) to be used as
indexing terms:
– Lexical analysis/tokenization of the text: generate a set of
words from the text collection
– Elimination of stop words: filter out words which are not useful
in the retrieval process
– Stemming words: remove affixes (prefixes and suffixes) and
group together word variants with similar meaning
– Construction of term categorization structures such as a
thesaurus, to capture relationships among words, allowing
the expansion of the original query with related terms
– Selection of index terms: choose which words/stems will actually
serve as index terms (see Index Term Selection below)
Generating Document Representatives
• Text Processing System
– Input text – full text, abstract or title
– Output – a document representative adequate for use in an automatic
retrieval system
• The document representative consists of a list of class names, each
name representing a class of words occurring in the total input text. A
document will be indexed by a name if one of its significant words occurs as a
member of that class.
[Diagram: free text from the document corpus passes through tokenization, stop-word removal, stemming, and thesaurus lookup to produce the index terms.]
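A minimal end-to-end sketch of the pipeline above (tokenize, drop stop words, stem). The stop list and the crude suffix-stripping stemmer are tiny illustrative stand-ins, not a complete implementation:

import re

STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "to", "over"}  # illustrative stop list

def tokenize(text):
    # lowercase and keep unbroken alphabetic strings only
    return re.findall(r"[a-z]+", text.lower())

def crude_stem(word):
    # placeholder stemmer: strip a few common suffixes
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def document_representative(text):
    tokens = tokenize(text)
    return sorted({crude_stem(t) for t in tokens if t not in STOPWORDS})

print(document_representative("The quick brown fox jumps over the lazy dog"))
# -> ['brown', 'dog', 'fox', 'jump', 'lazy', 'quick']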
Lexical Analysis/Tokenization of Text
• Tokenization is one of the steps used to convert the text of
documents into a sequence of words, w1, w2, …, wn, to be
adopted as index terms.
– It is the process of demarcating and possibly classifying
sections of a string of input characters into words.
– For example,
• The quick brown fox jumps over the lazy dog
• The objective of tokenization is identifying the words in the text
– What counts as a word?
• Is it a sequence of alphabetic, numeric, or alphanumeric
characters?
– How do we identify the set of words that exist in a text document?
• Tokenization issues
– numbers, hyphens, punctuation marks, apostrophes, …
Issues in Tokenization
• Two words may be connected by hyphens.
–Should two words connected by hyphens or punctuation marks be
taken as one word or as two? Should a hyphenated sequence be
broken up into two tokens?
• In most cases the hyphen is removed and the words are broken up
(e.g. state-of-the-art → state of the art), but some words, e.g.
MS-DOS, B-49, are unique terms which require their hyphens
• Two words may be connected by punctuation marks.
–remove punctuation marks entirely unless they are significant, e.g.
in program code: x.exe vs. xexe
• Two words may be separated by space.
–E.g. Addis Ababa, San Francisco, Los Angeles
• Two words may be written in different ways
–lowercase, lower-case, lower case ?
–data base, database, data-base?
Issues in Tokenization
• Numbers: Are numbers/digits words & used as index
terms?
– dates (3/12/91 vs. Mar. 12, 1991);
– phone numbers (+251923415005)
– IP addresses (100.2.86.144)
– Generally, don’t index numbers as text: most numbers are not
good index terms (like 1910, 1999)
• What about the case of letters (e.g. Data vs. data vs. DATA):
– case is usually not important, so all letters are converted to upper or
lower case. Which one do people mostly use?
• The simplest approach is to ignore all numbers & punctuation
and use only case-insensitive unbroken strings of
alphabetic characters as tokens.
• Tokenization issues are language specific
–They require the language to be known
Tokenization
• Analyze text into a sequence of discrete tokens (words).
• Input: “Friends, Romans and Countrymen”
• Output: Tokens (an instance of a sequence of characters that
are grouped together as a useful semantic unit for processing)
– Friends
– Romans
– and
– Countrymen
• Each such token is now a candidate for an index entry,
after further processing
• But what are valid tokens to emit as index terms?
Exercise: Tokenization
• The cat slept peacefully in the living room. It’s
a very old cat.
• The instructor (Dr. O’Neill) thinks that the
boys’ stories about Chile’s capital aren’t
amusing.
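One deliberately naive way to tokenize the exercise sentences, which makes the punctuation and apostrophe decisions explicit (the regex is just one possible choice, not the "right" answer):

import re

sentences = [
    "The cat slept peacefully in the living room. It's a very old cat.",
    "The instructor (Dr. O'Neill) thinks that the boys' stories about "
    "Chile's capital aren't amusing.",
]

for s in sentences:
    # keep letters plus an internal apostrophe; everything else is a delimiter
    print(re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", s))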
Elimination of Stopwords
• Stopwords are extremely common words across document
collections that have no discriminatory power
– They may occur in 80% of the documents in a collection.
• Stopwords have little semantic content; it is typical to
remove such high-frequency words
– They would appear to be of little value in helping select
documents matching a user need, and need to be filtered out
as potential index terms
• Examples of stopwords are articles, prepositions,
conjunctions, etc.:
– articles (a, an, the); pronouns: (I, he, she, it, their, his)
– Some prepositions (on, of, in, about, besides, against),
conjunctions/ connectors (and, but, for, nor, or, so, yet), verbs
(is, are, was, were), adverbs (here, there, out, because, soon,
after) and adjectives (all, any, each, every, few, many, some)
can also be treated as stopwords
Stopwords
•Intuition:
– Stopwords can take up around 50% of the text. Hence, removing
them reduces document size drastically, enabling smaller indices
for information retrieval
– Good compression techniques for indices: the 30
most common words account for about 30% of the tokens
in written text
• Better approximation of importance for classification,
summarization, etc.
• Stopwords are language dependent.
How to detect a stopword?
• One method: Sort terms (in decreasing order) by
document frequency and take the most frequent ones
– In a collection about insurance practices, “insurance”
would be a stop word
• Another method: Build a stop word list that contains a
set of articles, pronouns, etc.
– Why do we need stop lists: with a stop list, we can
exclude the commonest words entirely from the
index terms.
• With the removal of stopwords, we can measure better
approximation of importance for classification,
summarization, etc.
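A sketch of the first method on a toy collection: rank terms by document frequency and take the most frequent ones as stop-word candidates. The three "documents" and the top-3 cutoff are made up purely for illustration.

from collections import Counter

docs = [
    "the insurance policy covers the car",
    "the insurance claim was filed after the accident",
    "the agent sold the insurance to the driver",
]

# document frequency = number of documents in which each term occurs
df = Counter(term for d in docs for term in set(d.split()))
print(df.most_common(3))   # 'the' and 'insurance' dominate this collection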
Stop words
• Stop word elimination used to be standard in older IR
systems.
• But the trend is away from doing this. Most web search
engines index stop words:
–Good query optimization techniques mean you pay little at query
time for including stop words.
–You need stopwords for:
• Phrase queries: “King of Denmark”
• Various song titles, etc.: “Let it be”, “To be or not to be”
• “Relational” queries: “flights to London”
–Elimination of stopwords might reduce recall (e.g. “To be or not
to be” – all eliminated except “be” – no or irrelevant retrieval)
Normalization
• Normalization is canonicalizing tokens so that matches occur
despite superficial differences in the character
sequences of the tokens
– Need to “normalize” terms in indexed text as well as query
terms into the same form
– Example: We want to match U.S.A. and USA, by deleting
periods in a term
• Case Folding: Often best to lower case everything,
since users will use lowercase regardless of ‘correct’
capitalization…
– Republican vs. republican
– Fasil vs. fasil vs. FASIL
– Anti-discriminatory vs. antidiscriminatory
– Car vs. automobile?
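A minimal sketch of the two normalization steps just described (period deletion and case folding); these two rules are only a fragment of what real normalizers do:

def normalize(token):
    # delete periods so U.S.A. and USA map to the same term,
    # then case-fold so Republican and republican match
    return token.replace(".", "").lower()

for t in ("U.S.A.", "USA", "Republican", "republican", "Anti-discriminatory"):
    print(t, "->", normalize(t))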
Normalization issues
• Good for
– Allows instances of Automobile at the beginning of a
sentence to match a query for automobile
– Helps a search engine when most users type ferrari
when they are interested in a Ferrari car
• Bad for
– Proper names vs. common nouns
• E.g. General Motors, Associated Press, …
• Solution:
– lowercase only words at the beginning of the sentence
• In IR, lowercasing is most practical because of the way
users issue their queries
Stemming/Morphological analysis
•Stemming reduces tokens to their “root” form so that
morphological variants of a word are recognized as the same term.
–The process involves removal of affixes (i.e. prefixes and suffixes)
with the aim of reducing variants to the same stem
•Stemming often removes both the inflectional and derivational
morphology of a word
–Inflectional morphology: varies the form of words in order to
express grammatical features, such as singular/plural or
past/present tense. E.g. boy → boys, cut → cutting.
–Derivational morphology: makes new words from old ones. E.g.
creation is formed from create, but they are two separate words.
Similarly, destruction → destroy
•Stemming is language dependent
–Correct stemming is language specific and can be complex.
For example, “compressed and compression are both accepted”
stems to “compress and compress are both accept”
Stemming
• The final output from a conflation algorithm is a set of classes,
one for each stem detected.
–A Stem: the portion of a word which is left after the removal of its
affixes (i.e., prefixes and/or suffixes).
–Example: ‘connect’ is the stem for {connected, connecting,
connection, connections}
–Thus, [automate, automatic, automation] all reduce to
automat
• A class name is assigned to a document if and only if one of its
members occurs as a significant word in the text of the
document.
–A document representative then becomes a list of class names,
which are often referred to as the document’s index terms/keywords.
• Queries : Queries are handled in the same way.
Ways to implement stemming
There are basically two ways to implement stemming.
–The first approach is to create a big dictionary that maps words
to their stems.
• The advantage of this approach is that it works perfectly (insofar as
the stem of a word can be defined perfectly); the disadvantages are
the space required by the dictionary and the investment required
to maintain the dictionary as new words appear.
–The second approach is to use a set of rules that extract stems
from words.
• The advantages of this approach are that the code is typically small,
and it can gracefully handle new words; the disadvantage is that it
occasionally makes mistakes.
• But, since stemming is imperfectly defined, anyway, occasional
mistakes are tolerable, and the rule-based approach is the one that
is generally chosen.
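A toy version of the first (dictionary) approach: an exact word-to-stem lookup, with unknown words passed through unchanged. The dictionary contents here are illustrative only.

STEM_DICT = {
    "connected": "connect", "connecting": "connect",
    "connection": "connect", "connections": "connect",
    "automate": "automat", "automatic": "automat", "automation": "automat",
}

def dictionary_stem(word):
    # perfect for words listed in the dictionary, but new words are untouched
    return STEM_DICT.get(word.lower(), word.lower())

print(dictionary_stem("Connections"))    # -> connect
print(dictionary_stem("connectivity"))   # unknown word, returned as-is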
Porter Stemmer
• Stemming is the operation of stripping the suffixes from a word,
leaving its stem.
– Google, for instance, uses stemming to search for web pages containing
the words connected, connecting, connection and connections when
users ask for a web page that contains the word connect.
• In 1979, Martin Porter developed a stemming algorithm that
uses a set of rules to extract stems from words, and though it
makes some mistakes, most common words seem to work out
right.
– Porter describes his algorithm and provides a reference implementation
in C at http://tartarus.org/~martin/PorterStemmer/index.html
Porter stemmer
• The most common algorithm for stemming English words to
their common grammatical root
• It is a simple procedure for removing known affixes in
English without using a dictionary. To get rid of plurals,
the following rules are used:
– SSES → SS       caresses → caress
– IES → I         ponies → poni
– SS → SS         caress → caress
– S → (null)      cats → cat
– EMENT → (null)  (delete a final “ement” if what remains is longer than
1 character)
replacement → replac
cement → cement
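The plural and EMENT rules above can be expressed as a short ordered rule list; the sketch below covers only these few rules as stated on the slide, not the full Porter algorithm (complete implementations exist, e.g. in NLTK):

def strip_plural(word):
    # first matching rule wins, mirroring the order given above
    if word.endswith("sses"):
        return word[:-2]          # caresses -> caress
    if word.endswith("ies"):
        return word[:-3] + "i"    # ponies -> poni
    if word.endswith("ss"):
        return word               # caress -> caress
    if word.endswith("s"):
        return word[:-1]          # cats -> cat
    return word

def strip_ement(word):
    # delete a final 'ement' only if what remains is longer than 1 character
    if word.endswith("ement") and len(word) - len("ement") > 1:
        return word[:-len("ement")]   # replacement -> replac
    return word                       # cement stays cement

for w in ("caresses", "ponies", "caress", "cats", "replacement", "cement"):
    print(w, "->", strip_ement(strip_plural(w)))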
Porter stemmer
• While step 1a gets rid of plurals, step 1b removes -ed or
-ing.
– e.g.
agreed → agree        disabled → disable
matting → mat         mating → mate
meeting → meet        milling → mill
messing → mess        meetings → meet
feed → feed
Stemming: challenges
• May produce unusual stems that are not English
words:
– Removing ‘UAL’ from FACTUAL and EQUAL
• May conflate (reduce to the same token) words
that are actually distinct:
– “computer”, “computational”, “computation” are all
reduced to the same token “comput”
• May not recognize all morphological derivations.
Thesauri
• Often full-text searching cannot be accurate, since different
authors may select different words to represent the same concept
– Problem: the same meaning can be expressed using different
terms that are synonyms, homonyms or related terms
– How can we ensure that, for the same meaning, identical
terms are used in the index and in the query?
• Thesaurus: The vocabulary of a controlled indexing language,
formally organized so that a priori relationships between
concepts (for example as "broader" and “related") are made
explicit.
• A thesaurus contains terms and relationships between terms
– IR thesauri rely typically upon the use of symbols such as
USE/UF (UF=used for), BT, and RT to demonstrate inter-term
relationships.
– e.g., car = automobile, truck, bus, taxi, motor vehicle
-color = colour, paint
Aim of Thesaurus
• Thesaurus tries to control the use of the vocabulary by showing a set
of related words to handle synonyms and homonyms
• The aim of thesaurus is therefore:
– to provide a standard vocabulary for indexing and searching
• A thesaurus rewrites terms to form equivalence classes, and we index
these equivalence classes
• When a document contains automobile, index it under car as well
(usually also vice versa)
– to assist users with locating terms for proper query formulation: When
the query contains automobile, look under car as well for expanding
query
– to provide classified hierarchies that allow the broadening and
narrowing of the current request according to user needs
Thesaurus Construction
Example: a thesaurus built to assist IR in searching for
cars and vehicles:
Term: Motor vehicles
UF : Automobiles
Cars
Trucks
BT: Vehicles
RT: Road Engineering
Road Transport
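A minimal sketch of how such entries can drive query expansion, with the UF/BT/RT structure flattened into a plain synonym map (the mapping below is illustrative, not a real thesaurus):

THESAURUS = {
    # preferred term -> terms it is "used for" (UF) and close synonyms
    "motor vehicles": ["automobiles", "cars", "trucks"],
    "car": ["automobile", "motor vehicle"],
    "color": ["colour"],
}

def expand_query(terms):
    expanded = []
    for t in terms:
        expanded.append(t)
        expanded.extend(THESAURUS.get(t, []))  # add related thesaurus terms
    return expanded

print(expand_query(["car", "insurance"]))
# -> ['car', 'automobile', 'motor vehicle', 'insurance']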
More Example
Example: a thesaurus built to assist IR in the field of
computer science:
TERM: natural languages
– UF natural language processing (UF=used for NLP)
– BT languages (BT=broader term is languages)
– TT languages (TT = top term is languages)
– RT artificial intelligence (RT=related term/s)
computational linguistics
formal languages
query languages
speech recognition
Language-specificity
• Many of the above features embody
transformations that are
– Language-specific and
– Often, application-specific
• These are “plug-in” addenda to the indexing
process
• Both open source and commercial plug-ins are
available for handling these
Index Term Selection
• Index language is the language used to describe
documents and requests
• Elements of the index language are index terms which
may be derived from the text of the document to be
described, or may be arrived at independently.
–If a full-text representation of the text is adopted, then all
words in the text are used as index terms (full-text
indexing)
–Otherwise, we need to select content-bearing words to be
used as index terms, reducing the size of the index file,
which is essential for designing an efficient IR search system