Information retrieval 20 divergence from randomness

Information Retrieval : 20
Divergence from Randomness
Prof Neeraj Bhargava
Vaibhav Khanna
Department of Computer Science
School of Engineering and Systems Sciences
Maharshi Dayanand Saraswati University Ajmer

• A distinct probabilistic model has been
proposed by Amati and van Rijsbergen
• The idea is to compute term weights by
measuring the divergence between a term
distribution produced by a random process
and the actual term distribution
• Thus, the name divergence from randomness
• The model is based on two fundamental
assumptions, as follows.

First assumption:
• Not all words are equally important for describing
the content of the documents
• Words that carry little information are assumed to
be randomly distributed over the whole
document collection C
• Given a term ki, its probability distribution over
the whole collection is referred to as P(ki|C)
• The amount of information associated with this
distribution is given by −log P(ki|C)
• By modifying this probability function, we can
implement distinct notions of term randomness
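
The information measure −log P(ki|C) from the first assumption can be sketched in a few lines. This is only an illustration: the slides do not fix the base of the logarithm, so base 2 (bits) is an assumption here, and the function name information_content is hypothetical.

```python
import math


def information_content(p: float) -> float:
    """Return -log2(p): the information, in bits, of an event with probability p.

    Base 2 is an assumption; the slides only write -log P(ki|C).
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return -math.log2(p)


# A term whose observed distribution is very likely under the random model
# carries little information; an unlikely one carries a lot.
for p in (0.5, 0.01, 1e-6):
    print(f"P = {p:g}  ->  {information_content(p):.2f} bits")
```

The lower the probability assigned by the random model, the larger the information value, which is what lets the model separate informative terms from randomly scattered ones.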

Second assumption:
• A complementary term distribution can be obtained by
considering just the subset of documents that contain
term ki
• This subset is referred to as the elite set
• The corresponding probability distribution, computed
with regard to document dj, is referred to as P(ki|dj)
• The smaller the probability of observing a term ki in a
document dj, the rarer and more important the term is
considered to be
• Thus, the amount of information associated with the
term in the elite set is defined as 1 − P(ki|dj)
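
A minimal sketch of how the two information measures might be combined into a term weight. It assumes the weight is the product (−log P(ki|C)) × (1 − P(ki|dj)), which is how the divergence from randomness framework is usually presented but is not stated on these slides. The function name dfr_weight, the base-2 logarithm, and the probabilities passed in are all hypothetical; how P(ki|dj) is estimated is not covered here.

```python
import math


def dfr_weight(p_collection: float, p_elite: float) -> float:
    """Combine the two information measures into one term weight.

    p_collection: P(ki|C), probability of the term's distribution under the
                  random model (first assumption).
    p_elite:      P(ki|dj), probability of observing the term in document dj,
                  computed over the elite set (second assumption).
    Assumption: the weight is the product of -log2 P(ki|C) and 1 - P(ki|dj).
    """
    if not 0.0 < p_collection <= 1.0:
        raise ValueError("p_collection must lie in (0, 1]")
    if not 0.0 <= p_elite <= 1.0:
        raise ValueError("p_elite must lie in [0, 1]")
    return -math.log2(p_collection) * (1.0 - p_elite)


# Made-up probabilities, for illustration only.
print(dfr_weight(p_collection=0.25, p_elite=0.5))
```

Under this assumption, a term scores highly when it is both unlikely under the random model and rarely observed in the document.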

Random Distribution
• To compute the distribution of terms in the collection,
distinct probability models can be considered
• For instance, consider that Bernoulli trials are used to
model the occurrences of a term in the collection
• To illustrate, consider a collection with 1,000 documents
and a term ki that occurs 10 times in the collection
• Then, the probability of observing 4 occurrences of term
ki in a document is given by the binomial distribution
P(ki|C) = (10 choose 4) × (1/1000)^4 × (1 − 1/1000)^6
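
The Bernoulli-trial example above can be checked numerically. The sketch below assumes each of the 10 occurrences lands in a given document independently with probability 1/1000; the function name binomial_pmf and the constant names are illustrative only.

```python
import math


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent Bernoulli trials."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


NUM_DOCS = 1000        # documents in the collection
TOTAL_OCCURRENCES = 10  # occurrences of the term in the collection
p = 1.0 / NUM_DOCS      # chance that one occurrence falls in a given document

prob = binomial_pmf(4, TOTAL_OCCURRENCES, p)
print(f"P(4 occurrences in one document) = {prob:.4e}")
```

The result is on the order of 2 × 10^-10, so seeing 4 of the 10 occurrences in a single document would be a strong departure from the random model.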

Random Distribution
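• Let p = 1/N be the probability of observing the term in
a given document, where N is the number of documents,
and let Fi be the number of occurrences of ki in the collection
• Define λi = p × Fi and assume that p → 0 as N → ∞
while λi remains constant
• Under these conditions, we can approximate
the binomial distribution by a Poisson process,
which yields
P(ki|C) = e^(−λi) × (λi)^(fi,j) / (fi,j)!
where fi,j is the number of occurrences of ki in document dj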
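
To get a feel for the Poisson approximation, the sketch below compares it with the binomial model on the running example (1,000 documents, 10 occurrences). It assumes the Poisson rate is the number of occurrences divided by the number of documents; the function and constant names are illustrative only.

```python
import math


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent Bernoulli trials."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events under a Poisson law with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


NUM_DOCS = 1000         # documents in the collection
TOTAL_OCCURRENCES = 10  # occurrences of the term in the collection
p = 1.0 / NUM_DOCS
lam = p * TOTAL_OCCURRENCES  # assumed rate: expected occurrences per document

for k in range(5):
    b = binomial_pmf(k, TOTAL_OCCURRENCES, p)
    q = poisson_pmf(k, lam)
    print(f"k={k}: binomial={b:.4e}  poisson={q:.4e}")
```

For k = 0 and k = 1 the two models agree closely. With only 10 occurrences the tail values (for example k = 4) differ noticeably in relative terms, even though both are vanishingly small; the approximation tightens as the number of occurrences grows while the rate stays fixed.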

Distribution over the Elite Set

Assignment
• Explain the Information Retrieval Model of
