Presentation

Evaluation in Information Retrieval

(Book chapter from C. D. Manning, P. Raghavan, and H. Schütze,
Introduction to Information Retrieval)

Dishant Ailawadi
INF384H / CS395T: Concepts of Information Retrieval (and Web Search) Fall11

Outline

● Why Evaluation?
● Standard test collections.

● Precision and Recall

● Mean Average Precision

● Kappa Statistic

● R-Precision

● Summary

Why Evaluation?

● There are many retrieval models, algorithms, and systems. Which one is
the best?
● Measure the effect of adding new features.
● How far down the ranked list will a user need to look to find
some or all relevant documents?
● Difficulty: relevance is not binary but continuous. How do we decide
whether a document is relevant?

Standard Test Collections
A standard test collection consists of three things:
1. A document collection.
2. A set of queries on this collection
3. A set of relevance judgments on those queries.

Each document in the test collection is given a binary classification
(relevant or nonrelevant) with respect to a query. This decision is
referred to as the gold standard or ground truth judgment of relevance.

Standard Test Collections

● Cranfield: built in the 1950s in the UK. Too small to be used nowadays.
● TREC (Text Retrieval Conference)
● Early TREC collections had 50 information needs; TREC 6-8 provide 150
information needs over more than 500 thousand articles.
● More recently, GOV2, a collection of 25 million web pages, has become
available for research.
● NTCIR: East Asian language and cross-language IR systems.
● Cross Language Evaluation Forum (CLEF).
● Reuters-21578: the collection most used for text classification.

Evaluation Measures
                 Relevant                Nonrelevant
Retrieved        True positives (tp)     False positives (fp)
Not retrieved    False negatives (fn)    True negatives (tn)

recall = (Number of relevant documents retrieved) / (Total number of relevant documents)
       = tp/(tp + fn)

precision = (Number of relevant documents retrieved) / (Total number of documents retrieved)
          = tp/(tp + fp)

(How many correct selections?) Accuracy = (tp + tn)/(tp + fp + fn + tn)
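
The three measures follow directly from the four cells of the contingency table. A minimal Python sketch is given here; the function names and the counts in the usage lines are hypothetical, chosen only to illustrate the arithmetic.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of retrieved documents that are relevant."""
    retrieved = tp + fp
    return tp / retrieved if retrieved else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of relevant documents that were retrieved."""
    relevant = tp + fn
    return tp / relevant if relevant else 0.0


def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all retrieve/skip decisions that were correct."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


# Hypothetical counts: 1000 documents, 10 relevant, 20 retrieved, 8 of them relevant.
tp, fp, fn, tn = 8, 12, 2, 978
print(precision(tp, fp))         # 0.4
print(recall(tp, fn))            # 0.8
print(accuracy(tp, fp, fn, tn))  # 0.986
```

With these made-up counts accuracy is high mainly because almost every document is nonrelevant, which suggests why precision and recall are usually the more informative pair.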

An Example
Let total # of relevant docs = 6
Check each new recall point:

 n   doc #   relevant
 1   588     x          R=1/6=0.167; P=1/1=1
 2   589     x          R=2/6=0.333; P=2/2=1
 3   576
 4   590     x          R=3/6=0.5;   P=3/4=0.75
 5   986
 6   592     x          R=4/6=0.667; P=4/6=0.667
 7   984
 8   988
 9   578
10   985
11   103
12   591
13   772     x          R=5/6=0.833; P=5/13=0.38
14   990

One relevant document is missing from the ranking, so recall never
reaches 100%.
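
The recall points in this example can be reproduced by walking down the ranking and recording recall and precision each time a relevant document appears. The sketch below is only an illustration: the function name is hypothetical, and the ranks of the relevant documents (1, 2, 4, 6, 13) are read off the example.

```python
def recall_precision_points(relevant_ranks, total_relevant, n_results):
    """Return (rank, recall, precision) at every rank holding a relevant document."""
    points = []
    hits = 0
    for rank in range(1, n_results + 1):
        if rank in relevant_ranks:
            hits += 1
            points.append((rank, hits / total_relevant, hits / rank))
    return points


# Ranks marked "x" in the example; 6 relevant documents exist in total.
relevant_ranks = {1, 2, 4, 6, 13}
for rank, r, p in recall_precision_points(relevant_ranks, 6, 14):
    print(f"rank {rank:2d}: R={r:.3f} P={p:.3f}")
```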

Combining Precision & Recall
F-measure: weighted harmonic mean (HM) of precision and recall.

Value of β controls the trade-off:
● β = 1: Weight precision and recall equally.
● β > 1: Weight recall more.
● β < 1: Weight precision more.

For β = 1:

F = 2PR / (P + R) = 2 / (1/R + 1/P)
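
The slide shows only the balanced case. As a sketch of how β shifts the balance, the code below uses the standard general form F = (β² + 1)PR / (β²P + R), which is not written on the slide and is assumed here; the function name and the input values are illustrative.

```python
def f_measure(p: float, r: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision p and recall r."""
    b2 = beta * beta
    denom = b2 * p + r
    return (b2 + 1) * p * r / denom if denom else 0.0


# Hypothetical system with P = 0.75 and R = 0.5.
p, r = 0.75, 0.5
print(round(f_measure(p, r, beta=1.0), 3))  # 0.6
print(round(f_measure(p, r, beta=2.0), 3))  # 0.536, pulled toward recall
print(round(f_measure(p, r, beta=0.5), 3))  # 0.682, pulled toward precision
```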

Precision-Recall curve

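Interpolated precision: the interpolated precision at recall level r is
the highest precision found at any recall level r' ≥ r. It smooths the
sawtooth shape of the raw curve.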

11-point Interpolated Average Precision

Recall    Interp. Precision
0.0       1.00
0.1       0.67
0.2       0.63
0.3       0.55
0.4       0.45
0.5       0.41
0.6       0.36
0.7       0.29
0.8       0.13
0.9       0.10
1.0       0.08
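
A small sketch of how interpolated precision at the 11 standard recall levels could be computed for a single query. It uses the (recall, precision) points of the 14-result example from the earlier slide rather than the values in this table, and the function names are hypothetical.

```python
def interpolated_precision(points, recall_level):
    """Highest precision at any observed recall >= recall_level (0.0 if none)."""
    candidates = [p for r, p in points if r >= recall_level]
    return max(candidates, default=0.0)


def eleven_point_average(points):
    """Mean interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    levels = [i / 10 for i in range(11)]
    return sum(interpolated_precision(points, lv) for lv in levels) / len(levels)


# (recall, precision) at each relevant document of the earlier example.
points = [(1 / 6, 1.0), (2 / 6, 1.0), (3 / 6, 0.75), (4 / 6, 4 / 6), (5 / 6, 5 / 13)]
for i in range(11):
    print(f"{i / 10:.1f}  {interpolated_precision(points, i / 10):.2f}")
print(round(eleven_point_average(points), 3))
```

Recall levels above 5/6 get precision 0 here because the sixth relevant document is never retrieved.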

Single Figure Measures

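Mean Average Precision (MAP): the mean of the average precision (AP)
values over all queries.
Example: average precision for the ranked list above (6 relevant
documents, one never retrieved and counted as 0):
(1 + 1 + 0.75 + 0.667 + 0.38 + 0)/6 = 0.633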
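
The average precision example can be checked with a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, and the second query in the MAP call is invented purely to show the averaging step.

```python
def average_precision(relevant_ranks, total_relevant):
    """Mean of precision at each relevant rank; unretrieved relevant docs count as 0."""
    hits = 0
    precision_sum = 0.0
    for rank in sorted(relevant_ranks):
        hits += 1
        precision_sum += hits / rank
    return precision_sum / total_relevant


def mean_average_precision(queries):
    """queries: list of (relevant_ranks, total_relevant) pairs, one per query."""
    return sum(average_precision(ranks, total) for ranks, total in queries) / len(queries)


example_query = ({1, 2, 4, 6, 13}, 6)  # the ranked-list example
hypothetical_query = ({1, 3}, 2)       # invented second query
print(round(average_precision(*example_query), 3))  # 0.634
print(round(mean_average_precision([example_query, hypothetical_query]), 3))
```

The unrounded result is 0.634, slightly above the slide's 0.633, which sums the rounded values 0.667 and 0.38.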

Normalized Discounted Cumulative Gain (NDCG): for non-binary notions
of relevance.
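
The slides do not give a formula for NDCG, so the sketch below assumes one common formulation: each graded relevance value is divided by log2(rank + 1), and the sum is normalized by the score of the ideal reordering of the same list. Other variants, for instance with an exponential gain 2^rel - 1, also exist. The function names and the graded judgments are hypothetical.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: sum of gain / log2(rank + 1), ranks starting at 1."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains):
    """DCG divided by the DCG of the same gains sorted best-first."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Hypothetical graded judgments (0 = nonrelevant ... 3 = highly relevant) by rank.
gains = [3, 2, 3, 0, 1, 2]
print(round(dcg(gains), 3))
print(round(ndcg(gains), 3))
```

Normalizing against the same list is a simplification; a fuller evaluation would build the ideal ranking from all judged documents for the query.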

Assessing Relevance
● Pooling: to obtain a subset of the collection related to the query

– Use a set of search engines/algorithms
– The top-k results (k is between 20 and 50 in TREC) are
  merged into a pool, duplicates are removed
– Present the documents in a random order to analysts for
  relevance judgments
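
The pooling steps can be sketched as follows. This is a minimal illustration, not TREC's actual tooling: the function name, the seed parameter, and the document IDs are all hypothetical.

```python
import random


def build_pool(rankings, k, seed=None):
    """Merge the top-k results of each system, drop duplicates, shuffle for judging."""
    pool = set()
    for ranking in rankings:
        pool.update(ranking[:k])
    pooled = sorted(pool)  # fixed starting order so a given seed is reproducible
    random.Random(seed).shuffle(pooled)
    return pooled


# Hypothetical ranked outputs of three systems for one query.
system_a = ["d1", "d2", "d3", "d4"]
system_b = ["d2", "d5", "d1", "d6"]
system_c = ["d7", "d2", "d8", "d1"]
print(build_pool([system_a, system_b, system_c], k=3, seed=0))
```

Shuffling hides which system returned a document and at what rank, so the assessors judge each document on its content alone.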

● Kappa statistic: if we have multiple judges on one information need,
  how consistent are those judges?
  kappa = (P(A) – P(E)) / (1 – P(E))
   – P(A) is the proportion of the times that the judges
     agreed
   – P(E) is the proportion of the times they would be
     expected to agree by chance

Example: Kappa Statistic
                      Judge 2 Relevance
                      Yes     No      Total
Judge 1       Yes     300     20      320
Relevance     No      10      70      80
              Total   310     90      400
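Observed proportion of the times the judges agreed:
P(A) = (300 + 70)/400 = 370/400 = 0.925

Pooled marginals:
P(nonrelevant) = (80 + 90)/(400 + 400) = 170/800 = 0.2125
P(relevant) = (320 + 310)/(400 + 400) = 630/800 = 0.7875

Probability that the two judges agreed by chance (max value = 1, min = 0.5):
P(E) = P(nonrelevant)^2 + P(relevant)^2 = 0.2125^2 + 0.7875^2 = 0.665

Kappa statistic:
kappa = (P(A) – P(E)) / (1 – P(E)) = (0.925 – 0.665)/(1 – 0.665) = 0.776

A kappa value between 0.67 and 0.8 is taken as fair agreement; a value
below 0.67 is seen as data providing a dubious basis for evaluation.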
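
The kappa computation for a two-judge, two-class table can be sketched in Python using pooled marginals, as named on the slide. The function and parameter names are hypothetical; the four counts are the cells of the table in this example.

```python
def kappa_pooled(yes_yes, yes_no, no_yes, no_no):
    """Kappa for two judges with chance agreement from pooled marginals.

    Arguments are cell counts named (judge 1 verdict)_(judge 2 verdict).
    """
    n = yes_yes + yes_no + no_yes + no_no
    p_agree = (yes_yes + no_no) / n
    judge1_yes = yes_yes + yes_no
    judge2_yes = yes_yes + no_yes
    p_rel = (judge1_yes + judge2_yes) / (2 * n)  # pooled over both judges
    p_nonrel = 1 - p_rel
    p_chance = p_rel ** 2 + p_nonrel ** 2
    return (p_agree - p_chance) / (1 - p_chance)


print(round(kappa_pooled(300, 20, 10, 70), 3))  # 0.776
```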

Evaluation
R-Precision: precision after the top R documents have been retrieved.
R = # of relevant docs = 7
R-Precision = 4/7 = 0.571 (4 of the top 7 results are relevant)

 n   doc #   relevant
 1   588     x
 2   589     x
 3   576
 4   590     x
 5   986
 6   592     x
 7   984
 8   988
 9   578
10   985
11   103
12   591
13   772     x
14   990

A/B test: precisely one change between the current and previous system.
We evaluate the effect of that change on the system.
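
R-Precision for this ranking can be checked with a short sketch. The function name is hypothetical; the relevant ranks (1, 2, 4, 6, 13) and R = 7 are taken from the slide.

```python
def r_precision(relevant_ranks, num_relevant):
    """Precision over the top R results, where R is the number of relevant documents."""
    hits = sum(1 for rank in relevant_ranks if rank <= num_relevant)
    return hits / num_relevant


print(round(r_precision({1, 2, 4, 6, 13}, 7), 3))  # 0.571
```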

Summary
● F-measure: combines precision and recall.
● Recall-precision graph – conveys more information than
a single-number measure.
● Mean average precision – single-number value, popular
measure.
● Normalized Discounted Cumulative Gain (NDCG) – single-number
summary for each rank level, emphasizing top-ranked
documents; relevance judgments are only needed to a specific rank
depth (e.g., 10).
● Kappa measure: reliability of relevance judgments.
● R-Precision: only the top R documents need to be examined, where R
is the number of relevant documents.
