Evaluation in Information
               Retrieval


      (Book chapter from C. D. Manning, P. Raghavan, and H. Schütze,
                Introduction to Information Retrieval)



                            Dishant Ailawadi
    INF384H / CS395T: Concepts of Information Retrieval (and Web Search) Fall11




                                         
Outline

● Why Evaluation?
● Standard Test Collections
● Precision and Recall
● Mean Average Precision
● Kappa Statistic
● R-Precision
● Summary




                           
Why Evaluation?


● There are many retrieval models, algorithms, and systems. Which one is the best?
● Measure the effect of adding new features.
● How far down the ranked list will a user need to look to find some or all relevant documents?
● Difficulty: relevance is not binary but continuous. How do we decide whether a document is relevant?



                                  
Standard Test Collections
A standard test collection consists of three things:
1. A document collection.
2. A set of queries on this collection.
3. A set of relevance judgments on those queries.

For each query, a document in the test collection is given a binary
classification as either relevant or nonrelevant. This decision is referred
to as the gold standard or ground truth judgment of relevance.




                                  
Standard Test Collections

● Cranfield: 1950s in the UK. Too small to be used nowadays.
● TREC (Text REtrieval Conference)
    ● Early TRECs had 50 information needs; TREC 6-8 provide 150
      information needs over more than 500 thousand articles.
    ● More recent work uses the 25 million pages of GOV2, which is now
      available for research.
● NTCIR: East Asian language and cross-language IR systems.
● Cross Language Evaluation Forum (CLEF).
● Reuters-21578: the collection most used for text classification.

                                           
Evaluation Measures
                      Relevant               Non Relevant
     Retrieved        True positives (tp)    False positives (fp)
     Not Retrieved    False negatives (fn)   True negatives (tn)


    recall = (number of relevant documents retrieved) / (total number of relevant documents) = tp/(tp + fn)

    precision = (number of relevant documents retrieved) / (total number of documents retrieved) = tp/(tp + fp)



 
    (How many correct selections?) Accuracy = (tp + tn)/(tp + fp + fn + tn)
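
The three formulas above can be sketched in a few lines of Python. This is only an illustration: the function name `set_metrics` and the counts passed to it are made up for the example and do not come from the book chapter.

```python
def set_metrics(tp, fp, fn, tn):
    """Return (precision, recall, accuracy) from contingency-table counts."""
    # Guard against empty denominators (nothing retrieved / nothing relevant).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy


# Hypothetical counts: 40 documents retrieved, 50 relevant, 1000 in total.
p, r, a = set_metrics(tp=30, fp=10, fn=20, tn=940)
print(f"precision={p:.2f} recall={r:.2f} accuracy={a:.2f}")
```

With these made-up counts accuracy is much higher than precision or recall, because the large number of true negatives dominates the accuracy formula.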
                                     
An Example
    Let total # of relevant docs = 6
    Check each new recall point:

     n   doc #   relevant   recall         precision
     1   588     x          R=1/6=0.167    P=1/1=1
     2   589     x          R=2/6=0.333    P=2/2=1
     3   576
     4   590     x          R=3/6=0.5      P=3/4=0.75
     5   986
     6   592     x          R=4/6=0.667    P=4/6=0.667
     7   984
     8   988
     9   578
    10   985
    11   103
    12   591
    13   772     x          R=5/6=0.833    P=5/13=0.38
    14   990

    Missing one relevant document: we never reach 100% recall.
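
The recall points of this example can be recomputed with a short loop. A minimal sketch, assuming the relevant documents sit at ranks 1, 2, 4, 6, and 13 as in the slide; the function name `recall_precision_points` is made up for the illustration.

```python
def recall_precision_points(relevant_flags, total_relevant):
    """Return (rank, recall, precision) at every rank holding a relevant doc."""
    points = []
    hits = 0
    for rank, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            hits += 1
            points.append((rank, hits / total_relevant, hits / rank))
    return points


# Ranks marked "x" in the example; 14 documents retrieved, 6 relevant overall.
relevant_ranks = {1, 2, 4, 6, 13}
flags = [n in relevant_ranks for n in range(1, 15)]
points = recall_precision_points(flags, total_relevant=6)
for rank, recall, precision in points:
    print(f"rank {rank:2d}: R={recall:.3f}  P={precision:.3f}")
```

Only five points are produced because the sixth relevant document is never retrieved.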

                                 
Combining Precision & Recall
F-Measure: weighted harmonic mean (HM) of precision and recall.

    F = (β² + 1)PR / (β²P + R)

Value of β controls the trade-off:
● β = 1: Equally weight precision and recall.
● β > 1: Weight recall more.
● β < 1: Weight precision more.

With β = 1:

    F = 2PR / (P + R) = 2 / (1/R + 1/P)
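
A small sketch of the weighted F-measure, using the standard form F = (β² + 1)PR / (β²P + R), which reduces to 2PR/(P + R) for β = 1. The function name `f_measure` and the precision and recall values are illustrative assumptions.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # convention chosen here to avoid dividing by zero
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)


# Hypothetical system with higher precision than recall.
P, R = 0.75, 0.5
f1 = f_measure(P, R)              # balanced
f2 = f_measure(P, R, beta=2.0)    # recall weighted more
f_half = f_measure(P, R, beta=0.5)  # precision weighted more
print(f1, f2, f_half)
```

Because recall is the weaker of the two values here, emphasizing recall (β = 2) lowers the score and emphasizing precision (β = 0.5) raises it.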

                                   
Precision-Recall curve




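Interpolated Precision: used to get a smooth curve. The interpolated precision at recall level r is the highest precision found at any recall level r' ≥ r.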

                                  
11-point Interpolated Average Precision

Recall   Interp. Precision
   0.0      1.00
   0.1      0.67
   0.2      0.63
   0.3      0.55
   0.4      0.45
   0.5      0.41
   0.6      0.36
   0.7      0.29
   0.8      0.13
   0.9      0.10
   1.0      0.08
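
Interpolated precision at a recall level is usually taken as the highest precision observed at that recall level or any higher one. The sketch below applies that rule at the 11 standard recall levels. It uses the recall points of the earlier ranked-list example (relevant documents at ranks 1, 2, 4, 6, 13; 6 relevant in total), not the data behind the table above, and the function names are made up for the illustration.

```python
def interpolated_precision(points, level):
    """Highest precision among points whose recall is >= level (0 if none)."""
    return max((p for r, p in points if r >= level), default=0.0)


def eleven_point_values(points):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0 and their mean."""
    levels = [i / 10 for i in range(11)]
    values = [interpolated_precision(points, level) for level in levels]
    return values, sum(values) / len(values)


# (recall, precision) pairs from the earlier ranked-list example.
points = [(1 / 6, 1.0), (2 / 6, 1.0), (3 / 6, 0.75), (4 / 6, 4 / 6), (5 / 6, 5 / 13)]
values, average = eleven_point_values(points)
for i, value in enumerate(values):
    print(f"{i / 10:.1f}  {value:.2f}")
```

This handles a single query; in an evaluation the value at each recall level would normally also be averaged over all queries in the test collection.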

                         
Single Figure Measures

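Mean Average Precision (MAP): the mean, over all queries, of the average
precision of each query.
Example: average precision for the earlier ranked list:
(1 + 1 + 0.75 + 0.667 + 0.38 + 0)/6 = 0.633
The final 0 stands for the sixth relevant document, which is never retrieved.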
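
The average precision example can be checked in code. A minimal sketch, assuming relevant documents at ranks 1, 2, 4, 6, and 13 out of 6 relevant in total; the function names and the second query in the MAP call are made up for the illustration.

```python
def average_precision(relevant_flags, total_relevant):
    """Sum of precision at each relevant rank, divided by all relevant docs.

    Relevant documents that are never retrieved contribute 0 to the sum.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant if total_relevant else 0.0


def mean_average_precision(runs):
    """Mean of average precision over (relevant_flags, total_relevant) pairs."""
    return sum(average_precision(flags, n) for flags, n in runs) / len(runs)


flags = [n in {1, 2, 4, 6, 13} for n in range(1, 15)]
ap = average_precision(flags, total_relevant=6)
# Second query is hypothetical: one relevant document, retrieved at rank 2.
map_score = mean_average_precision([(flags, 6), ([False, True], 1)])
print(f"AP={ap:.3f} MAP={map_score:.3f}")
```

The code keeps full floating-point precision, so its result can differ in the third decimal from a hand calculation that rounds each term first.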



Normalized Discounted Cumulative Gain (NDCG): for non-binary notions of
relevance.
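
One common formulation of NDCG uses a gain of 2^rel - 1 for each document and a discount of log2(1 + rank). The sketch below follows that formulation; the graded labels (0 to 3) are made up, and the ideal ranking is built only from the labels of the list being scored, which is a simplifying assumption rather than the full procedure.

```python
import math


def dcg(gains, k):
    """Discounted cumulative gain of the first k graded relevance labels."""
    return sum(
        (2 ** gain - 1) / math.log2(1 + rank)
        for rank, gain in enumerate(gains[:k], start=1)
    )


def ndcg(gains, k):
    """DCG divided by the DCG of the best possible ordering of the same labels."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0


# Hypothetical graded judgments for one ranked list (3 = highly relevant).
ranking = [3, 0, 2, 1]
print(f"NDCG@4 = {ndcg(ranking, 4):.3f}")
```

A list already sorted by decreasing relevance scores 1.0, and placing highly relevant documents lower in the list reduces the score.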



                              
Assessing Relevance
● Pooling: To obtain a subset of the collection related to the query
    – Use a set of search engines/algorithms
    – The top-k results (k is between 20 and 50 in TREC) are
      merged into a pool, and duplicates are removed
    – Present the documents in a random order to analysts for
      relevance judgments
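
The pooling steps above can be sketched as follows. The function name `build_pool`, the document ids, and the three runs are made up for the illustration; a real evaluation would take the runs from the participating systems.

```python
import random


def build_pool(runs, k, seed=None):
    """Merge the top-k results of every run, drop duplicates, and shuffle."""
    pool = []
    seen = set()
    for ranked_docs in runs:
        for doc_id in ranked_docs[:k]:
            if doc_id not in seen:
                seen.add(doc_id)
                pool.append(doc_id)
    # Random order, so assessors cannot infer how the systems ranked a document.
    random.Random(seed).shuffle(pool)
    return pool


# Hypothetical ranked lists from three systems for the same query.
runs = [
    [588, 589, 576, 590],
    [589, 986, 588, 592],
    [984, 588, 590, 103],
]
pool = build_pool(runs, k=3, seed=0)
print(pool)
```

Documents outside the pool are never judged, so they are usually treated as nonrelevant when the measures are computed.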


● Kappa Statistic:
     If we have multiple judges on one information need, how consistent are
      those judges?
  kappa = (P(A) – P(E)) / (1 – P(E))
   – P(A) is the proportion of the times that the judges
     agreed
   – P(E) is the proportion of the times they would be
     expected to agree by chance

Example: Kappa Statistic
                        Judge 2 Relevance
                        Yes     No      Total
Judge 1       Yes       300     20      320
Relevance     No        10      70      80
              Total     310     90      400
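Observed proportion of the times the judges agreed:
  P(A) = (300 + 70)/400 = 370/400 = 0.925

Pooled marginals:
  P(nonrelevant) = (80 + 90)/(400 + 400) = 170/800 = 0.2125
  P(relevant) = (320 + 310)/(400 + 400) = 630/800 = 0.7875

Probability that two judges agreed by chance (for two categories, max value = 1, min = 0.5):
  P(E) = P(nonrelevant)² + P(relevant)² = 0.2125² + 0.7875² = 0.665

Kappa statistic:
  kappa = (P(A) – P(E)) / (1 – P(E)) = (0.925 – 0.665)/(1 – 0.665) = 0.776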


A kappa value between 0.67 and 0.8 is taken as fair agreement, but a value
below 0.67 is seen as data providing a dubious basis for evaluation.

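
The kappa example can be recomputed from the four cells of the judges' table. A minimal sketch using pooled marginals for the chance agreement; the function and argument names are made up for the illustration.

```python
def kappa_from_table(yes_yes, yes_no, no_yes, no_no):
    """Kappa for two judges and two categories, using pooled marginals.

    Arguments are the cell counts (judge 1 answer, judge 2 answer).
    """
    n = yes_yes + yes_no + no_yes + no_no
    p_agree = (yes_yes + no_no) / n
    judge1_yes = yes_yes + yes_no
    judge2_yes = yes_yes + no_yes
    # Pool both judges' answers to estimate the category probabilities.
    p_relevant = (judge1_yes + judge2_yes) / (2 * n)
    p_nonrelevant = 1 - p_relevant
    p_chance = p_relevant ** 2 + p_nonrelevant ** 2
    if p_chance == 1:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_agree - p_chance) / (1 - p_chance)


# Counts from the example table: 300 yes/yes, 20 yes/no, 10 no/yes, 70 no/no.
kappa = kappa_from_table(300, 20, 10, 70)
print(f"kappa = {kappa:.3f}")
```

For the example table this gives a value in the "fair agreement" band between 0.67 and 0.8.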
Evaluation
R-PRECISION: precision at rank R, where R is the number of relevant documents.

    R = # of relevant docs = 7
    R-Precision = 4/7 = 0.571 (4 of the top 7 documents are relevant)

     n   doc #   relevant
     1   588     x
     2   589     x
     3   576
     4   590     x
     5   986
     6   592     x
     7   984
     8   988
     9   578
    10   985
    11   103
    12   591
    13   772     x
    14   990

A/B Test: Precisely one change between the current and the previous system.
We evaluate the effect of that change on the system.
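
The R-precision example on this slide can be checked with a few lines. A minimal sketch, assuming R = 7 and relevant documents at ranks 1, 2, 4, 6, and 13 as in the ranked list; the function name is made up for the illustration.

```python
def r_precision(relevant_flags, total_relevant):
    """Fraction of relevant documents among the top R results, R = total_relevant."""
    if total_relevant == 0:
        return 0.0
    top_r = relevant_flags[:total_relevant]
    return sum(top_r) / total_relevant


flags = [n in {1, 2, 4, 6, 13} for n in range(1, 15)]
rp = r_precision(flags, total_relevant=7)
print(f"R-Precision = {rp:.3f}")
```

Only the first R entries of the ranked list are inspected, so the relevant document at rank 13 does not affect the result.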




                               
Summary
● F-Measure: Combines precision and recall.
● Recall-precision graph: conveys more information than a single-number
  measure.
● Mean average precision: single-number value, popular measure.
● Normalized Discounted Cumulative Gain (NDCG): single-number summary for
  each rank level, emphasizing top-ranked documents; relevance judgments
  are only needed to a specific rank depth (e.g., 10).
● Kappa Measure: Judgment reliability.
● R-Precision: Only the top R documents need to be examined, where R is the
  number of relevant documents.




                                 
THANK YOU!




         
