Daniel Schneiter
Elastic{Meetup} #41, Zürich, April 9, 2019
Original author: Christoph Büscher
Made to Measure:

Ranking Evaluation
using Elasticsearch
!2
If you cannot measure it,
you cannot improve it!
AlmostAnActualQuoteTM by Lord Kelvin
https://commons.wikimedia.org/wiki/File:Portrait_of_William_Thomson,_Baron_Kelvin.jpg
!3
How good is your search?
Image by Kecko
https://www.flickr.com/photos/kecko/18146364972 (CC BY 2.0)
!4
Image by Muff Wiggler
https://www.flickr.com/photos/muffwiggler/5605240619 (CC BY 2.0)
!5
Ranking Evaluation



A repeatable way
to quickly measure the quality
of search results

over a wide range of user needs
!6
• Automate - don’t make people
look at screens
• no gut-feeling / “management-driven” ad-hoc search ranking
REPEATABILITY
!7
• fast iterations instead of long
waits (e.g. in A/B testing)
SPEED
!8
• numeric output
• support of different metrics
• define “quality“ in your domain
QUALITY

MEASURE
!9
• optimize across a wider range of use cases (aka “information needs”)
• think about what the majority
of your users want
• collect data to discover what is
important for your use case
USER

NEEDS
!10
Prerequisites for Ranking Evaluation
1. Define a set of typical information needs
2. For each search case, rate your documents for those information needs

(either binary relevant/non-relevant or on some graded scale; see the sketch below)
3. If full labelling is not feasible, choose a small subset instead

(often the case because the document set is too large)
4. Choose a metric to calculate.

Some good metrics already defined in Information Retrieval research:
• Precision@K, (N)DCG, ERR, Reciprocal Rank etc…
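To make step 2 concrete, here is a minimal sketch (not part of the original deck) of how such judgements could be kept in Python; the query texts and grades are invented for illustration, the document IDs echo the examples used later in the deck:

# Hypothetical relevance judgements: one entry per information need,
# mapping document IDs to a graded rating (0 = irrelevant ... 3 = highly relevant).
# A binary setup would simply use ratings 0 and 1.
judgements = {
    "hotel amsterdam": {
        "3054546": 3,
        "5119376": 1,
        "1960795": 0,
    },
    "cheap flights london": {
        "8812345": 2,
        "4455667": 0,
    },
}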
!11
Search Evaluation Continuum
[Chart: evaluation approaches plotted along two axes, speed (slow → fast) and preparation time (little → lots). Approaches shown: people looking at screens, some sort of unit test, QA assisted by scripts, user studies, A/B testing, and Ranking Evaluation.]
!12
Where Ranking Evaluation can help
• Development: guiding design decisions, enabling quick iteration
• Communication tool: helps define “search quality” more clearly, forces stakeholders to “get real” about their expectations
• Production: monitor changes, spot degradations
!13
Elasticsearch 

‘rank_eval’ API
!14
Ranking Evaluation API
GET /my_index/_rank_eval
{
  "metric": {
    "mean_reciprocal_rank": {
      [...]
    }
  },
  "templates": [{
    [...]
  }],
  "requests": [{
    "template_id": "my_query_template",
    "ratings": [...],
    "params": {
      "query_string": "hotel amsterdam",
      "field": "text"
    }
    [...]
  }]
}
• introduced in 6.2 (still experimental API)
• joint work between
• Christoph Büscher (@dalatangi)
• Isabel Drost-Fromm (@MaineC)
• Inputs:
• a set of search requests (“information needs”)
• document ratings for each request
• a metric definition; metrics currently available:
• Precision@K
• Discounted Cumulative Gain / (N)DCG
• Expected Reciprocal Rank / ERR
• MRR, …
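A minimal sketch (not part of the original deck) of calling the API from Python over plain HTTP; it assumes an unsecured local cluster on localhost:9200 and an index called my_index whose documents carry a text field, and it reuses the example ratings shown in these slides:

import requests  # the official Elasticsearch Python client would work just as well

body = {
    "requests": [
        {
            "id": "amsterdam_query",
            "request": {"query": {"match": {"text": "hotel amsterdam"}}},
            "ratings": [
                {"_index": "my_index", "_id": "3054546", "rating": 3},
                {"_index": "my_index", "_id": "5119376", "rating": 1},
            ],
        }
    ],
    "metric": {"precision": {"relevant_rating_threshold": 2, "k": 5}},
}

resp = requests.get("http://localhost:9200/my_index/_rank_eval", json=body)
print(resp.json()["metric_score"])  # overall score across all requests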

!15
Ranking Evaluation API Details
"metric": {
"precision": {
"relevant_rating_threshold": "2",
"k": 5
}
}
metric
"requests": [{
"id": "JFK_query",
"request": {
“query”: { […] }
},
"ratings": […]
},
… other use cases …]
requests
"ratings": [ {
"_id": "3054546",
"rating": 3
}, {
"_id": "5119376",
"rating": 1
}, […]
]
ratings
{
  "rank_eval": {
    "metric_score": 0.431,
    "details": {
      "my_query_id1": {
        "metric_score": 0.6,
        "unrated_docs": [
          {
            "_index": "idx",
            "_id": "1960795"
          }, [...]
        ],
        "hits": [...],
        "metric_details": {
          "precision": {
            "relevant_docs_retrieved": 6,
            "docs_retrieved": 10
          }
        }
      },
      "my_query_id2": { [...] }
    }
  }
}
!16
_rank_eval response
• overall score: the top-level "metric_score"
• details per query: the entries under "details"
• maybe rate those? documents listed in "unrated_docs"
• details about the metric: "metric_details"
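As an illustration of how the response might be consumed, a small Python sketch that flags weak queries and collects documents that still need ratings; the field names follow the response shown above, and the 0.5 threshold is an arbitrary assumption:

def report(rank_eval_response, threshold=0.5):
    """Summarize a _rank_eval response: overall score, weak queries, unrated docs."""
    print("overall:", rank_eval_response["metric_score"])
    for query_id, detail in rank_eval_response["details"].items():
        if detail["metric_score"] < threshold:
            print(f"  {query_id} below threshold: {detail['metric_score']:.2f}")
        for doc in detail["unrated_docs"]:
            print(f"  still unrated: {doc['_index']}/{doc['_id']}")

# e.g. report(resp.json()) with resp from the earlier sketch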
!17
How to get document ratings?
1. Define a set of typical information needs of users

(e.g. analyze logs, ask product management / customer etc…)
2. For each case, get a small set of candidate documents
(e.g. with a very broad query; see the sketch below)
3. Rate those documents with respect to the underlying information need
• can initially be done by you or other stakeholders;
later maybe outsourced, e.g. via Mechanical Turk
4. Iterate!
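A sketch of step 2 in Python: run a deliberately broad query and dump the top hits to a CSV for a human rater to fill in. The index name, field names, and output path are assumptions made for this example:

import csv
import requests

query = "hotel amsterdam"
resp = requests.get(
    "http://localhost:9200/my_index/_search",
    json={"size": 20, "query": {"match": {"text": query}}},  # deliberately broad
)

with open("to_rate.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["query", "doc_id", "title", "rating"])
    for hit in resp.json()["hits"]["hits"]:
        # rating column left empty for the human rater
        writer.writerow([query, hit["_id"], hit["_source"].get("title", ""), ""])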
!18
Metrics currently available
• Precision At K: set-based metric; ratio of relevant docs in the top K results (binary ratings)
• Reciprocal Rank (RR): positional metric; inverse of the rank of the first relevant document (binary ratings)
• Discounted Cumulative Gain (DCG): takes order into account; highly relevant docs score more if they appear earlier in the result list (graded ratings)
• Expected Reciprocal Rank (ERR): motivated by the “cascade model” of search; models the dependency of results on their predecessors (graded ratings)
!19
Precision At K
• In short: “How many good results appear in the first K results”

(e.g. first few pages in UI)
• supports only boolean relevance judgements
• PROS: easy to understand & communicate
• CONS: least stable across different user needs, e.g. total number of
relevant documents for a query influences precision at k
$\mathrm{prec@}k = \frac{\#\{\text{relevant docs}\}}{\#\{\text{all results at } k\}}$
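For intuition only, a hand-rolled version in Python (not the API's implementation); ranked_doc_ids is the result list and relevant the set of documents judged relevant, both names chosen for this sketch:

def precision_at_k(ranked_doc_ids, relevant, k):
    """Fraction of the top-k results that are judged relevant."""
    top_k = ranked_doc_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)

# precision_at_k(["d1", "d2", "d3"], {"d1", "d3"}, k=3) == 2/3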
!20
Reciprocal Rank
• supports only boolean relevance judgements
• PROS: easy to understand & communicate
• CONS: limited to cases where amount of good results doesn’t matter
• If averaged over a sample of queries Q often called MRR

(mean reciprocal rank):
$\mathrm{RR} = \frac{1}{\text{position of first relevant document}}$

$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}$
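The same two formulas as a small Python sketch (illustrative only, binary judgements):

def reciprocal_rank(ranked_doc_ids, relevant):
    """1 / rank of the first relevant document, 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(queries):
    """Average RR over a sample of (ranked_doc_ids, relevant_set) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)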
!21
Discounted Cumulative Gain (DCG)
• Predecessor: Cumulative Gain (CG)
• sums relevance judgement over top k results
$\mathrm{CG} = \sum_{i=1}^{k} rel_i \qquad \mathrm{DCG} = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}$
• DCG takes position into account
• divides by $\log_2(i+1)$ at each position
• NDCG (Normalized DCG)
• divides by the “ideal” DCG for a query (IDCG): $\mathrm{NDCG} = \frac{\mathrm{DCG}}{\mathrm{IDCG}}$
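The same definitions as a Python sketch; ratings are the graded judgements of the returned documents in result order, and the ideal DCG simply re-sorts those same ratings, which is enough for illustration:

import math

def dcg(ratings, k):
    """Discounted Cumulative Gain over the top-k graded ratings."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(ratings[:k], start=1))

def ndcg(ratings, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(ratings, reverse=True), k)
    return dcg(ratings, k) / ideal if ideal > 0 else 0.0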
!22
Expected Reciprocal Rank (ERR)
• cascade based metric
• supports graded relevance judgements
• model assumes user goes through

result list in order and is satisfied with

the first relevant document
• $R_i$: probability that the user stops at position i
• ERR is high when relevant documents appear early
$\mathrm{ERR} = \sum_{r=1}^{k} \frac{1}{r} \left( \prod_{i=1}^{r-1} (1 - R_i) \right) R_r \qquad R_i = \frac{2^{rel_i} - 1}{2^{rel_{\max}}}$
$rel_i$: relevance at position i
$rel_{\max}$: maximal relevance grade
!23
DEMO TIME
!24
Demo project and Data
• Demo uses approx. 1,800 documents from the English Wikipedia
• Wikipedia's Discovery department collects and publishes relevance
judgements with its Discernatron project
• Bulk data and all query examples available at

https://github.com/cbuescher/rankEvalDemo
!25
Q&A
!26
Some questions I have for you…
• How do you measure search relevance currently?
• Did you find anything useful about the ranking evaluation approach?
• Feedback about the usability of the API
(ping me on GitHub or our Discuss forum: @cbuescher)
!27
Further reading
• Manning, Raghavan & Schütze: Introduction to Information Retrieval. Cambridge University Press, 2008.
• Chapelle, O., Metzler, D., Zhang, Y., & Grinspan, P. (2009). Expected Reciprocal Rank for Graded Relevance. Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM ’09), 621.
• Blog: https://www.elastic.co/blog/made-to-measure-how-to-use-the-ranking-evaluation-api-in-elasticsearch
• Docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-rank-eval.html
• Discuss: https://discuss.elastic.co/c/elasticsearch (cbuescher)
• GitHub: the :Search/Ranking label (cbuescher)
