© 2019 The MITRE Corporation. All rights reserved.
Quaerite – Search Relevance Toolkit
Tim Allison
tallison@apache.org, @_tallison
April 24, 2019
Haystack Conference
Approved for Public Release; Distribution Unlimited. Case Number 18-3138-5
Debt of Gratitude
▪ Thank you Doug Turnbull, John Berryman and Open Source
Connections for the inspiration/examples/training with tmdb and for
sharing your ground truth set!
Yet Another Toolkit? Why!?
▪ How many parameters do we have?
▪ How many permutations of those parameters are available?
Available Parameters
▪ 14 tokenizers https://lucene.apache.org/solr/guide/7_1/tokenizers.html
▪ ~45 token filters (not including language-specific token filters – see next slide)
https://lucene.apache.org/solr/guide/7_1/filter-descriptions.html
▪ Query parsers
▪ Query operators, minimum should match, should, must, not
▪ Token/field based scoring – best_fields, most_fields, cross_fields
▪ Field boosting
▪ Phrasal boosting/shingling
▪ Synonym lists, taxonomies
▪ Similarity scoring parameters (with BM25)
▪ Elevate
▪ External signal enrichment
– manual or automatic (NLP – entity extraction, categorization, etc.)
▪ Reranking via machine learning (Learning to Rank)
Each Token Filter Can Have Many Parameters
<filter class="solr.WordDelimiterFilterFactory"
protected="protwords.txt"
generateWordParts="1"
generateNumberParts="1"
catenateWords="1"
catenateNumbers="1"
catenateAll="0"
splitOnCaseChange="0"
preserveOriginal="1"/>
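To make the size of the search space concrete, here is a minimal sketch that enumerates every on/off setting of the seven boolean flags shown above. It only counts configurations for this one filter. The `protected` file is left out, and nothing here is Quaerite's own API.

```python
from itertools import product

# The seven boolean flags shown on the slide for solr.WordDelimiterFilterFactory.
FLAGS = [
    "generateWordParts",
    "generateNumberParts",
    "catenateWords",
    "catenateNumbers",
    "catenateAll",
    "splitOnCaseChange",
    "preserveOriginal",
]


def flag_settings(flags):
    """Yield every on/off assignment of the given flags as a dict."""
    for values in product(("0", "1"), repeat=len(flags)):
        yield dict(zip(flags, values))


all_settings = list(flag_settings(FLAGS))
print(len(all_settings))  # 2 ** 7 = 128
```

One filter alone gives 128 candidate configurations, before tokenizers, other filters, or query-time parameters are considered.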
Overview – Offline testing toolkit
Prerequisites:
1. Reliable, generalizable ground truth
2. Reliable, useful underlying data
3. Offline metric has to have some connection to KPIs
4. Expertise – you still have to know what you’re doing!!!
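As an illustration of what an offline metric computed against ground truth can look like, here is a minimal sketch of NDCG@k for a single query. The choice of NDCG, the document ids, and the judgments are assumptions for illustration. The slides do not say which metrics the toolkit uses.

```python
import math


def dcg(gains):
    """Discounted cumulative gain of a list of graded relevance values."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_at_k(ranked_ids, judgments, k=10):
    """NDCG@k for one query.

    ranked_ids: document ids in the order the engine returned them.
    judgments: dict of document id -> graded relevance (unjudged counts as 0).
    """
    gains = [judgments.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical judgments and two hypothetical result lists for one query.
judgments = {"d1": 3, "d2": 2, "d3": 0, "d4": 1}
perfect = ndcg_at_k(["d1", "d2", "d4", "d3"], judgments)
worse = ndcg_at_k(["d3", "d4", "d2", "d1"], judgments)
print(perfect, worse)
```

Averaging such a score over all judged queries gives one number per experiment, which is only useful if it tracks the KPIs named in prerequisite 3.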
Main Tools
1. Run Experiments
2. Generate Experiments
▪ All permutations (grid search)
▪ Random experiments (random search)
3. Genetic Algorithm
▪ Cross-fold validation!!!
▪ Complementary to LTR: the main differences are the algorithm and that it runs offline to tune general settings rather than reranking the top n results
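The two "generate experiments" modes above can be sketched as follows. This is a minimal illustration, not Quaerite's implementation. The parameter names `qf`, `mm`, and `tie` are Solr edismax parameters chosen as an assumption, and the candidate values are made up.

```python
import random
from itertools import product

# Hypothetical parameter space; names and candidate values are illustrative only.
PARAM_SPACE = {
    "qf": ["title^1 overview^1", "title^5 overview^1", "title^10 overview^1"],
    "mm": ["1", "50%", "100%"],
    "tie": [0.0, 0.1, 0.3],
}


def grid_experiments(space):
    """Yield every combination of parameter values (grid search)."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def random_experiments(space, n, seed=0):
    """Return n experiments, each drawing one value per parameter at random."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n)
    ]


grid = list(grid_experiments(PARAM_SPACE))
sample = random_experiments(PARAM_SPACE, n=5)
print(len(grid))    # 3 * 3 * 3 = 27
print(len(sample))  # 5
```

Grid search grows multiplicatively with every added parameter, which is why random search, or a genetic algorithm, becomes attractive once the space is large.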
Odds and Ends
▪ Analyzer Comparison over (mostly) the index
▪ Significant Terms (yawn…for archaic versions of Solr)…and planning to
add these as parameters in “generate experiments”
Adding Porter Stemming: create account
creat
created: 709
create: 551
creating: 269
creates: 153
creat: 1
account
account: 3244
accounts: 1924
accounting: 1548
accountants: 340
accountant: 176
accounted: 134
accountability: 74
accountable: 74
accountancy: 65
account's: 7
accountant's: 7
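To show how much each stem widens the match, the sketch below totals the counts listed above per stem and reports the share contributed by the exact query word. The counts are copied from the slide. The stem-to-form grouping is hard-coded rather than computed by a stemmer, and if the counts are document frequencies the totals are an upper bound, since one document can contain several forms.

```python
# Counts copied from the slide: stem -> {surface form: count}.
STEM_GROUPS = {
    "creat": {
        "created": 709, "create": 551, "creating": 269,
        "creates": 153, "creat": 1,
    },
    "account": {
        "account": 3244, "accounts": 1924, "accounting": 1548,
        "accountants": 340, "accountant": 176, "accounted": 134,
        "accountability": 74, "accountable": 74, "accountancy": 65,
        "account's": 7, "accountant's": 7,
    },
}


def summarize(groups):
    """Return {stem: (number of surface forms, total count across forms)}."""
    return {stem: (len(forms), sum(forms.values()))
            for stem, forms in groups.items()}


summary = summarize(STEM_GROUPS)
for stem, query_word in (("creat", "create"), ("account", "account")):
    n_forms, total = summary[stem]
    share = STEM_GROUPS[stem][query_word] / total
    print(f"{stem}: {n_forms} forms, total {total}, "
          f"'{query_word}' is {share:.0%} of the total")
```

For both words, the exact form accounts for well under half of the total, so adding the stemmer substantially changes what the query "create account" can match.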
Status
▪ Alpha release 3/22/2019 (Solr only)
▪ Beta1 release this week (?)
– This will include support for Elasticsearch
▪ Dream
– Incorporate experiment generation/GA into Rated Ranking Evaluator (RRE)
– Apache Incubator -> Top Level Project (TLP)
Links
▪ Main site: https://github.com/mitre/quaerite
▪ Examples: https://github.com/mitre/quaerite/blob/master/quaerite-examples/README.md
▪ Contact
– tallison@apache.org
– @_tallison

Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit - Tim Allison
