Measuring Search Engine Quality using Spark and Python
Sujit Pal
March 13, 2016
Introduction
• About Me
  - Work at Elsevier Labs
  - Interests: Search, NLP and Distributed Processing
  - URL: labs.elsevier.com
  - Email: sujit.pal@elsevier.com
  - Blog: Salmon Run
  - Twitter: @palsujit
• About Elsevier
  - World's largest publisher of STM Books and Journals
  - Uses data to inform and enable consumers of STM info
Agenda
• Problem Description
• Our Solution
• Other Uses
• Future Work
• Q&A
Problem Description
Background
• Migrate the ScienceDirect search engine from Microsoft FAST to Apache Solr.
Success Criteria
• Keep search quality consistent across platforms.
• Full-text downloads from Solr must equal or exceed those from FAST.
Measurement
• A/B testing
But…
• A/B tests happen in production
  - Expensive: requires production deployment of the search engine(s).
  - Risky: a bad customer experience during the A/B test can drive a customer away.
  - Limited scope for iterative improvement, because of the expense and risk.
  - A/B tests take time to become statistically meaningful.
• We needed something that
  - Could be run by DEV/QA on demand.
  - Produces a repeatable indicator of search engine quality.
  - Does not require production deployment.
• That tool is the subject of this talk.
Our Solution
Solution Overview
• We have
  - Query logs – the query strings entered by users.
  - Click logs – the download links clicked by users.
• We can generate
  - Search results for each query against the search engine.
• Combining these, we can produce
  - A Click Rank distribution for a search engine (configuration).
Click Rank Definition
• Click Rank is the sum of the ranks of all PIIs (Publisher Item Identifiers, the document IDs used by ScienceDirect) in the result set that match the PIIs clicked for that query, divided by the number of matches.
• Deconstructing the above:
  - Let the documents clicked for a query (from the click logs) be the set Q.
  - Let the top N search results for the query be the ordered list R.
  - Let P be the intersection of Q and R, and let ranks(P) be the one-based positions in R of the documents in P.
  - Click Rank = (sum of the ranks in ranks(P)) / |P|
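A minimal Python sketch of this definition (illustrative names only, not the production code):

```python
def click_rank(clicked_piis, result_piis):
    """Click Rank for a single query.

    clicked_piis: the set Q of PIIs the user downloaded for this query.
    result_piis:  the ordered list R of the top-N PIIs returned by the engine.
    Returns the mean one-based rank of the clicked PIIs found in R,
    or None if there is no overlap.
    """
    ranks = [i + 1 for i, pii in enumerate(result_piis) if pii in clicked_piis]
    return sum(ranks) / float(len(ranks)) if ranks else None

# Example: clicked documents sit at positions 1 and 4, so Click Rank = (1 + 4) / 2 = 2.5
print(click_rank({"S001", "S004"}, ["S001", "S002", "S003", "S004", "S005"]))
```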
Inputs
Preprocess Logs
• Query and Click logs already merged by A/B test framework.
• Provided as line-oriented JSON format.
Generate Search Results for Engine
• Replay the query logs against the search engine (configuration).
• Use the Python multiprocessing module to run queries in parallel.
Search Result Generation Code
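The code on this slide appears only as a screenshot in the original deck. The listing below is a rough reconstruction of the approach just described, assuming a Solr select handler and invented field, path and file names:

```python
import json
import multiprocessing as mp

import requests

SOLR_URL = "http://localhost:8983/solr/sciencedirect/select"   # hypothetical endpoint
TOP_N = 50

def run_query(query_record):
    """Replay one logged query and save the top-N PIIs, one ranked line per result."""
    qid, qtext = query_record["query_id"], query_record["query"]
    params = {"q": qtext, "rows": TOP_N, "fl": "pii", "wt": "json"}
    docs = requests.get(SOLR_URL, params=params).json()["response"]["docs"]
    with open("results/%s.txt" % qid, "w") as fout:
        for rank, doc in enumerate(docs, start=1):
            fout.write("%d\t%s\n" % (rank, doc["pii"]))

if __name__ == "__main__":
    with open("query_logs.json") as fin:
        queries = [json.loads(line) for line in fin]
    pool = mp.Pool(processes=mp.cpu_count())
    pool.map(run_query, queries)
    pool.close()
    pool.join()
```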
Search Results Example Data
• Search Results are saved one file per query.
• Top 50 results extracted (so each file is 50 lines long).
• Maintains parity with the FAST reference query results (provided one-time via a legacy process).
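The example data itself is not reproduced in this transcript. Purely for illustration (the PIIs below are invented), a per-query result file in the format assumed by the sketches here might look like:

```
1	S0001234567890123
2	S0004567890123456
3	S0007890123456789
...
50	S0009999999999999
```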
Compute Click Rank Distribution
• Use Apache Spark to compute the Click Rank Distribution.
• Use Python + Matplotlib to build reports and visualizations.
Spark Code for Generating Click Rank Distribution
• Skeleton of a PySpark Program
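The skeleton itself is a screenshot in the original deck; a minimal PySpark skeleton along the lines described (paths and names are illustrative) might be:

```python
from pyspark import SparkConf, SparkContext

def main():
    conf = SparkConf().setAppName("click-rank-distribution")
    sc = SparkContext(conf=conf)

    # Hypothetical input locations; the real job read from S3.
    clicks = sc.textFile("s3://mybucket/click_logs/")
    results = sc.wholeTextFiles("s3://mybucket/search_results/")

    # Steps #1-#3 (sketched on the following slides) turn these RDDs into
    # a (query_id, click rank) distribution and write it back out to S3.

    sc.stop()

if __name__ == "__main__":
    main()
```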
Spark Code for Generating Click Rank Distribution
• Step #1: Convert the Clicks data to (Query_ID, List of clicked
PIIs)
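Continuing the skeleton above, a sketch of Step #1, assuming each line of the merged click log is a JSON record carrying a query_id and the clicked pii (the real schema may differ):

```python
import json

def parse_click(line):
    """Extract (query_id, clicked PII) from one line of the merged query/click log."""
    rec = json.loads(line)
    return (rec["query_id"], rec["pii"])

# (query_id, [clicked PII, clicked PII, ...])
clicked_by_query = (clicks
    .map(parse_click)
    .groupByKey()
    .mapValues(list))
```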
Spark Code for Generating Click Rank Distribution
• Step #2: Convert the Search Results data to (Query_ID, List of
PIIs in search result)
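A sketch of Step #2 in the same vein, assuming one result file per query, named by its query id, with one "rank<TAB>pii" line per result (as in the earlier sketches):

```python
import os

def parse_result_file(path_and_content):
    """Turn one per-query result file into (query_id, [PII at rank 1, PII at rank 2, ...])."""
    path, content = path_and_content
    qid = os.path.splitext(os.path.basename(path))[0]   # file name is assumed to be the query id
    piis = [line.split("\t")[1] for line in content.strip().split("\n")]
    return (qid, piis)

# results was read with sc.wholeTextFiles(), so each element is (file path, file content)
results_by_query = results.map(parse_result_file)
```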
Spark Code for Generating Click Rank Distribution
• Step #3: Join the two RDDs and compute Click Rank from the
intersection of the clicked PIIs and the result PIIs for each
Query_ID.
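A sketch of Step #3, joining the pair RDDs from the two previous sketches and averaging the one-based ranks of the clicked PIIs:

```python
def rank_of_clicks(pair):
    """Average one-based rank of the clicked PIIs within the result list, or None if no overlap."""
    clicked_piis, result_piis = pair
    clicked = set(clicked_piis)
    ranks = [i + 1 for i, pii in enumerate(result_piis) if pii in clicked]
    return sum(ranks) / float(len(ranks)) if ranks else None

click_ranks = (clicked_by_query
    .join(results_by_query)        # (query_id, ([clicked PIIs], [result PIIs]))
    .mapValues(rank_of_clicks)
    .filter(lambda kv: kv[1] is not None))

click_ranks.saveAsTextFile("s3://mybucket/click_rank_distribution/")   # hypothetical output location
```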
Generate Reports
• Download Click Rank Distribution for Search Engine (configuration).
• Use Python + Matplotlib to build reports and visualizations.
Outputs from Tool
• Step #4: Download the distribution from S3 and aggregate it into a chart and a spreadsheet.
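A small illustration of Step #4 with pandas and Matplotlib, assuming the distribution has been pulled down from S3 as a two-column CSV (file names are hypothetical):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Per-query Click Ranks downloaded from S3 (hypothetical file name)
df = pd.read_csv("click_ranks.csv", names=["query_id", "click_rank"])

df["click_rank"].hist(bins=50)
plt.xlabel("Click Rank")
plt.ylabel("Number of queries")
plt.title("Click Rank Distribution")
plt.savefig("click_rank_distribution.png")

# Spreadsheet-friendly summary of the same distribution
df["click_rank"].describe().to_csv("click_rank_summary.csv")
```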
How did we do (in our A/B test)?
• Solr PDF downloads were 99.6% of FAST downloads.
• Difference in download rates not statistically significant.
• Decision made to put Solr into production.
[Chart: "SOLR Downloads as % of FAST Downloads" by month, Jan (AB #1) through Apr (AB #4), with separate PDF and HTML series and a reference line marking the level of FAST downloads; the y-axis runs from 90% to 105%.]
Other Uses
Find Search Result Overlap between Configurations
• Measure drift between two search configurations.
• Ordered and unordered comparisons at different top-N cutoffs.
• Result-set overlap increases with N.
• There is a lot of positional overlap in the top N positions across engines.
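A minimal sketch of the overlap measure described above (illustrative, not the production code):

```python
def result_overlap_at_n(results_a, results_b, n, ordered=False):
    """Overlap between the top-N result lists of two engine configurations.

    ordered=True counts only positions where both engines return the same PII;
    ordered=False compares the top-N sets regardless of position.
    """
    if ordered:
        matches = sum(a == b for a, b in zip(results_a[:n], results_b[:n]))
    else:
        matches = len(set(results_a[:n]) & set(results_b[:n]))
    return matches / float(n)
```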
Search Quality as Overlap between Title and Query
• Measures the overlap of title words with query words at various top-N cutoffs.
• Overlap @ N is defined as the total number of query words that appear in the first N titles, normalized by N times the number of words in the query.
• Overlap @ N decreases monotonically with N.
• Solr engines seem to do better at this measure.
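A minimal sketch of Overlap @ N as defined above (illustrative only):

```python
def title_query_overlap_at_n(query, titles, n):
    """Query-word overlap summed over the first N titles,
    normalized by N times the number of words in the query."""
    q_words = set(query.lower().split())
    hits = sum(len(q_words & set(title.lower().split())) for title in titles[:n])
    return hits / float(n * len(q_words))
```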
Click Distribution
• Measures the distribution of clicked positions across the top 50
positions for each engine and compares them.
• In the accompanying chart, FAST showed a higher number of clicks at the top positions than the Solr configurations shown.
Distribution of Publication Dates in Results
• The engine has a temporal component in its ranking algorithm.
• Compares the distribution of publication dates across search
engine configurations to visualize its behavior.
More Uses …
• Measuring impact of query response time on click rank.
• Comparing click rank distributions by document type.
• …
Future Work
• Compute (average/median) Click Rank per user.
• Compute Click Rank per query and user.
• Use these as input to Learning to Rank algorithms.
• Other ideas…
Thank you for listening!
Questions?
My email: sujit.pal@elsevier.com