Extending Solr: Building a Cloud-like Knowledge Discovery Platform
Trey Grainger, CareerBuilder
Overview
CareerBuilder’s Cloud-like Knowledge Discovery Platform
  • Scalable approaches to multi-lingual text analysis (with research study)
      - Multiple Fields vs. Multiple Cores vs. Single Field
  • Custom Scoring
      - Payloads and on-the-fly bucket scoring
      - Implementing a keyword spamming penalty
  • Solr as a Cloud Service
      - Scalable, customizable search for everybody
  • Knowledge Discovery & Data Analytics
My background
Trey Grainger
Search Technology Development Team Lead @ CareerBuilder.com
Relevant Background:
  • Search & Recommendations
  • High-volume, N-tier Architectures
  • NLP, Relevancy Tuning, user group testing & machine learning
Fun Side Project:
  • Founder and Site Architect @ Celiaccess.com
CareerBuilder’s Search Scale
  • Over 1 million new jobs each month
  • Over 40 million resumes
  • ~150 globally distributed search servers (in the U.S., Europe, & Asia)
  • Several thousand unique, dynamically generated indexes
  • Over a million searches an hour
  • >100 million search documents
Job Search
Resume Search
Talent Network Search
Auto-Complete
Geo-spatial Search
Recommendations
  • We classify all content (Jobs, Resumes, etc.) and index the classified content into Solr
  • We use a combination of collaborative filtering and classification techniques
  • We utilize a custom scorer and payloads to apply higher bucket weights to more relevant content
  • Recommendations are real-time and largely driven by search
Job Recommendations
Resume Recommendations
Multi-lingual Analysis
Approach 1: Different Field Per Language
  Advantages:
  • Simple, easiest to implement
  Disadvantages:
  • May require keeping duplicate copies of your text per language
  • Searching across each field (dismax style) slows search down, especially when handling many languages
Approach 2: Different Solr Core Per Language
  Each core has your field defined with a different Analyzer chain specific to that core’s language
  Advantages:
  • Searching can be completely language-agnostic, and the additional overhead of searching more languages simultaneously is negligible
  Disadvantages:
  • Multi-lingual documents require indexing to multiple cores, potentially messing up relevancy and adding complexity
  • You have to write your own language-dependent sharding
  • If you don’t already have distributed search, this adds complexity and overhead
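To make the cost of Approach 1 concrete, here is a minimal sketch of a dismax-style request that searches one field per language. `defType` and `qf` are standard Solr request parameters; the field names `text_en`, `text_fr`, and `text_de` and the helper function are assumptions made only for illustration.

```python
from urllib.parse import urlencode


def multilingual_dismax_params(user_query: str, language_fields: list) -> str:
    """Build a query string that searches every per-language field at once.

    Each extra language adds one more field to `qf`, so Solr has to
    evaluate the query against one more field per language.
    """
    params = {
        "defType": "dismax",               # use the DisMax query parser
        "q": user_query,
        "qf": " ".join(language_fields),   # hypothetical per-language fields
    }
    return urlencode(params)


query_string = multilingual_dismax_params(
    "software engineer", ["text_en", "text_fr", "text_de"]
)
print(query_string)
```

The `qf` list grows with every supported language, which is where the slowdown described above comes from.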
Multi-lingual Analysis
Approach 3: All languages in one field
  Advantages:
  • Only one field needed regardless of number of languages
  • Avoids a field explosion or a Solr core explosion as you scale to handle more languages
  Disadvantages:
  • Can end up with some “noise” in the index if you process most text in lots of languages (especially if stemming and not lemmatizing)
  • Currently requires writing your own Tokenizer or Filter
  Strategy:
  1) Copy the token stream and create a stemmer/lemmatizer for each language
  2) Pass the original into each stemmer/lemmatizer
  3) Stack the outputs of each stemmer/lemmatizer
[Figure: example input and output token streams]
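The three-step strategy above can be sketched outside of Solr as follows. This is not the custom Tokenizer or Filter from the talk: the two normalizers are toy, hypothetical stand-ins for real per-language stemmers or lemmatizers, and "stacking" is modeled by emitting every distinct output at the same token position.

```python
from typing import Callable, Dict, List, Tuple


def toy_english(token: str) -> str:
    """Hypothetical stand-in for an English stemmer: strip a trailing 's'."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def toy_spanish(token: str) -> str:
    """Hypothetical stand-in for a Spanish stemmer: strip 'es' or 's'."""
    for suffix in ("es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def stack_tokens(
    tokens: List[str], analyzers: Dict[str, Callable[[str], str]]
) -> List[Tuple[int, str]]:
    """Run every token through every analyzer and stack the distinct
    outputs at the position of the original token."""
    stacked = []
    for position, token in enumerate(tokens):
        seen = set()
        for normalize in analyzers.values():
            term = normalize(token.lower())
            if term not in seen:  # identical outputs are indexed only once
                seen.add(term)
                stacked.append((position, term))
    return stacked


analyzers = {"en": toy_english, "es": toy_spanish}
print(stack_tokens(["Jobs", "actores"], analyzers))
```

Terms that differ between languages (here "actore" and "actor") both land at position 1, which is the index "noise" listed among the disadvantages.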
Multi-lingual Analysis
Case Study: Stemming vs. Lemmatization
  • Example: dries >> dri (stemming) vs. dries >> dry (lemmatization)
  • Take-away: Lemmatization allows you to greatly increase recall while preserving the precision you lose with stemming
  • e.g., English shows a 92% increase in recall using lemmatization, with minimal impact on precision
[Figure: Measuring Recall Overlap Between Options]
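For readers who want to run a comparison like this on their own data, here is a minimal sketch of the measurements involved. The document IDs are made up for illustration and are not the study's data, and the Jaccard-style `overlap` is only one possible way to define overlap between two result sets.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Return (precision, recall) for one analysis option."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def overlap(a: set, b: set) -> float:
    """Share of documents that two options have in common (Jaccard)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Illustrative document IDs only.
relevant = {1, 2, 3, 4}
stemmed_results = {1, 2, 9}       # one false positive (9)
lemmatized_results = {1, 2, 3}    # no false positives

print(precision_recall(stemmed_results, relevant))
print(precision_recall(lemmatized_results, relevant))
print(overlap(stemmed_results, lemmatized_results))
```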
Custom Scoring
  • Search terms can be boosted differently:
      q=web^2 development^5 AND jobtitle:(software engineer)^10
  • Some fields can be weighted (scored) higher than others,
      e.g. Field1^10, Field2^5, Field3^2, Field4^.01
  • Content within fields can be boosted differently:
      design[1] / engineer[1] / really[] / great[] / job[] / ten[3] / years[3] / experience[3] / careerbuilder[2] / design[2], …
      Field1: bucket=[1] boost=10; Field2: bucket=[2] boost=1.5; Field3: bucket=[] weight=1; Field4: bucket=[3] weight=1.5
  • We can pass in a parameter to Solr at query time specifying the boost to apply to each bucket,
      e.g. …&bucketWeights=1:10;2:1.5;3:1.5
  • You can also do index-time boosting, but this reduces your ability to do query-side relevancy experiments and requires norms to always be on
  • By making all scoring parameters overridable at query time, we are able to do A/B testing to consistently improve our relevancy model
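As a rough illustration of query-time bucket weights, the sketch below parses a value in the `bucketWeights` format shown on the slide and applies it to matched terms tagged with their payload bucket. The function names, the default weight of 1 for terms without a bucket, and the choice to sum the boosts are assumptions for illustration, not CareerBuilder's actual scorer.

```python
from typing import Dict, List, Optional, Tuple


def parse_bucket_weights(param: str) -> Dict[str, float]:
    """Parse a value such as '1:10;2:1.5;3:1.5' into {bucket: boost}."""
    weights = {}
    for pair in param.split(";"):
        if not pair.strip():
            continue  # tolerate a trailing ';'
        bucket, weight = pair.split(":")
        weights[bucket.strip()] = float(weight)
    return weights


def score_matches(
    matches: List[Tuple[str, Optional[str]]],
    weights: Dict[str, float],
    default: float = 1.0,  # assumed weight for terms with no bucket
) -> float:
    """Sum the bucket boost of every matched (term, bucket) pair."""
    return sum(weights.get(bucket, default) for _, bucket in matches)


weights = parse_bucket_weights("1:10;2:1.5;3:1.5")
matches = [("design", "1"), ("design", "2"), ("great", None)]
print(score_matches(matches, weights))
```

Because the weights arrive with the request, two variants of the same query can be scored with different weights without reindexing, which is what makes the A/B testing mentioned above practical.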
Stopping Keyword Spamming
  • We already subclass PayloadTermQuery and tie in custom scoring for our bucket weights
  • For each payload “bucket” (or across all buckets), we can count the number of hits and penalize the score if a particular keyword appears too many times
  • Payload scoring then essentially becomes:
      BucketBoost(payloadBucket) * HitMap(#hitsPerBucket)
  • By adjusting our HitMap function, we can thus generate any kind of relevancy curve for how much each additional term adds to (or subtracts from) the relevancy score for that document
      e.g. Bell curve, Linear, Bi-linear, Linear with drop-off, custom map, etc.
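The formula above can be sketched with one of the curves named on the slide, "linear with drop-off". The cap of 3 hits and the penalty of 0.5 per extra hit are arbitrary values chosen for illustration, and the function names are hypothetical; the real implementation lives inside a PayloadTermQuery subclass rather than in standalone functions.

```python
from typing import Callable


def linear_with_dropoff(hits: int, cap: int = 3, penalty: float = 0.5) -> float:
    """Grow linearly up to `cap` hits, then lose `penalty` per extra hit.

    The result never goes below zero, so heavy keyword stuffing ends up
    contributing nothing for that bucket.
    """
    if hits <= cap:
        return float(hits)
    return max(0.0, cap - penalty * (hits - cap))


def payload_score(
    bucket_boost: float,
    hits: int,
    hit_map: Callable[[int], float] = linear_with_dropoff,
) -> float:
    """BucketBoost(payloadBucket) * HitMap(#hitsPerBucket)."""
    return bucket_boost * hit_map(hits)


for hits in (1, 3, 5, 20):
    print(hits, payload_score(10.0, hits))
```

Swapping in a different `hit_map` callable (bell curve, bi-linear, a lookup table) changes the relevancy curve without touching the rest of the scoring.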
CareerBuilder’s Search Cloud
Goals:
  • Make search easy to use and accessible to all engineers (not just the search team)
  • Allow schema changes without mucking with Solr (on hundreds of servers)
  • Make Solr installs generic and independent of any particular implementation
Creating a virtual search engine 3 Main Cloud Actions: Index, Search, Delete
Creating a virtual search engine Creating a Schema
Creating a virtual search engine
Creating a Document
Processing Results
  • A QueryResult object comes back from the SearchEngine.Search method with all of the main types (search records, facets, meta info, etc.) parsed out into objects
Behind the Scenes:
  • We have a distributed architecture that queues all documents to the appropriate datacenters, feeds the clusters, and load-balances searches between all available clusters for the given search pool.
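The slides name a SearchEngine.Search method and a QueryResult object but do not show the API itself. The sketch below is therefore a purely hypothetical, in-memory stand-in that only illustrates the shape of the three cloud actions (Index, Search, Delete); every method signature and field here is an assumption, and nothing in it talks to Solr.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class QueryResult:
    """Hypothetical result object: parsed records plus a hit count."""
    records: List[Dict[str, str]] = field(default_factory=list)
    total: int = 0


class SearchEngine:
    """Hypothetical facade hiding the search backend from the caller."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._docs: Dict[str, Dict[str, str]] = {}

    def index(self, doc_id: str, fields: Dict[str, str]) -> None:
        """Add or replace a document."""
        self._docs[doc_id] = dict(fields)

    def delete(self, doc_id: str) -> None:
        """Remove a document if it exists."""
        self._docs.pop(doc_id, None)

    def search(self, keyword: str) -> QueryResult:
        """Naive substring match over all field values."""
        needle = keyword.lower()
        records = [
            {"id": doc_id, **fields}
            for doc_id, fields in self._docs.items()
            if any(needle in value.lower() for value in fields.values())
        ]
        return QueryResult(records=records, total=len(records))


engine = SearchEngine("jobs-demo")
engine.index("1", {"title": "Solr Engineer"})
engine.index("2", {"title": "Nurse"})
print(engine.search("solr"))
```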
Knowledge Discovery & Data Analytics
Clustering: Nursing
Clustering: .Net
Clustering: Hyperion Developer
Takeaways
  • Know how your linguistics affect precision and recall and choose wisely; know how to tweak for your domain.
  • A flexible software API that turns Solr into a SaaS-type cloud app can greatly increase agility and adoption of search.
  • Search isn’t just about finding and navigating content… it can be used to learn from and create it, as well.
Contact
Trey Grainger
[email_address]
http://www.careerbuilder.com

