Crowdsourced Query Augmentation through the 
Semantic Discovery of Domain-specific Jargon 
Khalifeh AlJadda, Mohammed Korayem, 
Trey Grainger, Chris Russell 
2014.10.28 - 2014 IEEE International Conference on Big Data - Washington, D.C.
Authors 
• Khalifeh AlJadda 
– Ph.D. Candidate, University of Georgia 
• Mohammed Korayem 
– Ph.D. Candidate, Indiana University 
• Trey Grainger 
– Director of Engineering, Search, CareerBuilder 
• Chris Russell 
– Engineering Lead, Relevancy & Recommendations, CareerBuilder
The problem 
• Traditional search engines (e.g. Lucene, Solr, Elasticsearch) tokenize text 
and find documents containing those tokens and linguistic variations: 
– User’s Search: machine learning 
Tokenization: ["machine", "learning"] => 
Stemming: ["machin", "learn"] 
Final Query: machin AND learn 
This could match a document for a “machinist” who has “learned” something. 
– software architect => … => software AND architect 
• Might identify a building architect requiring knowledge of specialized architecture software 
– account manager => … => account AND manag 
• Will match text such as “need to manage the process and account for any variances” 
• We need a way to identify and search for the meaning of keyword 
phrases, not just the individual text tokens 
– e.g. machine learning = "machine learning" OR "data scientist" OR 
"mahout" OR "svm" OR "neural networks" …
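The false match described on this slide can be reproduced with a minimal sketch. The suffix-stripping stemmer below is a hypothetical toy written for illustration; it is not the Porter or Snowball algorithm that Lucene-based engines ship, and the sample document text is invented.

```python
import re

# Hypothetical suffix list for illustration only; real analyzers use
# algorithms such as Porter or Snowball, which behave differently.
SUFFIXES = ("ing", "ist", "ed", "e", "s")


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase, split on non-letters, and stem each token."""
    return {toy_stem(t) for t in re.findall(r"[a-z]+", text.lower())}


def matches(query, document):
    """AND semantics: every stemmed query token must appear in the document."""
    return analyze(query) <= analyze(document)


query_terms = analyze("machine learning")
false_hit = matches(
    "machine learning",
    "Experienced machinist who has learned CNC programming",
)
```

Here `query_terms` holds the stems `machin` and `learn`, and `false_hit` is true even though the document has nothing to do with machine learning, which is the token-level behavior the slide is criticizing.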
Goals for the proposed system 
• System should be language-agnostic. We don’t want custom NLP 
rules to be required for each language (we support dozens of 
languages). 
• The output of the system should be human-readable. We want to 
show users how we enhance their queries in language they will 
understand so they can modify our enhancements. 
• The system should be very high-precision (since end-users will be 
seeing and critiquing the output) and should be automatically 
updated based upon new data. 
• The system must be fast and scalable, handling billions of search 
log entries (offline) and processing millions of queries an hour in 
real time.
Alternate Techniques 
• Latent Semantic Indexing 
– Approach involves doing dimensionality reduction of text 
across your corpus to derive underlying relationships 
between terms: 
• e.g. java => programming, c# => programming, 
therefore they are related. 
– Pros: 
• Can be run automatically against your corpus of data to 
discover underlying (latent) relationships between 
terms, which requires very little human work 
– Cons: 
• The latent relationships often aren’t represented as a 
human would express them, so it would confuse users 
if they saw this information.
Alternate Techniques 
• Manual building of taxonomies 
– Approach requires hiring human data analysts to manually 
build, correct, and improve taxonomies 
– Pros: 
• High-precision relationships can be mapped, depending 
upon the quality of your hired data analysts 
– Cons: 
• Requires human analysts to comb through hundreds of 
thousands of data points and generate lists of 
important phrases and relationships, which go stale 
• Requires expertise in every supported spoken language 
to rebuild taxonomies per-language
Example use case 
• User’s Query: 
machine learning research and development Portland, OR software engineer 
AND hadoop java 
• Traditional Search Engine Parsing: 
(machine AND learning AND research AND development AND portland) OR (software 
AND engineer AND hadoop AND java) 
• Ideal Parsing: 
"machine learning" AND "research and development" AND "Portland, OR" AND 
"software engineer" AND hadoop AND java 
• Semantically Enhanced Query: 
("machine learning" OR "computer vision" OR "data mining" OR matlab) AND 
("research and development" OR "r&d") AND ("Portland, OR" OR "Portland, 
Oregon") AND ("software engineer" OR "software developer") AND (hadoop OR 
"big data" OR hbase OR hive) AND (java OR j2ee)
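As a rough sketch of the last step, the semantically enhanced query can be assembled from the ideally parsed phrases plus a map of related jargon. The function names `quote` and `expand_query` are hypothetical, the related-term map is copied by hand from the slide rather than mined from logs, and the sketch assumes phrase segmentation has already happened.

```python
def quote(term):
    """Quote anything that is not a single alphanumeric token."""
    return term if term.isalnum() else f'"{term}"'


def expand_query(phrases, related):
    """OR each phrase with its related jargon, then AND the groups together."""
    clauses = []
    for phrase in phrases:
        alternatives = [phrase] + related.get(phrase, [])
        rendered = [quote(a) for a in alternatives]
        if len(rendered) == 1:
            clauses.append(rendered[0])
        else:
            clauses.append("(" + " OR ".join(rendered) + ")")
    return " AND ".join(clauses)


# Phrases as produced by the "ideal parsing" step on the slide.
phrases = [
    "machine learning",
    "research and development",
    "Portland, OR",
    "software engineer",
    "hadoop",
    "java",
]

# Related jargon taken from the slide's example, hard-coded for illustration.
related = {
    "machine learning": ["computer vision", "data mining", "matlab"],
    "research and development": ["r&d"],
    "Portland, OR": ["Portland, Oregon"],
    "software engineer": ["software developer"],
    "hadoop": ["big data", "hbase", "hive"],
    "java": ["j2ee"],
}

enhanced = expand_query(phrases, related)
```

Keeping the expansion as a readable boolean string fits the stated goal that users can see and modify the enhancements.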
Proposed strategy 
1. Mine user search logs for a list of common phrases (“jargon”) 
within our domain. 
2. Perform collaborative filtering on the common jargon (“users who 
searched for that phrase also searched for this phrase”) 
3. Remove noise through several methodologies: 
– Segment search phrases based upon the 
classification of users 
– Consider shared jargon used by multiple 
sides of our two-sided market (e.g. both 
Job Seekers and Recruiters utilize the 
same phrase) 
– Validate that the two “related” phrases 
actually co-occur in real content (e.g. within 
the same job or resume) with some 
frequency
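Two of the noise-removal checks in step 3 can be sketched as a simple filter. This is an illustrative assumption about how the checks might be combined, not the authors' implementation: `filter_related`, the `min_content_count` threshold, and all sample data are hypothetical, and segmentation by user classification is not covered.

```python
def filter_related(candidates, seeker_phrases, recruiter_phrases,
                   content_cooccurrence, min_content_count=1):
    """Keep candidate pairs that pass both checks:

    1. both phrases are used on both sides of the market, and
    2. the pair co-occurs in real content at least min_content_count times.
    """
    kept = []
    for a, b in candidates:
        shared = all(
            p in seeker_phrases and p in recruiter_phrases for p in (a, b)
        )
        count = content_cooccurrence.get(frozenset((a, b)), 0)
        if shared and count >= min_content_count:
            kept.append((a, b))
    return kept


# Invented sample data for illustration.
candidates = [("r.n.", "registered nurse"), ("java", "coffee")]
seeker_phrases = {"r.n.", "registered nurse", "java", "coffee"}
recruiter_phrases = {"r.n.", "registered nurse", "java"}
# Number of jobs or resumes in which both phrases appear together.
content_cooccurrence = {frozenset(("r.n.", "registered nurse")): 42}

kept = filter_related(
    candidates, seeker_phrases, recruiter_phrases, content_cooccurrence
)
```

In this toy data the pair `("java", "coffee")` is dropped because recruiters never search for "coffee" and the two phrases never appear together in content.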
Finding and Scoring Related Jargon 
● Implementation: 
Map/Reduce job which finds and scores similar searches run by the same users 
○ Jane searched for “registered nurse” and “r.n.” and “nurse”. 
○ Zeke searched for “java developer” and “scala” and “jvm” and “j2ee”
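The counting step can be illustrated in memory with a small sketch that mirrors the map and reduce phases. The slides do not show the actual Map/Reduce job, so the function names and the log format (user, phrase) are assumptions.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Search log entries from the slide's example, as (user, phrase) tuples.
search_log = [
    ("jane", "registered nurse"),
    ("jane", "r.n."),
    ("jane", "nurse"),
    ("zeke", "java developer"),
    ("zeke", "scala"),
    ("zeke", "jvm"),
    ("zeke", "j2ee"),
]


def group_by_user(log):
    """Map phase: collect the distinct phrases each user searched for."""
    per_user = defaultdict(set)
    for user, phrase in log:
        per_user[user].add(phrase)
    return per_user


def count_cooccurrences(per_user):
    """Reduce phase: count, per phrase pair, how many users searched both."""
    counts = Counter()
    for phrases in per_user.values():
        # Sorting gives each unordered pair a single canonical key.
        for a, b in combinations(sorted(phrases), 2):
            counts[(a, b)] += 1
    return counts


cooccurrences = count_cooccurrences(group_by_user(search_log))
```

Jane contributes three pairs and Zeke six, each with a count of one; pairs that span the two users, such as "nurse" and "scala", never appear.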
Finding and Scoring Related Jargon 
Similarity Score: 
To do the collaborative filtering, we look at two similarity measures: 
1. Search Co-occurrences - provides raw, real-world correlation 
2. Point-wise Mutual Information - examines probability of terms being 
related by contrasting individual vs joint distributions: 
Final Score: [formula shown as an image on the original slide; not recoverable from the extracted text]
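The point-wise mutual information measure named above has a standard definition, PMI(x, y) = log( P(x, y) / (P(x) P(y)) ), which the sketch below computes from user counts. Estimating the probabilities as fractions of users and using a base-2 logarithm are assumptions; the slides do not state either, and the way PMI is combined with co-occurrence into the final score is not reproduced here.

```python
import math


def pmi(n_x, n_y, n_xy, n_users):
    """Point-wise mutual information of phrases x and y.

    n_x, n_y : number of users who searched for x, for y
    n_xy     : number of users who searched for both
    n_users  : total number of users in the log
    """
    if n_xy == 0:
        return float("-inf")  # the phrases were never searched together
    p_x = n_x / n_users
    p_y = n_y / n_users
    p_xy = n_xy / n_users
    return math.log2(p_xy / (p_x * p_y))


# Always searched together by the same 10 of 100 users: strongly related.
strong = pmi(10, 10, 10, 100)

# Joint frequency equals the product of the marginals: independent.
independent = pmi(50, 50, 25, 100)
```

A PMI of zero means the two phrases co-occur exactly as often as chance would predict, while positive values indicate that users pair them more often than chance.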
Example output 
Cashier => retail, retail cashier, customer service, cashiers 
CDL => cdl driver, cdl a, driver 
Data Scientist => machine learning, big data
Final System Architecture
Follow-on work: Differentiating related Jargon 
Synonyms: cpa => Certified Public Accountant 
rn => Registered Nurse 
r.n. => Registered Nurse 
Ambiguous Terms*: driver => driver (trucking) ~80% 
driver => driver (software) ~20% 
Related Terms: r.n. => nursing, bsn 
hadoop => mapreduce, hive, pig 
*disambiguated based upon user and query context
Applicability of Methodology 
• Can be used to discover domain-specific jargon across most 
domains (not just employment search) 
• Can be used to discover related jargon in any language since 
the jargon and relationships are crowd-sourced at the phrase 
level 
• Intersecting input from both sides of a two-sided market, which 
yields the high-precision results, is optional. If you only have 
a single source of user queries, you will just get lower-precision 
mappings. 
• The only absolute requirement is sufficient search log history 
mapping users to multiple search phrases
Q&A
Semantic Search “under the hood”
Contact Info 
▪ Trey Grainger 
trey.grainger@careerbuilder.com 
@treygrainger 
Other presentations: 
http://www.treygrainger.com http://solrinaction.com 
Yes, WE ARE HIRING @CareerBuilder. Come talk with me if you are interested…
