How to Build a Semantic
Search System
Trey Grainger
SVP of Engineering, Lucidworks
@treygrainger
#Activate18 #ActivateSearch
About Me
Trey Grainger
SVP of Engineering
• Previously Director of Engineering @ CareerBuilder
• Georgia Tech – MBA, Management of Technology
• Furman University – BA, Computer Science, Business, & Philosophy
• Stanford University – Information Retrieval & Web Search
Other fun projects:
• Co-author of Solr in Action, plus numerous research papers
• Advisor to Presearch, the decentralized search engine
• Lucene / Solr contributor
Agenda
•Philosophy of Semantic Search
•Technology for Semantic Search
•Q&A / Demo (time permitting)
Lucidworks Fusion powers search for the brightest companies in the
world.
Philosophy
of Semantic Search
most often used in
reference to
“free text”
My Three Philosophical Assertions
1) Unstructured data is actually “hyper-structured” data. It is a
graph that contains much more structure than typical
“structured data.”
2) That graph is very rich, but is a compression of meaning into a
lossy format. Much of data science is essentially the
decompression from this lossy format into a reconstituted form.
3) Most Important: Every instance of a word or phrase you ever
encounter has a unique meaning.
Assertion 1:
Unstructured data is actually
“hyper-structured” data. It is a
graph that contains much
more structure than typical
“structured data.”
Structured Data

Employees Table
id      name            company   start_date
lw100   Trey Grainger   1234      2016-02-01
dis2    Mickey Mouse    9123      1928-11-28
tsla1   Elon Musk       5678      2003-07-01

Companies Table
id     name         start_date
1234   Lucidworks   2016-02-01
5678   Tesla        1928-11-28
9123   Disney       2003-07-01

Annotations on the slide: Discrete Values, Continuous Values, Foreign Key (the Employees company column references the Companies id column).
Unstructured Data
Trey Grainger works at Lucidworks.
He is speaking at Activate 2018. #ActivateSearch
(Activate) is being held in Montreal October 15-18,
2018. Trey got his masters from Georgia Tech.
Unstructured Data (Trey’s Voicemail)
Trey Grainger works for Lucidworks.
He is speaking at the Activate 2018.
#ActivateSearch
(Activate) is being held in Montreal
April 12-14, 2018.
Trey got his masters degree from
Georgia Tech.

Successive slides annotate this same text with:
• Foreign Key?
• Fuzzy Foreign Key? (Entity Resolution)
• Fuzzier Foreign Key? (metadata, latent features)
• Not so fast!
• Giant Graph of Relationships...
Assertion 1 (Summary):
Unstructured data is actually
“hyper-structured” data. It is a
graph that contains much
more structure than typical
“structured data.”
Assertion 2:
That graph is very rich, but is a
compression of meaning into a
lossy format. Much of data science
is essentially the decompression
from this lossy format into a
reconstituted form.
Semantic Data Encoded into Free Text Content
How do we easily harness this
“semantic graph” of relationships
within unstructured information?
Search Engines are really good at
querying across character sequences,
term sequences, and documents
Example Queries:
c?o                          matches: CTO, CEO, CFO, …
"VP Engineering"~2           matches: “VP of Engineering”, “VP Engineering”, “Engineering VP”, “VP of Infrastructure Engineering”
(Microsoft OR MS) AND Word   matches: “MS Word”, “Microsoft Word”
Matching queries to documents
/solr/collection/select/?q=apache solr

Term     Documents
…        …
apache   doc1, doc3, doc4, doc5
…        …
hadoop   doc2, doc4, doc6
…        …
solr     doc1, doc3, doc4, doc7, doc8
…        …

[Venn diagram: apache only = doc5; solr only = doc7, doc8; apache and solr = doc1, doc3, doc4]
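The lookup on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not how Solr is implemented: the postings are the ones shown on the slide, and the function names `match_all` and `match_any` are made up for this example. Whether a multi-term query behaves like AND or OR depends on how the search engine is configured, so both are shown.

```python
# Postings lists copied from the slide; the elided terms ("…") are omitted.
inverted_index = {
    "apache": {"doc1", "doc3", "doc4", "doc5"},
    "hadoop": {"doc2", "doc4", "doc6"},
    "solr": {"doc1", "doc3", "doc4", "doc7", "doc8"},
}


def postings_for(query: str) -> list[set[str]]:
    """Look up the postings set for each whitespace-separated query term."""
    return [inverted_index.get(term, set()) for term in query.lower().split()]


def match_all(query: str) -> list[str]:
    """Documents containing every query term (the overlap in the Venn diagram)."""
    postings = postings_for(query)
    return sorted(set.intersection(*postings)) if postings else []


def match_any(query: str) -> list[str]:
    """Documents containing at least one query term (the whole Venn diagram)."""
    postings = postings_for(query)
    return sorted(set.union(*postings)) if postings else []


both = match_all("apache solr")
either = match_any("apache solr")
```

Here `both` holds the documents in the overlap of the two circles, and `either` holds every document in either circle.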
Documents

id: 1
job_title: Software Engineer
desc: software engineer at a great company
skills: .Net, C#, java

id: 2
job_title: Registered Nurse
desc: a registered nurse at hospital doing hard work
skills: oncology, phlebotomy

id: 3
job_title: Java Developer
desc: a software engineer or a java engineer doing work
skills: java, scala, hibernate

Docs-Terms Forward Index

field       doc   terms
desc        1     a, at, company, engineer, great, software
desc        2     a, at, doing, hard, hospital, nurse, registered, work
desc        3     a, doing, engineer, java, or, software, work
job_title   1     Software Engineer
…           …     …

Terms-Docs Inverted Index

field       term             doc   pos
desc        a                1     4
                             2     1
                             3     1, 5
            at               1     3
                             2     4
            company          1     6
            doing            2     6
                             3     8
            engineer         1     2
                             3     3, 7
            great            1     5
            hard             2     7
            hospital         2     5
            java             3     6
            nurse            2     3
            or               3     4
            registered       2     2
            software         1     1
                             3     2
            work             2     8
                             3     9
job_title   java developer   3     1
…           …                …     …

Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.
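Both indexes can be derived from the documents in a single pass. The sketch below does this for the desc field of the three example documents. It assumes simple whitespace tokenization and 1-based positions, and `build_indexes` is a name invented for this illustration rather than anything from Lucene or Solr.

```python
from collections import defaultdict

# The desc field of the three example documents.
docs = {
    1: "software engineer at a great company",
    2: "a registered nurse at hospital doing hard work",
    3: "a software engineer or a java engineer doing work",
}


def build_indexes(docs):
    """Build a docs->terms forward index and a terms->docs->positions inverted index."""
    forward = {}
    inverted = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        tokens = text.lower().split()  # assumption: whitespace tokenization
        forward[doc_id] = sorted(set(tokens))
        for pos, term in enumerate(tokens, start=1):  # 1-based positions
            inverted[term][doc_id].append(pos)
    return forward, inverted


forward, inverted = build_indexes(docs)
engineer_postings = dict(inverted["engineer"])
```

The forward index answers "which terms are in this document?", while the inverted index answers "which documents, and at which positions, contain this term?".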
Semantic Knowledge Graph
DOI: 10.1109/DSAA.2016.51
Conference: 2016 IEEE International Conference on
Data Science and Advanced Analytics (DSAA)
Graph Traversal
[Figure, Data Structure View: alternating Inverted Index Lookup and Forward Index Lookup steps connect skill nodes (Java, Scala, Hibernate, Oncology) to documents (doc 1 … doc 6) and on to job_title nodes (Software Engineer, Data Scientist, Java Developer, …).]
[Figure, Graph View: the nodes Java, Hibernate, and Scala are linked by has_related_skill edges, and are linked to Java Developer, Software Engineer, and Data Scientist by has_related_job_title edges.]
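One hop of the traversal in the figure can be sketched as an inverted index lookup (value to documents) followed by a forward index lookup (documents to values in another field). The four documents below are hypothetical, and the helper names `field_values` and `related` are invented for this illustration. It shows the idea only and is not the Semantic Knowledge Graph implementation.

```python
from collections import Counter, defaultdict

# Hypothetical documents; the dict itself doubles as the forward index.
docs = {
    1: {"skills": ["java", "scala"], "job_title": "software engineer"},
    2: {"skills": ["oncology"], "job_title": "registered nurse"},
    3: {"skills": ["java", "hibernate"], "job_title": "java developer"},
    4: {"skills": ["java", "scala", "hibernate"], "job_title": "java developer"},
}


def field_values(doc, field):
    """Return a field's values as a list, whether it is single- or multi-valued."""
    value = doc[field]
    return value if isinstance(value, list) else [value]


# Inverted index: (field, value) -> set of document ids.
inverted = defaultdict(set)
for doc_id, doc in docs.items():
    for field in doc:
        for value in field_values(doc, field):
            inverted[(field, value)].add(doc_id)


def related(from_field, from_value, to_field):
    """Traverse one hop: value -> matching docs -> co-occurring values, with counts."""
    counts = Counter()
    for doc_id in inverted.get((from_field, from_value), set()):
        counts.update(field_values(docs[doc_id], to_field))
    return counts


job_titles_for_java = related("skills", "java", "job_title")
skills_near_java = related("skills", "java", "skills")
```

The raw co-occurrence counts returned here are not yet a relevance score. Weighting the edges is a separate step.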
Search engines also do relevancy ranking

$$\mathrm{Score}(q, d) = \sum_{t \in q} \mathrm{idf}(t) \cdot \frac{\mathrm{tf}(t \text{ in } d) \cdot (k + 1)}{\mathrm{tf}(t \text{ in } d) + k \cdot \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}$$

Where:
t = term; d = document; q = query; i = index

$$\mathrm{tf}(t \text{ in } d) = \mathrm{numTermOccurrencesInDocument}^{1/2}$$

$$\mathrm{idf}(t) = 1 + \log\left(\frac{\mathrm{numDocs}}{\mathrm{docFreq} + 1}\right)$$

$$|d| = \sum_{t \in d} 1$$

$$\mathrm{avgdl} = \frac{\sum_{d \in i} |d|}{\sum_{d \in i} 1}$$

k = Free parameter. Usually ~1.2 to 2.0. Increases term frequency saturation point.
b = Free parameter. Usually ~0.75. Increases impact of document normalization.
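The scoring formula above can be transcribed directly into Python. This sketch follows the definitions as written on the slide (including the square-root tf and the 1 + log idf), which may differ from what a given Lucene or Solr version computes. The function name `score` and the tiny corpus are assumptions for illustration.

```python
import math
from collections import Counter


def score(query_terms, doc_tokens, corpus, k=1.2, b=0.75):
    """Score one document against a query using the slide's definitions.

    corpus is a list of token lists, one per document in the index.
    """
    num_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / num_docs
    counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)  # |d|: number of term occurrences in the document

    total = 0.0
    for t in query_terms:
        tf = math.sqrt(counts[t])  # numTermOccurrencesInDocument ^ 1/2
        doc_freq = sum(1 for d in corpus if t in d)
        idf = 1 + math.log(num_docs / (doc_freq + 1))
        length_norm = k * (1 - b + b * doc_len / avgdl)
        total += idf * (tf * (k + 1)) / (tf + length_norm)
    return total


corpus = [
    "software engineer at a great company".split(),
    "a registered nurse at hospital doing hard work".split(),
    "a software engineer or a java engineer doing work".split(),
]
java_scores = [score(["java"], doc, corpus) for doc in corpus]
```

Only the third document contains "java", so it is the only one with a non-zero score for that query.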
Scoring of Node Relationships (Edge Weights)
Foreground vs. Background Analysis
Every term is scored against its context. The more
commonly the term appears within its foreground
context versus its background context, the more
relevant it is to the specified foreground context.
$$z = \frac{\mathrm{countFG}(x) - \mathrm{totalDocsFG} \cdot \mathrm{probBG}(x)}{\sqrt{\mathrm{totalDocsFG} \cdot \mathrm{probBG}(x) \cdot (1 - \mathrm{probBG}(x))}}$$
{ "type":"keywords", "values":[
{ "value":"hive", "relatedness":0.9773, "popularity":369 },
{ "value":"java", "relatedness":0.9236, "popularity":15653 },
{ "value":".net", "relatedness":0.5294, "popularity":17683 },
{ "value":"bee", "relatedness":0.0, "popularity":0 },
{ "value":"teacher", "relatedness":-0.2380, "popularity":9923 },
{ "value":"registered nurse", "relatedness":-0.3802, "popularity":27089 } ] }
We are essentially boosting terms which are more related to some known feature
(and ignoring terms which are equally likely to appear in the background corpus)
Foreground Query: "Hadoop"
[Figure annotation: the results above run from positively related (+) to negatively related (-).]
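The z-score for foreground vs. background analysis is simple to compute once the counts are known. The sketch below implements only the z formula from this slide. The relatedness values in the example output fall between -1 and 1, so some further normalization is presumably applied, but it is not shown on the slide and is not attempted here. The function name and the sample counts are assumptions.

```python
import math


def z_score(count_fg: int, total_docs_fg: int, prob_bg: float) -> float:
    """How far a term's foreground count is from what the background predicts.

    count_fg:      foreground documents containing the term, countFG(x)
    total_docs_fg: number of documents in the foreground set, totalDocsFG
    prob_bg:       probability that a background document contains the term, probBG(x)
    """
    expected = total_docs_fg * prob_bg
    std_dev = math.sqrt(total_docs_fg * prob_bg * (1 - prob_bg))
    if std_dev == 0:
        return 0.0  # term never (or always) occurs in the background: no signal
    return (count_fg - expected) / std_dev


# Hypothetical counts: 1,000 foreground docs, term present in 1% of the background.
over_represented = z_score(count_fg=100, total_docs_fg=1000, prob_bg=0.01)
as_expected = z_score(count_fg=10, total_docs_fg=1000, prob_bg=0.01)
under_represented = z_score(count_fg=1, total_docs_fg=1000, prob_bg=0.01)
```

A positive score means the term appears more often in the foreground than the background predicts, a score near zero means it is equally likely in both, and a negative score means it is under-represented.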
Assertion 2 (Summary):
That graph is very rich, but is a
compression of meaning into a
lossy format. Much of data science
is essentially the decompression
from this lossy format into a
reconstituted form.
Assertion 3:
Every instance of a word or phrase you
ever encounter has a unique meaning.
Thought Exercise
What do you think of when I say the
word “driver”?
What about “architect”?
Ambiguity
Example Related Keywords (representing multiple meanings)
driver: truck driver, linux, windows, courier, embedded, cdl, delivery
architect: autocad drafter, designer, enterprise architect, java architect, designer, architectural designer, data architect, oracle, java, architectural drafter, autocad, drafter, cad, engineer
…
Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
Use Case: Query Disambiguation
Disambiguated meanings (represented as term vectors)
Example Related Keywords (Disambiguated Meanings)
architect
  1: enterprise architect, java architect, data architect, oracle, java, .net
  2: architectural designer, architectural drafter, autocad, autocad drafter, designer, drafter, cad, engineer
driver
  1: linux, windows, embedded
  2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier
designer
  1: design, print, animation, artist, illustrator, creative, graphic artist, graphic, photoshop, video
  2: graphic, web designer, design, web design, graphic design, graphic designer
  3: design, drafter, cad designer, draftsman, autocad, mechanical designer, proe, structural designer, revit
…
Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
Using the disambiguated meanings
In a situation where a user searches for an ambiguous phrase, what information can we
use to pick the correct underlying meaning?
1. Any pre-existing knowledge about the user:
• User is a software engineer
• User has previously run searches for “c++” and “linux”
2. Context within the query:
• User searched for windows AND driver vs. courier OR driver
3. If all else fails (and there is no context), use the most commonly occurring meaning.
driver
  1: linux, windows, embedded
  2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier
Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
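The selection logic described above can be sketched as an overlap test between each sense's term vector and whatever context is available (prior searches, other query terms). This is a simplified illustration, not the method from the cited paper. The name `pick_sense` is made up, and the sketch assumes the senses are listed from most to least common so that ties fall back to the first one.

```python
# Sense term vectors for "driver", taken from the slide.
# Assumption: senses are ordered from most to least commonly occurring.
SENSES = {
    "driver": [
        {"linux", "windows", "embedded"},
        {"truck driver", "cdl driver", "delivery driver",
         "class b driver", "cdl", "courier"},
    ],
}


def pick_sense(term, context_terms, senses=SENSES):
    """Return the index of the sense sharing the most terms with the context.

    Returns None for terms with no known senses. With no overlapping context,
    max() keeps the first sense, i.e. the assumed most common meaning.
    """
    candidates = senses.get(term)
    if not candidates:
        return None
    context = {c.lower() for c in context_terms}
    overlaps = [len(sense & context) for sense in candidates]
    return max(range(len(candidates)), key=lambda i: overlaps[i])


software_sense = pick_sense("driver", ["windows"])   # context within the query
trucking_sense = pick_sense("driver", ["courier"])
fallback_sense = pick_sense("driver", [])            # no context at all
```

The context list could equally be built from a user's profile or search history, as in point 1 above.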
Thought Exercise
What do you think of when I say the
word “Facebook”?
Every term or phrase is a
Context-dependent cluster of
meaning with an ambiguous label
What does “love” mean?
http://localhost:8983/solr/thesaurus/skg
What does “love” mean in the context of “hug”?
http://localhost:8983/solr/thesaurus/skg
What does “love” mean in the context of “child”?
http://localhost:8983/solr/thesaurus/skg
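The demo above uses a custom /skg request handler whose request format is not shown on the slides. A comparable foreground vs. background query can be expressed with the relatedness() aggregation in Solr's JSON Facet API. The sketch below only builds the request parameters and does not send them. The field name `body`, the facet label `related_terms`, and the helper name are assumptions for illustration.

```python
import json


def relatedness_request(term, context=None, field="body", limit=10):
    """Build Solr query parameters asking which terms are most related to `term`,
    optionally restricted to documents that also mention `context`."""
    fore = f'{field}:"{term}"'
    if context:
        fore += f' AND {field}:"{context}"'
    facet = {
        "related_terms": {
            "type": "terms",
            "field": field,
            "limit": limit,
            "sort": {"r": "desc"},
            # relatedness() compares the foreground set to the background set
            "facet": {"r": "relatedness($fore,$back)"},
        }
    }
    return {
        "q": "*:*",
        "rows": 0,
        "fore": fore,      # foreground query
        "back": "*:*",     # background query: the whole collection
        "json.facet": json.dumps(facet),
    }


love_alone = relatedness_request("love")
love_with_hug = relatedness_request("love", context="hug")
```

Changing only the context ("hug" vs. "child") changes the foreground set, which is what shifts the related terms returned for the same word.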
My Three Assertions (Recap)
1) Unstructured data is actually “hyper-structured” data. It is a
graph that contains much more structure than typical
“structured data.”
2) That graph is very rich, but is a compression of meaning into a
lossy format. Much of data science is essentially the
decompression from this lossy format into a reconstituted form.
3) Most Important: Every instance of a word or phrase you ever
encounter has a unique meaning.
Technology
for Semantic Search
So why all the philosophy?
Because it’s much more important to intuitively understand the
kinds of problem we’re trying to solve with Semantic Search than to
jump head-first into the Solution.
Because otherwise we may build the wrong thing, which can
sometimes be worse than not doing anything.
And once you have an intuitive sense of the problems you need to
solve, you can confidently use the tools I’m about to describe to
build the right solution for your specific domain.
So what’s the end goal here?
User’s Query:
machine learning research and development Portland, OR software
engineer AND hadoop, java
Traditional Query Parsing:
(machine AND learning AND research AND development AND portland)
OR (software AND engineer AND hadoop AND java)
Semantic Query Parsing:
"machine learning" AND "research and development" AND "Portland, OR"
AND "software engineer" AND hadoop AND java
Semantically Expanded Query:
("machine learning"^10 OR "data scientist" OR "data mining" OR "artificial intelligence")
AND ("research and development"^10 OR "r&d")
AND ("Portland, OR"^10 OR "Portland, Oregon" OR {!geofilt pt=45.512,-122.676 d=50 sfield=geo})
AND ("software engineer"^10 OR "software developer")
AND (hadoop^10 OR "big data" OR hbase OR hive) AND (java^10 OR j2ee)
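The last step, turning parsed phrases into a semantically expanded query, can be sketched as a lookup against a table of related terms. In the talk those related terms are learned by the system. Here they are hard-coded from the example above, and the names `EXPANSIONS` and `expand` are invented for this illustration. The geo filter is left out.

```python
# Related terms copied from the expanded query on the slide (hard-coded here).
EXPANSIONS = {
    "machine learning": ["data scientist", "data mining", "artificial intelligence"],
    "research and development": ["r&d"],
    "software engineer": ["software developer"],
    "hadoop": ["big data", "hbase", "hive"],
    "java": ["j2ee"],
}


def quote(term: str) -> str:
    """Quote multi-word phrases so they are matched as phrases."""
    return f'"{term}"' if " " in term else term


def expand(phrases, expansions=EXPANSIONS, boost=10):
    """Boost each original phrase and OR it with its related terms."""
    clauses = []
    for phrase in phrases:
        alternatives = [f"{quote(phrase)}^{boost}"]
        alternatives += [quote(alt) for alt in expansions.get(phrase, [])]
        if len(alternatives) > 1:
            clauses.append("(" + " OR ".join(alternatives) + ")")
        else:
            clauses.append(alternatives[0])
    return " AND ".join(clauses)


expanded = expand(["software engineer", "hadoop", "java"])
```

Boosting the original phrase keeps exact matches ranked above documents that match only through an expansion.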
Semantic Search Components:
• Apache Solr
• Solr Text Tagger
• Semantic Knowledge Graph
• Statistical Phrase Identifier
• Fusion Semantic Query Pipelines
• Fusion AI Synonyms Job
• Fusion AI Token & Phrase Spell Correction Job
• Fusion AI Head/Tail Analysis Job
• Fusion AI Phrase Identification Job
• Fusion Query Rules Engine
In 2018, Lucidworks has added the
following capabilities to Solr:
• Solr Text Tagger
• Semantic Knowledge Graph
• Statistical Phrase Identifier
which all integrate seamlessly
with the following in Fusion:
• Fusion Semantic Query Pipelines
• Fusion AI Synonyms Job
• Fusion AI Token & Phrase Spell Correction Job
• Fusion AI Head/Tail Analysis Job
• Fusion AI Phrase Identification Job
• Fusion Query Rules Engine
Through these tools, Fusion self-learns domain-specific semantic relationships
and enables domain experts to easily accept or adjust the built-in AI:
completely deferring to Fusion’s AI, trusting it above a certain confidence level,
or even manually approving every suggestion.
Fusion AI Semantic Search Jobs
Semantic Query Pipeline
Released in Solr 7.5
Semantic Query Parsing
Identification of phrases in queries using two approaches, plus a merge step:
1) Check a dictionary of known terms that is continuously built,
cleaned, and refined based upon common inputs from
interactions with real users of the system. We use the Solr Text
Tagger (already covered) for this at query time.*
2) Also invoke a probabilistic query parser to dynamically
identify unknown phrases using statistics from a corpus of data
(language model).
3) Use a final algorithm to choose the best merge when the two
approaches disagree.
*K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation
through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.
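Step 1, the dictionary lookup, can be approximated with a greedy longest-match scan over the query tokens. This is a stand-in for illustration only. It is not how the Solr Text Tagger works internally, and the dictionary contents, the `max_len` limit, and the function name are assumptions.

```python
# Hypothetical dictionary of known phrases (in practice, mined from user behavior).
KNOWN_PHRASES = {"machine learning", "software engineer", "research and development"}


def tag_known_phrases(query, known=KNOWN_PHRASES, max_len=4):
    """Split a query into known phrases and leftover single tokens.

    At each position, try the longest candidate first and fall back to
    shorter ones; a single token is always accepted.
    """
    tokens = query.lower().split()
    result = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if n == 1 or candidate in known:
                result.append(candidate)
                i += n
                break
    return result


tagged = tag_known_phrases("machine learning research and development software engineer java")
```

Tokens that match no dictionary entry, like "java" here, are passed through unchanged so that the probabilistic parser in step 2 can still consider them.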
Probabilistic Query Parsing
Goal: given a query, predict which
combinations of keywords should be
combined together as phrases
Example:
senior java developer hadoop
Possible Parsings:
senior, java, developer, hadoop
"senior java", developer, hadoop
"senior java developer", hadoop
"senior java developer hadoop"
"senior java", "developer hadoop"
senior, "java developer", hadoop
senior, java, "developer hadoop"
Source: Trey Grainger, “Searching on Intent: Knowledge Graphs, Personalization,
and Contextual Disambiguation”, Bay Area Search Meetup, November 2015.
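The candidate parsings can be generated mechanically: with n keywords there are n - 1 gaps, and each gap is either a phrase boundary or not, giving 2^(n-1) contiguous segmentations (eight for a four-word query). The sketch below enumerates them and picks one using a deliberately crude score. The phrase counts and the threshold are hypothetical stand-ins for real corpus statistics, and none of this is the actual parser from the talk.

```python
from itertools import product


def segmentations(tokens):
    """Yield every way to split tokens into contiguous phrases."""
    if not tokens:
        return
    for breaks in product([True, False], repeat=len(tokens) - 1):
        parsing, current = [], [tokens[0]]
        for token, is_break in zip(tokens[1:], breaks):
            if is_break:
                parsing.append(" ".join(current))
                current = [token]
            else:
                current.append(token)
        parsing.append(" ".join(current))
        yield parsing


# Hypothetical corpus counts for multi-word phrases.
PHRASE_COUNTS = {"java developer": 900, "senior java": 40, "developer hadoop": 5}


def phrase_score(parsing, counts=PHRASE_COUNTS, threshold=100):
    """Reward multi-word phrases seen often in the corpus, penalize the rest."""
    total = 0
    for phrase in parsing:
        if " " in phrase:
            total += 1 if counts.get(phrase, 0) >= threshold else -1
    return total


all_parsings = list(segmentations("senior java developer hadoop".split()))
best = max(all_parsings, key=phrase_score)
```

With these made-up counts, `best` groups only "java developer" as a phrase. A real parser would score candidates from language-model statistics rather than a fixed threshold.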
Released in Solr 7.5
Released in Solr 7.4
A few last thoughts to leave
you with…
Words of Advice:
You can’t improve what
you can’t measure.
Importance of Feedback Loops
[Cycle: User searches → User sees results → User takes an action → Users’ actions inform system improvements]
Going beyond semantic search…
[Diagram labels: Traditional Keyword Search, Recommendations, Semantic Search, User Intent, Personalized Search, Augmented Search, Domain-aware Matching]
Thank you!
Trey Grainger
trey.grainger@lucidworks.com
@treygrainger
#Activate18 #ActivateSearch
http://solrinaction.com
Discount code: ctwactivate18
Other presentations:
http://www.treygrainger.com

How to Build a Semantic Search System

  • 1. How to Build a Semantic Search System Trey Grainger SVP of Engineering, Lucidworks @treygrainger #Activate18 #ActivateSearch
  • 2. Trey Grainger SVP of Engineering • Previously Director of Engineering @ CareerBuilder • Georgia Tech – MBA, Management of Technology • Furman University – BA, Computer Science, Business, & Philosophy • Stanford University – Information Retrieval & Web Search Other fun projects: • Co-author of Solr in Action, plus numerous research papers • Advisor to Presearch, the decentralized search engine • Lucene / Solr contributor About Me
  • 3. Agenda •Philosophy of Semantic Search •Technology for Semantic Search •Q&A / Demo (time permitting)
  • 4. Lucidworks Fusion powers search for the brightest companies in the world.
  • 7. most often used in reference to “free text”
  • 8. My Three Philosophical Assertions 1) Unstructured data is actually “hyper-structured” data. It is a graph that contains much more structure than typical “structured data.” 2) That graph is very rich, but is a compression of meaning into a lossy format. Much of data science is essentially the decompression from this lossy format into a reconstituted form. 3) Most Important: Every instance of a word or phrase you ever encounter has a unique meaning.
  • 9. Assertion 1: Unstructured data is actually “hyper-structured” data. It is a graph that contains much more structure than typical “structured data.”
  • 10. Structured Data Employees Table id name company start_date lw100 Trey Grainger 1234 2016-02-01 dis2 Mickey Mouse 9123 1928-11-28 tsla1 Elon Musk 5678 2003-07-01 Companies Table id name start_date 1234 Lucidworks 2016-02-01 5678 Tesla 1928-11-28 9123 Disney 2003-07-01 Discrete Values Continuous Values Foreign Key
  • 11. Unstructured Data Trey Grainger works at Lucidworks. He is speaking at Activate 2018. #ActivateSearch (Activate) is being held in Montreal October 15-18, 2018. Trey got his masters from Georgia Tech.
  • 12. Trey Grainger works for Lucidworks. He is speaking at the Activate 2018. #ActivateSearch (Activate) is being held in Montreal April 12-14, 2018. Trey got his masters degree from Georgia Tech. Trey’s Voicemail Unstructured Data
  • 13. Trey Grainger works for Lucidworks. He is speaking at the Activate 2018. #ActivateSearch (Activate) is being held in Montreal April 12-14, 2018. Trey got his masters degree from Georgia Tech. Trey’s Voicemail Foreign Key?
  • 14. Trey Grainger works for Lucidworks. He is speaking at the Activate 2018. #ActivateSearch (Activate) is being held in Montreal April 12-14, 2018. Trey got his masters degree from Georgia Tech. Trey’s Voicemail Fuzzy Foreign Key? (Entity Resolution)
  • 15. Trey Grainger works for Lucidworks. He is speaking at the Activate 2018. #ActivateSearch (Activate) is being held in Montreal April 12-14, 2018. Trey got his masters degree from Georgia Tech. Trey’s Voicemail Fuzzier Foreign Key? (metadata, latent features)
  • 16. Fuzzier Foreign Key? (metadata, latent features) Trey Grainger works for Lucidworks. He is speaking at the Activate 2018. #ActivateSearch (Activate) is being held in Montreal April 12-14, 2018. Trey got his masters degree from Georgia Tech. Trey’s Voicemail Not so fast!
  • 19. Giant Graph of Relationships... Trey Grainger works for Lucidworks. He is speaking at the Activate 2018. #ActivateSearch (Activate) is being held in Montreal April 12-14, 2018. Trey got his masters degree from Georgia Tech. Trey’s Voicemail
  • 20. Assertion 1 (Summary): Unstructured data is actually “hyper-structured” data. It is a graph that contains much more structure than typical “structured data.”
  • 21. Assertion 2: That graph is very rich, but is a compression of meaning into a lossy format. Much of data science is essentially the decompression from this lossy format into a reconstituted form.
  • 22. 01 Semantic Data Encoded into Free Text Content
  • 23. How do we easily harness this “semantic graph” of relationships within unstructured information?
  • 24. Search Engines are really good at querying across character sequences, term sequences, and documents Example Queries: c?o CTO, CEO, CFO, … "VP Engineering"~2 “VP of Engineering”, VP Engineering” ,“Engineering VP”, “VP of Infrastructure Engineering” (Microsoft OR MS) AND Word “MS Word”, “Microsoft Word”
  • 25. /solr/collection/select/?q=apache solr Term Documents … … apache doc1, doc3, doc4, doc5 … hadoop doc2, doc4, doc6 … … solr doc1, doc3, doc4, doc7, doc8 … … doc5 doc7 doc8 doc1 doc3 doc4 solr apache apache solr Matching queries to documents
  • 26. id: 1 job_title: Software Engineer desc: software engineer at a great company skills: .Net, C#, java id: 2 job_title: Registered Nurse desc: a registered nurse at hospital doing hard work skills: oncology, phlebotemy id: 3 job_title: Java Developer desc: a software engineer or a java engineer doing work skills: java, scala, hibernate field doc term desc 1 a at company engineer great software 2 a at doing hard hospital nurse registered work 3 a doing engineer java or software work job_title 1 Software Engineer … … … Terms-Docs Inverted IndexDocs-Terms Forward IndexDocuments Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith.“The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016. Knowledge Graph field term postings list doc pos desc a 1 4 2 1 3 1, 5 at 1 3 2 4 company 1 6 doing 2 6 3 8 engineer 1 2 3 3, 7 great 1 5 hard 2 7 hospital 2 5 java 3 6 nurse 2 3 or 3 4 registered 2 2 software 1 1 3 2 work 2 10 3 9 job_title java developer 3 1 … … … …
  • 28. Graph Traversal. Data Structure View: skill terms (Java, Scala, Hibernate, Oncology) and job_title terms (Software Engineer, Data Scientist, Java Developer, …) are connected through the documents that contain them (doc 1 to doc 6) by alternating Inverted Index Lookups (term to docs) and Forward Index Lookups (doc to terms). Graph View: Java, Scala, Hibernate, Java Developer, Software Engineer, and Data Scientist appear as nodes joined by has_related_skill and has_related_job_title edges. Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA). DOI: 10.1109/DSAA.2016.51.
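As a rough sketch of the traversal idea on this slide: start from a term, use an inverted index lookup to find the documents containing it, then use a forward index lookup on those documents to find co-occurring values in another field. The documents below are hypothetical (the slide does not state which skills belong to which document), and `related` is an illustrative helper, not the Semantic Knowledge Graph API. It only counts co-occurrences; edge scoring is covered on a later slide.

```python
from collections import Counter, defaultdict

# Hypothetical documents; the skill-to-title assignments are illustrative only.
docs = {
    "doc1": {"job_title": "Software Engineer", "skills": ["java"]},
    "doc2": {"job_title": "Data Scientist", "skills": ["java", "scala"]},
    "doc3": {"job_title": "Java Developer", "skills": ["java", "scala", "hibernate"]},
    "doc4": {"job_title": "Registered Nurse", "skills": ["oncology"]},
}

# Inverted index on the skills field: skill -> set of doc ids.
skill_to_docs = defaultdict(set)
for doc_id, fields in docs.items():
    for skill in fields["skills"]:
        skill_to_docs[skill].add(doc_id)


def related(skill, field):
    """Count values of `field` that co-occur with `skill` (hypothetical helper)."""
    matching_docs = skill_to_docs.get(skill, set())  # inverted index lookup
    counts = Counter()
    for doc_id in matching_docs:  # forward index lookup per matching doc
        values = docs[doc_id][field]
        for value in [values] if isinstance(values, str) else values:
            if value != skill:
                counts[value] += 1
    return dict(counts)


print(related("java", "skills"))
print(related("java", "job_title"))
```

Each non-zero count corresponds to an edge such as has_related_skill or has_related_job_title in the slide's graph view.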
  • 29. Search engines also do relevancy ranking
    $$\mathrm{Score}(q, d) = \sum_{t \in q} \mathrm{idf}(t) \cdot \frac{\mathrm{tf}(t \in d) \cdot (k + 1)}{\mathrm{tf}(t \in d) + k \cdot \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}$$
    Where: t = term; d = document; q = query; i = index
    $\mathrm{tf}(t \in d) = \mathrm{numTermOccurrencesInDocument}^{1/2}$
    $\mathrm{idf}(t) = 1 + \log\left(\mathrm{numDocs} / (\mathrm{docFreq} + 1)\right)$
    $|d| = \sum_{t \in d} 1$
    $\mathrm{avgdl} = \left(\sum_{d \in i} |d|\right) / \left(\sum_{d \in i} 1\right)$
    k = Free parameter. Usually ~1.2 to 2.0. Increases term frequency saturation point.
    b = Free parameter. Usually ~0.75. Increases impact of document normalization.
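A small sketch of the scoring formula on this slide, which has the shape of BM25. It follows the slide's own definitions of tf (square root of the occurrence count) and idf, which differ from the formulas used by some BM25 implementations. The function name `score`, the whitespace tokenization, and the three-document corpus (the `desc` fields from an earlier slide) are assumptions made for illustration.

```python
import math


def score(query_terms, doc_tokens, corpus, k=1.2, b=0.75):
    """Score one document against a query using the slide's definitions."""
    num_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / num_docs  # average document length
    total = 0.0
    for term in query_terms:
        occurrences = doc_tokens.count(term)
        if occurrences == 0:
            continue  # terms absent from the document contribute nothing
        doc_freq = sum(1 for d in corpus if term in d)
        idf = 1 + math.log(num_docs / (doc_freq + 1))
        tf = math.sqrt(occurrences)
        length_norm = 1 - b + b * len(doc_tokens) / avgdl
        total += idf * (tf * (k + 1)) / (tf + k * length_norm)
    return total


corpus = [
    "software engineer at a great company".split(),
    "a registered nurse at hospital doing hard work".split(),
    "a software engineer or a java engineer doing work".split(),
]
query = ["java", "engineer"]
scores = [score(query, doc, corpus) for doc in corpus]
print(scores)
```

With this toy corpus the third document ranks first (it contains both query terms), the first document second, and the second document scores zero.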
  • 30. Scoring of Node Relationships (Edge Weights): Foreground vs. Background Analysis. Every term is scored against its context. The more commonly the term appears within its foreground context versus its background context, the more relevant it is to the specified foreground context.
    $$z = \frac{\mathrm{countFG}(x) - \mathrm{totalDocsFG} \cdot \mathrm{probBG}(x)}{\sqrt{\mathrm{totalDocsFG} \cdot \mathrm{probBG}(x) \cdot \left(1 - \mathrm{probBG}(x)\right)}}$$
    Foreground Query: "Hadoop"
    { "type":"keywords", "values":[ { "value":"hive", "relatedness":0.9773, "popularity":369 }, { "value":"java", "relatedness":0.9236, "popularity":15653 }, { "value":".net", "relatedness":0.5294, "popularity":17683 }, { "value":"bee", "relatedness":0.0, "popularity":0 }, { "value":"teacher", "relatedness":-0.2380, "popularity":9923 }, { "value":"registered nurse", "relatedness":-0.3802, "popularity":27089 } ] }
    We are essentially boosting terms which are more related to some known feature (and ignoring terms which are equally likely to appear in the background corpus).
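The z-score on this slide can be computed directly from four counts. The sketch below assumes that probBG(x) is the fraction of background documents containing the term; the function and parameter names and the example counts are hypothetical. The relatedness values in the slide's JSON lie between -1 and 1, so some normalization of z is applied that the slide does not show and that this sketch does not attempt to reproduce.

```python
import math


def z_score(count_fg, total_docs_fg, count_bg, total_docs_bg):
    """Foreground vs. background z-score for a term x.

    Assumption: probBG(x) = count_bg / total_docs_bg.
    """
    prob_bg = count_bg / total_docs_bg
    expected_fg = total_docs_fg * prob_bg  # count expected if x were unrelated
    std_dev = math.sqrt(total_docs_fg * prob_bg * (1 - prob_bg))
    if std_dev == 0:
        return 0.0  # term never (or always) occurs in the background
    return (count_fg - expected_fg) / std_dev


# Hypothetical counts: 1,000 foreground docs, 1,000,000 background docs,
# and a term found in 10,000 background docs (probBG = 0.01).
over_represented = z_score(300, 1000, 10_000, 1_000_000)
as_expected = z_score(10, 1000, 10_000, 1_000_000)
under_represented = z_score(5, 1000, 10_000, 1_000_000)
print(over_represented, as_expected, under_represented)
```

A term that shows up in the foreground more often than its background rate predicts gets a positive score, one that shows up at exactly the background rate scores zero, and one that shows up less often gets a negative score.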
  • 31. Assertion 2 (Summary): That graph is very rich, but is a compression of meaning into a lossy format. Much of data science is essentially the decompression from this lossy format into a reconstituted form.
  • 32. Assertion 3: Every instance of a word or phrase you ever encounter has a unique meaning.
  • 33. Thought Exercise What do you think of when I say the word “driver”? What about “architect”?
  • 34. Ambiguity Example Related Keywords (representing multiple meanings) driver truck driver, linux, windows, courier, embedded, cdl, delivery architect autocad drafter, designer, enterprise architect, java architect, designer, architectural designer, data architect, oracle, java, architectural drafter, autocad, drafter, cad, engineer … … Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
  • 35. Use Case: Query Disambiguation Example Related Keywords (representing multiple meanings) driver truck driver, linux, windows, courier, embedded, cdl, delivery architect autocad drafter, designer, enterprise architect, java architect, designer, architectural designer, data architect, oracle, java, architectural drafter, autocad, drafter, cad, engineer … … Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
  • 36. Disambiguated meanings (represented as term vectors) Example Related Keywords (Disambiguated Meanings) architect 1: enterprise architect, java architect, data architect, oracle, java, .net 2: architectural designer, architectural drafter, autocad, autocad drafter, designer, drafter, cad, engineer driver 1: linux, windows, embedded 2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier designer 1: design, print, animation, artist, illustrator, creative, graphic artist, graphic, photoshop, video 2: graphic, web designer, design, web design, graphic design, graphic designer 3: design, drafter, cad designer, draftsman, autocad, mechanical designer, proe, structural designer, revit … … Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
  • 37. Using the disambiguated meanings In a situation where a user searches for an ambiguous phrase, what information can we use to pick the correct underlying meaning? 1. Any pre-existing knowledge about the user: • User is a software engineer • User has previously run searches for “c++” and “linux” 2. Context within the query: User searched for windows AND driver vs. courier OR driver 3. If all else fails (and there is no context), use the most commonly occurring meaning. driver 1: linux, windows, embedded 2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
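One simple way to apply the three rules on this slide is to count how many context terms overlap with each candidate meaning. This is a sketch under stated assumptions, not the method from the cited paper: the `SENSES` table copies the two "driver" term vectors from the slide, `pick_sense` is a hypothetical helper, and `default_index` stands in for "the most commonly occurring meaning", which the slide does not identify.

```python
# Term vectors for the two meanings of "driver", copied from the slide.
SENSES = {
    "driver": [
        {"linux", "windows", "embedded"},
        {"truck driver", "cdl driver", "delivery driver", "class b driver", "cdl", "courier"},
    ],
}


def pick_sense(term, context_terms, default_index=0):
    """Return the index of the sense sharing the most terms with the context.

    context_terms can mix user knowledge (past searches, profile keywords)
    with other keywords from the current query. default_index is an
    assumption standing in for the most common meaning.
    """
    senses = SENSES[term]
    context = {c.lower() for c in context_terms}
    overlaps = [len(sense & context) for sense in senses]
    best = max(overlaps)
    if best == 0:
        return default_index  # no usable context: fall back to the default
    return overlaps.index(best)


print(pick_sense("driver", ["windows"]))       # query context
print(pick_sense("driver", ["c++", "linux"]))  # prior searches by the user
print(pick_sense("driver", ["courier"]))
```

A production system would likely weight the terms in each vector rather than count raw overlaps, but the control flow (user context, then query context, then a default) follows the slide.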
  • 38. Thought Exercise What do you think of when I say the word “Facebook”?
  • 39. Every term or phrase is a Context-dependent cluster of meaning with an ambiguous label
  • 40. What does “love” mean? http://localhost:8983/solr/thesaurus/skg
  • 41. What does “love” mean in the context of “hug”? http://localhost:8983/solr/thesaurus/skg
  • 42. What does “love” mean in the context of “child”? http://localhost:8983/solr/thesaurus/skg
  • 43. My Three Assertions (Recap) 1) Unstructured data is actually “hyper-structured” data. It is a graph that contains much more structure than typical “structured data.” 2) That graph is very rich, but is a compression of meaning into a lossy format. Much of data science is essentially the decompression from this lossy format into a reconstituted form. 3) Most Important: Every instance of a word or phrase you ever encounter has a unique meaning.
  • 45. So why all the philosophy? Because it’s much more important to intuitively understand the kinds of problems we’re trying to solve with Semantic Search than to jump head-first into the solution. Because otherwise we may build the wrong thing, which can sometimes be worse than not doing anything. And once you have an intuitive sense of the problems you need to solve, you can confidently use the tools I’m about to describe to build the right solution for your specific domain.
  • 46. So what’s the end goal here?
    User’s Query: machine learning research and development Portland, OR software engineer AND hadoop, java
    Traditional Query Parsing: (machine AND learning AND research AND development AND portland) OR (software AND engineer AND hadoop AND java)
    Semantic Query Parsing: "machine learning" AND "research and development" AND "Portland, OR" AND "software engineer" AND hadoop AND java
    Semantically Expanded Query: ("machine learning"^10 OR "data scientist" OR "data mining" OR "artificial intelligence") AND ("research and development"^10 OR "r&d") AND ("Portland, OR"^10 OR "Portland, Oregon" OR {!geofilt pt=45.512,-122.676 d=50 sfield=geo}) AND ("software engineer"^10 OR "software developer") AND (hadoop^10 OR "big data" OR hbase OR hive) AND (java^10 OR j2ee)
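The last step on this slide, turning identified phrases into a boosted, expanded query, can be sketched as string assembly. The `EXPANSIONS` table below is hard-coded from the slide's example; in the system described in this talk those related terms would be learned from data rather than listed by hand. The helper names are hypothetical, and the location clause with its `{!geofilt ...}` filter is left out for brevity.

```python
# Related terms copied from the slide's expanded query (hard-coded here;
# assumed to come from a learned model or knowledge graph in practice).
EXPANSIONS = {
    "machine learning": ["data scientist", "data mining", "artificial intelligence"],
    "research and development": ["r&d"],
    "software engineer": ["software developer"],
    "hadoop": ["big data", "hbase", "hive"],
    "java": ["j2ee"],
}


def quote(term):
    """Quote anything that is not a single alphanumeric token."""
    return term if term.isalnum() else f'"{term}"'


def expand(phrases, boost=10):
    """Build one boosted OR-clause per phrase and AND the clauses together."""
    clauses = []
    for phrase in phrases:
        alternatives = [f"{quote(phrase)}^{boost}"]  # original phrase gets the boost
        alternatives += [quote(alt) for alt in EXPANSIONS.get(phrase, [])]
        clauses.append("(" + " OR ".join(alternatives) + ")")
    return " AND ".join(clauses)


expanded = expand(["machine learning", "software engineer", "hadoop", "java"])
print(expanded)
```

Boosting the original phrase keeps exact matches ranked above documents that only match an expansion, which is the role of `^10` in the slide's query.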
  • 47. Semantic Search Components: • Apache Solr • Solr Text Tagger • Semantic Knowledge Graph • Statistical Phrase Identifier • Fusion Semantic Query Pipelines • Fusion AI Synonyms Job • Fusion AI Token & Phrase Spell Correction Job • Fusion AI Head/Tail Analysis Job • Fusion AI Phrase Identification Job • Fusion Query Rules Engine
  • 48. In 2018, Lucidworks has added the following capabilities to Solr: • Solr Text Tagger • Semantic Knowledge Graph • Statistical Phrase Identifier
  • 49. which all integrate seamlessly with the following in Fusion: • Fusion Semantic Query Pipelines • Fusion AI Synonyms Job • Fusion AI Token & Phrase Spell Correction Job • Fusion AI Head/Tail Analysis Job • Fusion AI Phrase Identification Job • Fusion Query Rules Engine
  • 50. Through these tools, Fusion self-learns domain-specific semantic relationships
  • 51. …and enables domain experts to easily accept or adjust the built-in AI: completely deferring to Fusion’s AI, trusting it above a certain confidence level, or even manually approving every suggestion.
  • 52. Fusion AI Semantic Search Jobs
  • 58. Semantic Query Parsing. Identification of phrases in queries using two approaches plus a merge step: 1) Check a dictionary of known terms that is continuously built, cleaned, and refined based upon common inputs from interactions with real users of the system. We use the Solr Text Tagger (already covered) for this at query time.* 2) Also invoke a probabilistic query parser to dynamically identify unknown phrases using statistics from a corpus of data (language model). 3) Apply a final algorithm to choose the best merge when the two approaches disagree. *K. AlJadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.
  • 59. Probabilistic Query Parsing. Goal: given a query, predict which combinations of keywords should be combined together as phrases. Example: senior java developer hadoop. Possible Parsings: (senior, java, developer, hadoop); ("senior java", developer, hadoop); ("senior java developer", hadoop); ("senior java developer hadoop"); ("senior java", "developer hadoop"); (senior, "java developer", hadoop); (senior, java, "developer hadoop"). Source: Trey Grainger, “Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disambiguation”, Bay Area Search Meetup, November 2015.
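The candidate parsings on this slide are the ways of cutting the keyword sequence into contiguous phrases. A minimal sketch of that enumeration follows; `segmentations` is a hypothetical helper name. The probabilistic part, scoring each candidate against a language model and picking the most likely one, is not shown because the slide does not specify the model.

```python
from itertools import product


def segmentations(tokens):
    """Enumerate every split of `tokens` into contiguous phrases."""
    if not tokens:
        return []
    results = []
    # Each gap between adjacent tokens is either a phrase break or not.
    for breaks in product([True, False], repeat=len(tokens) - 1):
        parsing, current = [], [tokens[0]]
        for token, is_break in zip(tokens[1:], breaks):
            if is_break:
                parsing.append(" ".join(current))
                current = [token]
            else:
                current.append(token)
        parsing.append(" ".join(current))
        results.append(parsing)
    return results


candidates = segmentations("senior java developer hadoop".split())
for candidate in candidates:
    print(candidate)
```

For n keywords this yields 2^(n-1) candidates, so the four-keyword example produces eight: the seven listed on the slide plus (senior, "java developer hadoop"). The exponential growth is one reason a parser would prune candidates using corpus statistics rather than score them all.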
  • 64. A few last thoughts to leave you with…
  • 65. Words of Advice: You can’t improve what you can’t measure.
  • 66. Importance of Feedback Loops User Searches User Sees Results User takes an action Users’ actions inform system improvements Southern Data Science
  • 68. Thank you! Trey Grainger, trey.grainger@lucidworks.com, @treygrainger. http://solrinaction.com (Discount code: ctwactivate18). Other presentations: http://www.treygrainger.com. #Activate18 #ActivateSearch