Querying your database in natural language
PyData – Silicon Valley 2014
Daniel F. Moisset – dmoisset@machinalis.com
Data is everywhere
Collecting data is not the problem, but what to do with it
Any operation starts with selecting/filtering data
Search: a classical approach
Used by:
● Google
● Wikipedia
● Lucene/Solr
Performance can be improved:
● Stemming/synonyms
● Sorting data by relevance
Limits of keyword-based approaches
Query Languages
● SQL
● Many NoSQL approaches
● SPARQL
● MQL
Allow complex, accurate queries:
SELECT array_agg(players), player_teams
FROM (
SELECT DISTINCT t1.t1player AS players, t1.player_teams
FROM (
SELECT
p.playerid AS t1id,
concat(p.playerid,':', p.playername, ' ') AS t1player,
array_agg(pl.teamid ORDER BY pl.teamid) AS player_teams
FROM player p
LEFT JOIN plays pl ON p.playerid = pl.playerid
GROUP BY p.playerid, p.playername
) t1
INNER JOIN (
SELECT
p.playerid AS t2id,
array_agg(pl.teamid ORDER BY pl.teamid) AS player_teams
FROM player p
LEFT JOIN plays pl ON p.playerid = pl.playerid
GROUP BY p.playerid, p.playername
) t2 ON t1.player_teams=t2.player_teams AND t1.t1id <> t2.t2id
) innerQuery
GROUP BY player_teams
Natural Language Queries
Getting popular:
● Wolfram Alpha
● Apple Siri
● Google Now
Pros and cons:
● Very accessible, trivial learning curve
● Still weak in coverage: most applications have a list of “sample questions”
Outline of this talk: the Quepy approach
● Overview of our solution
● Simple example
● DSL
● Parser
● Question templates
● Quepy applications
● Benefits
● Limitations
Quepy
● Open source (BSD license): https://github.com/machinalis/quepy
● Status: usable, 2 demos available (dbpedia + freebase). Online demo at http://quepy.machinalis.com/
● Complete documentation: http://quepy.readthedocs.org/en/latest/
● You're welcome to get involved!
Overview of the approach
● Parsing
● Match + intermediate representation
● Query generation & DSL

“What is the airspeed velocity of an unladen swallow?”

What|what|WP is|be|VBZ the|the|DT airspeed|airspeed|NN velocity|velocity|NN of|of|IN an|an|DT unladen|unladen|JJ swallow|swallow|NN

SELECT DISTINCT ?x1 WHERE {
  ?x0 kingdom "Animal".
  ?x0 name "unladen swallow".
  ?x0 airspeed ?x1.
}
Parsing
● Not done at the character level but at the word level
● Word = Token + Lemma + POS
  “is” → is|be|VBZ (VBZ means “verb, 3rd person singular, present tense”)
  “swallows” → swallows|swallow|NNS (NNS means “noun, plural”)
● NLTK is smart enough to know that “swallows” here means the bird (noun) and not the action (verb)
● Question rule = “regular expressions”
  Token("what") + Lemma("be") + Question(Pos("DT")) + Plus(Pos("NN"))
  The word “what”, followed by any form of the verb “to be”, optionally followed by a determiner (articles, “all”, “every”), followed by one or more nouns
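To make the Token/Lemma/Pos idea concrete, here is a toy matcher over (token, lemma, POS) triples. This is an illustrative sketch only: Quepy's real predicates are built on a regex-over-objects engine, and the names below merely mimic their shape.

```python
# Toy stand-ins for Quepy-style word predicates. Each word is a
# (token, lemma, pos) triple; each predicate tests one field.
def Token(t):
    return lambda w: w[0].lower() == t

def Lemma(l):
    return lambda w: w[1] == l

def Pos(p):
    return lambda w: w[2] == p

def matches(words, rule):
    # rule: list of (predicate, min_count, max_count) entries,
    # a crude encoding of "exactly one" / Question / Plus.
    i = 0
    for pred, lo, hi in rule:
        n = 0
        while i < len(words) and n < hi and pred(words[i]):
            i += 1
            n += 1
        if n < lo:
            return False
    return i == len(words)

# "What is the airspeed velocity", tagged as on the slide
words = [("What", "what", "WP"), ("is", "be", "VBZ"),
         ("the", "the", "DT"), ("airspeed", "airspeed", "NN"),
         ("velocity", "velocity", "NN")]

# Token("what") + Lemma("be") + Question(Pos("DT")) + Plus(Pos("NN"))
rule = [(Token("what"), 1, 1), (Lemma("be"), 1, 1),
        (Pos("DT"), 0, 1), (Pos("NN"), 1, len(words))]

print(matches(words, rule))  # True
```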
Intermediate representation
● Graph-like, with some known values and some holes (x0, x1, …). Always has a “root” (house-shaped in the picture)
● Similar to knowledge databases
● Easy to build from Python code
Code generator
● Built-in for MQL
● Built-in for SPARQL
● Possible approaches for SQL, other languages
● DSL-guided
● Outputs the query string (Quepy does not connect to a database)
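As a rough illustration of the triples-to-query step (not Quepy's actual code generator), a hand-rolled serializer can turn a list of (subject, relation, object) triples, with "?"-prefixed holes, into a SPARQL string like the one shown earlier:

```python
# Minimal sketch of SPARQL generation from a triple-based intermediate
# representation. Strings starting with "?" are open variables (holes);
# anything else is treated as a literal value.
def to_sparql(triples, head):
    lines = []
    for s, rel, o in triples:
        obj = o if o.startswith("?") else '"%s"' % o
        lines.append("  %s %s %s." % (s, rel, obj))
    return "SELECT DISTINCT %s WHERE {\n%s\n}" % (head, "\n".join(lines))

# The "unladen swallow" query: x0 is the swallow entity,
# x1 (the head) is the airspeed we are asking for.
ir = [("?x0", "kingdom", "Animal"),
      ("?x0", "name", "unladen swallow"),
      ("?x0", "airspeed", "?x1")]
print(to_sparql(ir, "?x1"))
```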
Code examples
DSL

class DefinitionOf(FixedRelation):
    relation = "/common/topic/description"
    reverse = True

class IsMovie(FixedType):
    fixedtype = "/film/film"

class IsPerformance(FixedType):
    fixedtype = "/film/performance"

class PerformanceOfActor(FixedRelation):
    relation = "/film/performance/actor"

class HasPerformance(FixedRelation):
    relation = "/film/film/starring"

class NameOf(FixedRelation):
    relation = "/type/object/name"
    reverse = True
DSL

Given a thing x0, its definition:

DefinitionOf(x0)

Given an actor x2, the movies where x2 acts:

performances = IsPerformance() + PerformanceOfActor(x2)
movies = IsMovie() + HasPerformance(performances)
x3 = NameOf(movies)
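One way to picture what the + operator does is as a merge of fact sets about the same node. The sketch below is a hypothetical stand-in for the DSL classes, not Quepy's internals:

```python
# Hedged sketch of DSL composition: each expression holds a bag of
# triples about one node, and "+" merges two expressions about the
# same (root) node. Class and predicate names are illustrative.
class Expr:
    def __init__(self, triples=()):
        self.triples = list(triples)

    def __add__(self, other):
        # Combining two expressions asserts both sets of facts
        # about the same node.
        return Expr(self.triples + other.triples)

def IsPerformance():
    return Expr([("x", "/type/object/type", "/film/performance")])

def PerformanceOfActor(actor):
    return Expr([("x", "/film/performance/actor", actor)])

performances = IsPerformance() + PerformanceOfActor("Harrison Ford")
print(performances.triples)
```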
Parsing: Particles and templates

class WhatIs(QuestionTemplate):
    regex = (Lemma("what") + Lemma("be") +
             Question(Pos("DT")) + Thing() + Question(Pos(".")))

    def interpret(self, match):
        label = DefinitionOf(match.thing)
        return label

class Thing(Particle):
    regex = Question(Pos("JJ")) + Plus(Pos("NN") | Pos("NNP") | Pos("NNS"))

    def interpret(self, match):
        return HasKeyword(match.words.tokens)
Parsing: “movies starring <actor>”

More DSL:

class IsPerson(FixedType):
    fixedtype = "/people/person"
    fixedtyperelation = "/type/object/type"

class IsActor(FixedType):
    fixedtype = "Actor"
    fixedtyperelation = "/people/person/profession"
Parsing: A more complex particle

And then a new Particle:

class Actor(Particle):
    regex = Plus(Pos("NN") | Pos("NNS") | Pos("NNP") | Pos("NNPS"))

    def interpret(self, match):
        name = match.words.tokens
        return IsPerson() + IsActor() + HasKeyword(name)
Parsing: A more complex template

class ActedOnQuestion(QuestionTemplate):
    acted_on = (Lemma("appear") | Lemma("act") | Lemma("star"))
    movie = (Lemma("movie") | Lemma("movies") | Lemma("film"))
    regex = ((Question(Lemma("list")) + movie + Lemma("with") + Actor()) |
             (Question(Pos("IN")) + (Lemma("what") | Lemma("which")) +
              movie + Lemma("do") + Actor() + acted_on + Question(Pos("."))) |
             (Question(Lemma("list")) + movie + Lemma("star") + Actor()))

Matches questions like:
“list movies with Harrison Ford”
“list films starring Harrison Ford”
“In which film does Harrison Ford appear?”
Parsing: A more complex template

class ActedOnQuestion(QuestionTemplate):
    # ...
    def interpret(self, match):
        performance = IsPerformance() + PerformanceOfActor(match.actor)
        movie = IsMovie() + HasPerformance(performance)
        movie_name = NameOf(movie)
        return movie_name
Apps: gluing it all together
● You build a Python package with quepy startapp myapp
● There you add your DSL and question templates
● Then you configure it by editing myapp/settings.py (output query language, data encoding)

You can use that with:

app = quepy.install("myapp")
question = "What is love?"
target, query, metadata = app.get_query(question)
db.execute(query)
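To show the calling convention without a configured app or database, here is a hedged sketch in which get_query is a stub with the same (target, query, metadata) return shape; the real method parses the question against your templates, and the stub's fallback behavior (returning None for unmatched questions) is an assumption made for illustration only:

```python
# Stand-in for app.get_query: returns (target, query, metadata).
# The canned question/query pair and the None fallback are purely
# illustrative, not Quepy's implementation.
def get_query(question):
    known = {
        "what is love?":
            ("?x1",
             'SELECT DISTINCT ?x1 WHERE { ?x0 name "love". '
             '?x0 definition ?x1. }',
             {}),
    }
    return known.get(question.lower(), (None, None, None))

target, query, metadata = get_query("What is love?")
if query is None:
    print("Sorry, I don't understand the question.")
else:
    print(query)  # hand this string to your SPARQL/MQL backend
```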
The good things
● Effort to add question templates is small (minutes to hours), and the benefit is linear with respect to effort
● Good for industry applications
● Low specialization required to extend
● Human work is very parallelizable
● Easy to get many people to work on questions
● Better for domain-specific databases
Limitations
● Better for domain-specific databases
● It won't scale to massive numbers of question templates (they start to overlap/contradict each other)
● Hard to add computation (compare: Wolfram Alpha) or deduction (can be added in the database)
● Not very fast (an implementation issue, not a design issue)
● Requires a structured database
Future directions
● Testing this with other databases
● Improving performance
● Collecting uncovered questions, adding machine learning to learn new patterns
Q & A
You can also reach me at:
dmoisset@machinalis.com
Twitter: @dmoisset
http://machinalis.com/
Thanks!
Editor's Notes

  • #2: Hello everyone, my name is Daniel Moisset. I work at Machinalis, a company based in Argentina which builds data processing solutions for other companies. I&amp;apos;m not a native English speaker, so please just wave a bit if I&amp;apos;m not speaking clearly or just not making any sense. The topic I want to introduce today is about the use of natural language to query databases and a tool that implements a possible approach to solve this Issue Let me start by trying to show you why this problem is relevant. .
  • #3: The problem I&amp;apos;ll discuss today is not about how to get your data. If you&amp;apos;re here, chances are you have more data that you can handle. The big problem today is to put to work all the data that comes from different sources and is piling up in some database. And of course, the first step at least of that problem is getting the data you want, that is, making &amp;quot;queries&amp;quot;. Of course you&amp;apos;ll want to do more than queries later, but selecting the information you want is typically the first step
  • #4: A typical approach for large bodies of text-based data is the “keyword” based approach. The basic idea is that the user provides a list of keywords, and the items that contain those keywords are retrieved. There are a lot of well known tricks to improve this, like detecting the relevance of documents with respect to user keywords, doing some preprocessing of the input and the index so I can find documents without an exact keyword match but a similar word instead, etc. This approach has proven very successful in many different contexts, with Google as a leading example of a large database that probably all of us query frequently using keyword-based queries, and many tools to build search bars into your software. It works so well that you might wonder if there&amp;apos;s any significant improvement to make by trying a different approach.
  • #5: Keyword-based lookups are really good when you know what you&amp;apos;re looking for, typically the name of the entity you&amp;apos;re interested in, or some entity that is uniquely related to that other entity. It&amp;apos;s very simple to get information about Albert Einstein, or figuring out who proposed the Theory of Relativity even if I don&amp;apos;t remember Albert Einstein&amp;apos;s name.
  • #6: However, it&amp;apos;s not easy to Google &amp;quot;What&amp;apos;s the name of that place in California with a lot of movie studios?&amp;quot; &amp;quot;The one with the big white sign in the hill?&amp;quot;. None of the keywords I used to formulate that question are very good, and other similar formulations will not help us. It&amp;apos;s not a problem of having the data, even if I have a database containing records about movie studios and their locations, but a problem of how you interact with the database. Another problem of keyword-based lookups is that it is heavily dependent on data which is mainly textual. It works fine for the web, but if I have a database with flight schedules for many airlines, a keyword based search will provide me with a very limited interface for making queries. Even with a database with a lot of text, like the schedule for the conference, it&amp;apos;s not easy to answer questions like &amp;quot;Which PyData speakers are affiliated with the sponsors&amp;quot; (without doing it manually)
  • #7: The solution we have for this problem, which may be summarized as &amp;quot;finding data by the stuff related to it&amp;quot; are query languages. We have many of those, depending on how we want to structure our data. All of these allow us to write very accurate and very complicated queries. And by “us” I mean the people in this room, which are developers and data scientists. Which is the weakness of this approach: it&amp;apos;s not an interface that you can provide to end-users. There&amp;apos;s a lot of data that needs to be made available to people who can&amp;apos;t or won&amp;apos;t learn a complex language to access the information. Not because they&amp;apos;re stupid, but because their field of expertise is another one.
  • #8: That leaves us with a need to query structured, possibly non textual, related information in a way that does not require much expertise to the person making the queries. And a straightforward way to solve that need, is allowing the data to be queried in the language that the user already knows. Which brings us to the motivation for this talk. Natural language is getting as a popular way to make queries and/or enter commands. It provides a very user friendly experience, even when most current tools are somewhat limited in the coverage they can provide. By “coverage” here I mean how many of the relevant questions are actually understood by the computer. Currently, successful applications like the ones I show here have a guide to the user describing which forms of questions are &amp;quot;valid&amp;quot;
  • #9: After this introduction and the motivation to the problem, let me outline where I&amp;apos;m trying to get to during this talk: Some very smart people who work with me studied different approaches to a solution and came up with a tool called Quepy which implements that approach. Of course it&amp;apos;s not the only possible approach, but it has several nice properties that are valuable to us in an industrial context. I&amp;apos;ll describe the approach in general and get to a quick overview on how to code a simple quepy app. Then I&amp;apos;ll discuss what we most like about quepy, and the limits to the scope of the problem it solves.
  • #10: Just in case you're eager to see the code instead of listening to me, all of it is available online, so I'll leave this slide up for 10 seconds so you can get a picture, and then move on.
  • #11: At its core, the Quepy approach is not unlike a compiler. The input is a string with a question, which is sent through a parser that builds a data structure called an "intermediate representation". That representation is then converted to a database query, which is the output from Quepy. The parsing is guided by rules provided by the application writer, which describe what kinds of questions are valid.
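To make the compiler analogy concrete, here is a minimal, stdlib-only Python sketch of the three stages. Every name and the toy template below are illustrative inventions for this sketch, not Quepy's actual API:

```python
# Toy pipeline: question string -> parser -> intermediate
# representation -> database query. Everything here is a made-up
# illustration of the data flow; Quepy's real classes differ.

def tokenize(question):
    # Stage 0: split the question into lowercase tokens.
    return question.rstrip("?").lower().split()

def parse(tokens):
    # Stage 1: match against hand-written question templates and build
    # an intermediate representation (here, just a dict).
    if tokens[:2] in (["what", "is"], ["what", "are"]):
        return {"ask": "definition", "of": tokens[-1]}
    raise ValueError("no question template matched")

def generate(ir):
    # Stage 2: turn the intermediate representation into a query string
    # for the target database (a SPARQL-flavored toy here).
    return ('SELECT ?def WHERE {{ ?s rdfs:label "{0}" . '
            '?s :description ?def }}').format(ir["of"])

ir = parse(tokenize("What are bananas?"))
query = generate(ir)
```

The application writer only supplies the rules used in stage 1 and the schema mapping used in stage 2; the plumbing between stages is fixed.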
  • #12: The conversion is guided by some declarative information about the structure of the database that the application writer must define. We call this definition the "DSL", for Domain Specific Language. As you might have noted from this description, what we built is not a universal solution that you can throw over your database, but something that requires programming customization, both in how it interacts with the user and in how it interacts with your database.
  • #13: Let's take a deeper look at the parser. The first step of the parser provided by Quepy is splitting the text into parts, a process also known as tokenization. Once this is done you have a sequence of word objects, each containing information about one word: the token, which is the original word as it appears in the text; the lemma, which is the root word for the token (the base verb "speak" for a word like "speaking"); and a part-of-speech tag, which indicates whether the word is a noun, an adjective, a verb, etc. This list of words is then matched against a set of question templates. Each question template defines a pattern, which is something that looks like a regular expression, where patterns can describe property matches over the token, lemma, and/or part of speech.
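These word objects can be pictured like this. The tiny lemma and tag tables are hard-coded stand-ins for what a real system derives with an NLP toolkit's lemmatizer and tagger:

```python
from collections import namedtuple

# Each word carries the three properties that patterns can match on:
# the surface token, its lemma, and a part-of-speech tag.
Word = namedtuple("Word", ["token", "lemma", "pos"])

# Toy lookup tables; a real system computes these with a tagger and a
# lemmatizer instead of hard-coding them.
LEMMAS = {"are": "be", "is": "be", "bananas": "banana",
          "speaking": "speak"}
TAGS = {"what": "WP", "are": "VBP", "is": "VBZ", "a": "DT",
        "banana": "NN", "bananas": "NNS"}

def analyze(question):
    tokens = question.rstrip("?").lower().split()
    return [Word(t, LEMMAS.get(t, t), TAGS.get(t, "NN")) for t in tokens]

words = analyze("What are bananas?")
# words[1] is Word(token='are', lemma='be', pos='VBP')
```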
  • #14: Let's assume a valid match on the question template. In that case, the question template provides a little piece of code that builds the intermediate representation. The intermediate representation of a query is a small graph, where vertices are entities in the database, edges are relations between entities, and both vertices and edges can be labeled or left open. There's one special vertex called the "head", which is always open and indicates what the value for the "answer" is. This is an abstract, backend-independent representation of the query, although it is intended mainly for use with knowledge databases, which usually have this graph structure and allow finding matching subgraphs. Quepy provides a way to build these trees from Python code in a way that's much more natural than just describing the structure top-down. Trees are built by composing tree parts that have some meaningful semantics in your domain. Those components, along with the mapping of those semantics to your database schema, form what we call the DSL.
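A stdlib-only sketch of such an intermediate representation is below. The `Expression`/`decapitate` naming loosely mirrors Quepy's internals, but this implementation is invented for illustration:

```python
import itertools

_ids = itertools.count()

class Expression:
    """Toy IR: a set of labeled edges (subject, relation, object) over
    variables, plus a distinguished open 'head' node holding the
    answer. Invented for illustration; not Quepy's real class."""

    def __init__(self, triples=(), head=None):
        self.triples = list(triples)
        self.head = head if head is not None else "?x%d" % next(_ids)

    def __add__(self, other):
        # Merging identifies the two heads: all constraints now apply
        # to the same node.
        renamed = [tuple(self.head if v == other.head else v for v in t)
                   for t in other.triples]
        return Expression(self.triples + renamed, self.head)

    def decapitate(self, relation):
        # Grow the graph: the old head becomes an internal node, and a
        # fresh open head hangs off it through `relation`.
        new = Expression(self.triples)
        new.triples.append((self.head, relation, new.head))
        return new

def has_keyword(name):
    # "The object whose name/key is `name`".
    e = Expression()
    e.triples.append((e.head, "key", '"%s"' % name))
    return e

banana = has_keyword("banana")
definition = banana.decapitate("description")  # head is now the answer
```

Note that composing expressions only grows a graph in memory; nothing touches the database yet.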
  • #15: From the internal representation tree and the DSL information, it is possible to automatically build a query string that can be sent to your database. At this time, we have built query generators for SPARQL, which is the de facto standard for knowledge databases, and MQL, the Metaweb Query Language (used by Google's Freebase). It might be possible to build custom generators for other languages, or to use some kind of adapter (I know there are SPARQL endpoints that you can put in front of a SQL database, for example). The DSL information needed here is somewhat schema-specific but is very simple to define, in a declarative way.
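The generation step is mostly mechanical once the graph exists. Here is a toy sketch of what a SPARQL generator does with a set of IR edges (the Freebase-style property paths are just sample data; the generator itself is invented):

```python
def to_sparql(triples, head):
    # Each IR edge becomes one SPARQL triple pattern; the open head
    # variable becomes the SELECT target.
    patterns = " .\n  ".join("%s %s %s" % t for t in triples)
    return "SELECT %s WHERE {\n  %s .\n}" % (head, patterns)

ir_triples = [("?x0", "/type/object/name", '"banana"'),
              ("?x0", "/common/topic/description", "?x1")]
print(to_sparql(ir_triples, "?x1"))
```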
  • #16: Let me show you some code examples, making queries on Freebase with a couple of sample question templates. We want to answer "What are bananas?" and "In which movies did Harrison Ford appear?". We will be doing this on Freebase; but don't worry, there's no need for you to know the Freebase schema to understand this talk. We'll cover the information we need as we go. I'm going to show you some complete code, but this is not a tutorial, so I'm not going to go over it line by line explaining what everything does. The purpose of the code I'm showing is to display the different parts that you'll need to put together, and how much (or how little) work is needed to build each.
  • #17: To build this example, the easiest way is to start with the DSL. We'll start by defining some simple concepts that look naturally related to the queries we want to make. Let's take a look at the `DefinitionOf` class. What we're saying here is how to get the definition of something. In Freebase, entities are related to their definitions by the "/common/topic/description" attribute (this is why we say that this is a `FixedRelation`; in Freebase, attributes are also represented as relations). The "reverse = True" indicates that we actually fix the left side of the relation to a known value and want to learn about the right side. Without it, this would be the opposite query: give me an object given its definition.
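Based on Quepy's documentation, the class looks roughly like the snippet below. The `FixedRelation` base here is a minimal stand-in so the example runs without Quepy installed; the real one comes from `quepy.dsl`:

```python
# Minimal stand-in for quepy.dsl.FixedRelation, just so this sketch is
# self-contained. The real base class builds the IR graph.
class FixedRelation(object):
    relation = None
    reverse = False

    def __init__(self, destination):
        self.destination = destination

class DefinitionOf(FixedRelation):
    # Freebase links an entity to its description through this
    # property. reverse = True fixes the left side of the relation (the
    # entity we already have) and leaves the right side open.
    relation = "/common/topic/description"
    reverse = True
```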
  • #18: This is all the DSL we need to answer "What are bananas?". The other query we wanted to make is rather more complex. Our database has movies, where each movie can have many related entities called "performances". Each performance relates to an actor, a character, etc. So we define some basic relations to identify the type of some entities using `FixedType`. `IsMovie` describes entities having the Freebase type "/film/film", and `IsPerformance` helps us recognize these "performance" objects. To link both types of entities, `PerformanceOfActor` queries which performances have a given actor, and `HasPerformance` allows us to query which movie has a given performance. Finally, in Freebase movies are complex objects, but when we show a result to the user we want to show a movie name, so `NameOf` gets the "/type/object/name" attribute of a movie, which is the movie title.
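Sketched the same way, with stand-in base classes instead of `quepy.dsl`. Note that the exact Freebase property paths for the performance links are my reconstruction of the film schema, not copied from the talk's slides:

```python
# Stand-ins for quepy.dsl.FixedType / FixedRelation so this runs alone.
class FixedType(object):
    fixedtype = None

class FixedRelation(object):
    relation = None
    reverse = False
    def __init__(self, destination=None):
        self.destination = destination

class IsMovie(FixedType):
    fixedtype = "/film/film"

class IsPerformance(FixedType):
    fixedtype = "/film/performance"

class PerformanceOfActor(FixedRelation):
    # Which performance objects point at a given actor.
    relation = "/film/performance/actor"
    reverse = True

class HasPerformance(FixedRelation):
    # Which movies contain a given performance ("starring" is the
    # Freebase link from a film to its performances).
    relation = "/film/film/starring"
    reverse = True

class NameOf(FixedRelation):
    # Human-readable title of an object.
    relation = "/type/object/name"
    reverse = True
```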
  • #19: The intermediate representation of queries is built on instances of these objects. For example, given an actor “a”, this expression gives the movies with “a” (slide). Note that the operations on the bottom are abstract operations between queries which build a larger query, none of this is touching the database but just building a tree.
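The compositional style can be mimicked with plain functions; in this invented toy each stand-in just records which wrapper was applied, to show that evaluating the expression only builds a description of the query:

```python
# Each "DSL" stand-in returns a list describing one more layer of the
# query; + concatenates constraint lists. Nothing queries a database.
def IsMovie():             return ["IsMovie"]
def PerformanceOfActor(x): return ["PerformanceOfActor"] + x
def HasPerformance(x):     return ["HasPerformance"] + x
def NameOf(x):             return ["NameOf"] + x

actor = ["HasKeyword:Harrison Ford"]
movies = NameOf(IsMovie() + HasPerformance(PerformanceOfActor(actor)))
# movies is just a nested description:
# ['NameOf', 'IsMovie', 'HasPerformance', 'PerformanceOfActor',
#  'HasKeyword:Harrison Ford']
```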
  • #20: Let's now see how to code the parser for the queries mentioned before. For each kind of question we can build a "question template". The first thing that a question template specifies is how to match the questions. The matching has to be flexible enough to capture variants of the question like "what is X", "what are X", "what is an X", "what is X?", which you can see we write in the regex here: we have a "what"-like word, followed by some form of the verb "to be", optionally followed by a "determiner", which is a word like "a", "an", "the", followed by a thing, which is what we want to look up, followed by a question mark. Note that I said "a thing" without being too explicit about what that means. Quepy allows you to define "particles", which are pieces of the question that you want to capture and that follow a particular pattern.
  • #21: Note that at the bottom I have defined what a Thing is, the definition consisting of a regular expression plus an intermediate representation for it. In this case, a thing is an optional adjective followed by one or more nouns. The semantics of a thing are given by the interpret method, where HasKeyword is a Quepy builtin with essentially the semantics of "the object with this primary key". It's shown in the slides as a dashed line. Our question template regex refers to Thing(), so in its interpret method it will have access to the already-built graph for the matched thing. So if we ask "What is a banana?", you'll end up with a valid match that builds the graph on the right, which corresponds to the appropriate query.
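A stdlib-only way to see the matching in action: encode each word as token/lemma/POS and run an ordinary regular expression over the encoding. This is a simulation of the idea (Quepy really builds these patterns with the refo library), and the pattern below is my approximation of the slide's template:

```python
import re

def encode(words):
    # words: list of (token, lemma, pos) triples.
    return " ".join("%s/%s/%s" % w for w in words) + " "

# "what"-lemma word, a form of "to be", an optional determiner, then a
# Thing: optional adjective plus one or more nouns (captured).
WHAT_IS = re.compile(
    r"\S+/what/\S+ "          # what
    r"\S+/be/\S+ "            # is / are
    r"(?:\S+/\S+/DT )?"       # optional "a" / "an" / "the"
    r"(?P<thing>(?:\S+/\S+/JJ )?(?:\S+/\S+/NNS? )+)")

def interpret(question_words):
    m = WHAT_IS.match(encode(question_words))
    if m is None:
        return None
    # The Thing particle's interpretation: HasKeyword over its tokens.
    tokens = [w.split("/")[0] for w in m.group("thing").split()]
    return {"HasKeyword": " ".join(tokens), "relation": "description"}

q = [("what", "what", "WP"), ("is", "be", "VBZ"),
     ("a", "a", "DT"), ("banana", "banana", "NN")]
# interpret(q) -> {'HasKeyword': 'banana', 'relation': 'description'}
```

Because the pattern matches on lemma and POS rather than raw tokens, the same template covers "what is a banana?", "what are bananas?", and so on.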
  • #22: Let's work on the more complex example. The first thing we'll require is some additional DSL to write the "Actor" particle. In Freebase, there's no actor type, but there's a "person" type and then an "actor" profession. That allows us to define "IsPerson" (that is, objects with the person type) and "IsActor" (that is, objects with the actor profession).
  • #23: This allows us to define the Actor particle, which matches a sequence of nouns and represents an object that is a person, works as an actor, and has as identifier the name in the match.
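The particle's interpretation just composes the three constraints. Below is a self-contained sketch with a toy constraint object; `/people/person` is a real Freebase type, but the merge semantics here are my simplification:

```python
class Constraint:
    # Toy conjunction of attribute constraints on a single entity.
    def __init__(self, **attrs):
        self.attrs = attrs

    def __add__(self, other):
        merged = dict(self.attrs)
        merged.update(other.attrs)
        return Constraint(**merged)

def IsPerson():        return Constraint(type="/people/person")
def IsActor():         return Constraint(profession="Actor")
def HasKeyword(name):  return Constraint(name=name)

def interpret_actor(matched_tokens):
    # A person, working as an actor, identified by the matched name.
    return IsPerson() + IsActor() + HasKeyword(" ".join(matched_tokens))

actor = interpret_actor(["Harrison", "Ford"])
# actor.attrs == {'type': '/people/person', 'profession': 'Actor',
#                 'name': 'Harrison Ford'}
```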
  • #24: The regex for this question is more complex because we allow several different forms like the ones shown at the bottom. We allow several synonymous verbs to be used, like star vs. act vs. appear. We also allow synonyms like film and movie. Note that it's clearer to write this by defining intermediate regular expressions, but no Particle definition is needed if you don't want to capture the word used. There are possibly more ways to ask this question, but once you figure those out it's pretty easy to add them to the pattern. The pattern you see here is a simplified version of the pattern you'll find in the demo we have in the github repo; I simplified it to make it shorter to read.
  • #25: Once you've captured the actor, you just need to define, using the DSL, how to answer the query. Note that the definition here is very readable: we find performance objects referring to the matched actor, then we find movies with those performances, and then we find the names of those movies. Again, I described this sequentially, but you're actually describing declaratively how to build a query.
  • #26: Quepy also provides some tools to help you with the boilerplate, which are not very interesting to describe, but I just wanted you to know that they are there. There's the concept of a Quepy app, which is a Python module where you fill out the DSL, question templates, settings like whether you want SPARQL or MQL, etc. Once you have that, you can import that Python module with `quepy.install` and get the query for a natural language question ready to send to your database.
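Based on the Quepy documentation, using an installed app looks like the snippet below; it is guarded with a try/except so the sketch degrades gracefully where Quepy and its demo apps are not installed:

```python
# How a Quepy app is used, per the project's docs: install the app
# package, then ask it to translate a question into a query string.
try:
    import quepy
    dbpedia = quepy.install("dbpedia")  # the dbpedia demo app
    target, query, metadata = dbpedia.get_query("What is a blowtorch?")
    # `query` is a SPARQL string ready to send to the endpoint.
except ImportError:
    query = None  # quepy (or the demo app) not available here
```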
  • #27: As you have seen, the approach we've used for the problem is very simple, but it has some good properties I'd like to highlight. The first one, which is very important for us as a company that needs to build products based on this tool, is that you can add effort incrementally and get results that benefit the application, so it's very low risk. This is different from machine learning or statistical approaches, where you can spend a lot of project time building a model and you might end up hitting gold, or you might end up with something that adds zero visible results to a product. So, as much as we love machine learning where I work, we refrained from using it, getting something that's not state of the art in terms of coverage, but is a very safe approach, which is of great value when interacting with customers.
  • #28: Another good thing about this is that extending or improving it requires work that can be done by a developer who doesn't need a strong linguistic specialization. So it's easy to get a large team working on improving an application. And many people can work at the same time, because question templates are really modular and not an opaque construct like machine learning models. This approach works well in domain-specific databases, where there's a limited number of relationships relevant within the data. For very general databases like Freebase and DBpedia, if you want to answer general questions, you will find that users start making up questions that fall outside your question templates.
  • #29: And that's also one of the weaknesses of this. If you have a general database, you'll have an explosion in the number of relevant queries and templates, which starts to produce problems with contradicting rules. Note that the limit here is not the number of entities in your dataset, but the number of relationships between them. The way this idea works also makes it a bit hard to integrate computation or deduction. The latter can be partly solved by using knowledge databases that have some deduction built in and apply it when they get a query, so it's something that you can work around.
  • #30: Something that's a limit of the implementation, but could be improved, is the performance of the conversion. What we have is something that works for us in contexts where we don't have many queries in a short time, but it would need some improvements if you want to provide a service available to a wide public. The last point that can be a limitation is the need for a structured database, which is something one doesn't always have access to. We actually built Quepy as a component of a larger project, but we're also working on the other side of this problem with a tool called iepy.
  • #32: So that's all I have. I'll take a few questions, and of course you can get in touch with me later today or online for more information about this and other related work. Thanks for listening, and thanks to the people organizing this great conference.