Foundational Research
Propelled by Text Analytics
Benny Kimelfeld
LogicBlox
Preamble
• Myself:
– Ph.D. @ HebrewU (DB uncertainty + search)
– IBM Almaden (DB theory, IR, Text Analytics)
– LogicBlox (ML in DB, Prob. Programming)
– Technion IL (Associate Prof., next year)
• This talk:
– Infrastructure for text analytics
+ DB theory, formal languages, NLP, data mining, computational complexity, …
2
Outline
• Text Analytics in the Big Data Era
• Information Extraction Systems & Formalism
• Foundational Research Challenges
• Conclusions and Outlook
3
Text Analytics Matters
Some important applications are based on the
analysis of text-centric data; for example:
Semantic Search
Semantic understanding & indexing of
content to better match user's intent
Life-Science Mining
Extract knowledge bases from
scientific publications
e-Commerce
Comparison Shopping extracts &
compares inventory from online sources
CRM / BI
Monitor customer’s social-media activity
for sentiment & business leads
Log Analysis
Summarize, visualize and analyze logs
produced by machines
4
Database Management Systems
• Old news: Data management is involved!
– Data semantics, query/analysis semantics, storage,
query evaluation, indices, consistency, transactions,
backup, privacy, recovery, …
– From-scratch engineering is highly challenging
• Motivation for the concept of a general-purpose
Database Management System
– Most notably: relational model (pioneered by Edgar F.
Codd in 1969) and SQL
5
“Big Data” Phenomena
• Past: Proprietary data in orgs. (enterprises, governments, …)
– Present: Proliferation of publicly open data sources (Web, social, …)
• Past: Massive-data analyses incurred high machinery/personnel cost
– Present: Business models (cloud, crowd, open source) facilitate analyses
• Past: Data structured/controlled by admins, e-forms, software, …
– Present: Uncontrolled data from humans’ free text, heterogeneous kbs, …
• Past: Analyses by specialized teams of heavily trained experts
– Present: Analyses by a wide community featuring a wide range of skills
6
“By 2018, the United States alone could face a shortage of 140,000
to 190,000 people with deep analytical skills as well as 1.5 million
managers and analysts with the know-how to use the analysis of
big data to make effective decisions.”
“Big data: The next frontier for innovation, competition, and productivity”
McKinsey Report, May 2011
We need dev. & management systems to
facilitate value extraction from Big Data
by a wide range of users / skills
7
Core Task: Information Extraction (IE)
“Information Extraction (IE) is the name given to any process
which selectively structures and combines data which is found,
explicitly stated or implied, in one or more texts. The final
output of the extraction process varies; in every case, however, it
can be transformed so as to populate some type of database.”
J. Cowie and Y. Wilks, Handbook of
Natural Language Processing, 2000
“Information extraction is the identification, and consequent or concurrent
classification and structuring into semantic classes, of specific
information found in unstructured data sources, such as natural language
text, making the information more suitable for information processing tasks.”
M. F. Moens, Information Extraction: Algorithms
and Prospects in a Retrieval Context, 2006
In short: data-in-text (unstructured) → data-in-db (structured)
8
Popular Classes of IE Tasks
• Named Entity Recognition
From September 1936 to July 1938,
Turing spent most of his time studying
under Church at Princeton University.
In June 1938, he obtained his PhD
from Princeton.
Labels: Turing (person), Church (person), Princeton University (organization), Princeton (organization)
9
Popular Classes of IE Tasks
AdvisedBy
WorksIn
From September 1936 to July 1938,
Turing spent most of his time studying
under Church at Princeton University.
In June 1938, he obtained his PhD
from Princeton.
• Named Entity Recognition
• Relation Extraction
10
Popular Classes of IE Tasks
From September 1936 to July 1938,
Turing spent most of his time studying
under Church at Princeton University.
In June 1938, he obtained his PhD
from Princeton.
Graduation
Where?
Who?
• Named Entity Recognition
• Relation Extraction
• Event Extraction
11
Popular Classes of IE Tasks
From September 1936 to July 1938,
Turing spent most of his time studying
under Church at Princeton University.
In June 1938, he obtained his PhD
from Princeton.
Education
Start End
Graduation
When?
• Named Entity Recognition
• Relation Extraction
• Event Extraction
• Temporal IE
12
Popular Classes of IE Tasks
From September 1936 to July 1938,
Turing spent most of his time studying
under Church at Princeton University.
In June 1938, he obtained his PhD
from Princeton.
SameEntity
SameEntity
• Named Entity Recognition
• Relation Extraction
• Event Extraction
• Temporal IE
• Coreference Resolution
13
IE Paradigms: Rules & Statistics
[Figure: “Implementations of Entity Extraction”. NLP papers (2003-2012): 3.5% rule-based, 21% hybrid, 75% machine-learning based. Commercial vendors (2013): large vendors 67% rule-based, 17% hybrid, 17% machine-learning based; all vendors 45% rule-based, 22% hybrid, 33% machine-learning based.]
• Rules
• ML classification
• Probabilistic graphical models
• Soft logic
[Chiticariu, Li, Reiss, EMNLP’13]
• EMNLP, ACL, NAACL, 2003-2012
• 54 industrial vendors (Who’s
Who in Text Analytics, 2012)
“[…] rules are effective,
interpretable, and are easy
to customize by non-experts
to cope with errors.”
Gupta & Manning, CONLL’14
14
+
NLP
Outline
• Text Analytics in the Big Data Era
• Information Extraction Systems & Formalism
• Foundational Research Challenges
• Conclusions and Outlook
15
Xlog: Datalog for IE
• Extension of (non-recursive) Datalog
• Use case: DBLife (db research kb: dblife.cs.wisc.edu)
• Data types: string, document, span
– Focus on single-document programs
• “Procedural predicates” (p-predicates) are user-defined
functions that produce relations over spans
– Example: sentence(doc, span)
• Query-plan optimization
[Shen, Doan, Naughton, Ramakrishnan, VLDB 2007]
Kaspersky Lab CEO Eugene Kaspersky said Intel CEO Paul
Otellini and the Intel board had no idea what they were in for when
the company announced it was acquiring McAfee on August 19,
2010.
Same string, different spans
Span [42,47)
16
Xlog Example
“Declarative Information Extraction using Datalog with Embedded Extraction Predicates”
[Shen, Doan, Naughton, Ramakrishnan, VLDB 2007]
[Figure: Xlog program annotated with its building blocks: a regex (string), a unary regex formula, and a binary regex formula]
17
Instaread: Datalog + NLP
• Datalog syntax
– Types: string, span
• Built-in collection of p-predicates
– Various types of built-in regex formulas
– Linguistic: deep parsing, coreference resolution, named-entity extractor
[Figure: example program annotated with binary regex formulas and unary regex formulas]
[Hoffmann, 2012]
18
IBM SystemT: SQL for IE
• Engine for AQL: SQL-like declarative IE lang.
– AQL = Annotation Query Language
• SystemT = AQL + Runtime + Dev. Tooling
– [Chiticariu et al., ACL 2010]: position SystemT as a
high-quality and high-efficiency IE solution
– System and IDE demos in ACL 2011, SIGMOD 2011
• Commercial product, high academic presence
– Integration on public financial records [Hernández et al., EDBT’13,
Balakrishnan et al. SIGMOD’10], NER [Chiticariu et al. EMNLP’10,
ACL’10, Nagesh et al. EMNLP’12, Roy et al. SIGMOD’13], IR [Zhu et
al. WWW’10, K et al. SIGIR’12, CIKM’12], sentiment analysis [Hu et
al., Interact’13], social media [Sindhwani et al., IBM Journal 2011]
19
SystemT’s AQL Example
[Chiticariu, Krishnamurthy, Li, Raghavan, Reiss, Vaithyanathan, ACL 2010]
[Figure: AQL program annotated with its building blocks: unary regex formulas, regex + join w/ previous views, projection, union, and cleaning]
20
Formal Framework
• Repeated concept: Extend a relational query
language with text transducers (p-predicates,
usually regex formulas)
• Research challenge: theoretical underpinnings
of this combined document/relation model
• Expressive power
– Query-plan optimization: Can we rewrite an operator via “easier”
building blocks?
– System extensions: Can we express a new operation using
existing ones, or prove impossibility?
• Next: a formal framework
– With Fagin, Reiss, Vansummeren, PODS’13, JACM
21
22
Terminology
Kaspersky Lab CEO Eugene Kaspersky said Intel CEO Paul
Otellini and the Intel board had no idea what they were in for when
the company announced it was acquiring McAfee on August 19,
2010.
Company                  CEO                         CompanyCEO
[1,14) (Kaspersky Lab)   [19,36) (Eugene Kaspersky)  [1,36)
[42,47) (Intel)          [52,65) (Paul Otellini)     [42,65)
Callouts: Document (the text above); Span [52,65); Relation over spans from the document (the table above)
Document Spanners
Document d Relation over the spans of d
Kaspersky Lab CEO Eugene
Kaspersky said Intel CEO Paul Otellini
and the Intel board had no idea what
they were in for when the company
announced it was acquiring McAfee
on August 19, 2010.
x y z
[1,14) [30,36) [1,36)
[42,47) [52,65) [42,65)
[102,110) [115,125) [102,125)
Document Spanner: a function that maps every
doc. (string) into a relation over the doc.’s spans
More formally:
• Finite alphabet Σ of symbols
• A spanner maps each doc. d ∈ Σ* into a relation over the spans [i,j) of d
• The relation has a fixed signature (set of attributes)
− The attributes come from an infinite domain of variables x, y, z, …
23
Spanners as Regex Formulas
• Regular expression with embedded variables
• Examples (ordinary regex with embedded span variables x, w, z, y):
– .* x{dddd} .*
– .* in w{Alabama | Alaska | Arizona | …} .*
– (.* z{[A-Z][a-z]*, y{[A-Z][a-z]*}} .*) | …
• Restriction: each “evaluation” (parse tree) assigns
one span to each variable (see [Fagin et al., PODS’13])
Representation system for spanners
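A minimal sketch of how a unary regex formula such as .* x{dddd} .* can be evaluated as a spanner. It assumes that d stands for a digit and uses the slides' 1-based, half-open span convention [i,j); the function name four_digit_spans is hypothetical, and Python's re module stands in for a real regex-formula engine.

```python
import re

def four_digit_spans(doc):
    """Spanner for the regex formula  .* x{dddd} .*  (d assumed to mean a digit).

    Returns the set of spans [i, j), 1-based and half-open, that x can take.
    A lookahead is used so that overlapping matches are all reported, which
    mirrors "one tuple per evaluation" of the formula.
    """
    return {(m.start(1) + 1, m.end(1) + 1)
            for m in re.finditer(r"(?=(\d{4}))", doc)}

def span_text(doc, span):
    """Substring of doc covered by a 1-based half-open span."""
    i, j = span
    return doc[i - 1:j - 1]

if __name__ == "__main__":
    doc = "From 1936 to 1938"
    for s in sorted(four_digit_spans(doc)):
        print(s, span_text(doc, s))
```

Note that the spanner returns spans rather than strings, so two occurrences of the same string yield two distinct tuples.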
24
Spanners as Datalog w/ Regex
• Non-recursive Datalog (NR-Datalog)
• Operate over a document (not a relational db)
Token(x) := [ (ε | .*_) x{[a-zA-Z]+} ( ((,∨_) .*) | ε) ]
State(x) := Token(x) , [.* x{Georgia|Virginia|Washington}.*]
Cap1st(x) := Token(x) , [.* x{[A-Z].*}.*]
CommaSp(x,y,z) := [.* z{x{.*} ,_ y{.*}}.*]
Loc(z) := CommaSp(x,y,z) , Cap1st(x) , State(y)
RETURN(x,z) := Cap1st(x) , [.*x{.*}_from_z{.*}.*] , Loc(z)
(RETURN is the query goal)
Document: Carter_from_Plains,_Georgia,_Washington_from_Westmoreland,_Virginia
x                     z
[1,7) Carter          [13,28) Plains,_Georgia
[30,40) Washington    [46,68) Westmoreland,_Virginia
EDBs = Spanners!
Another representation system for spanners
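The program above can be evaluated by brute force over all spans of the document. The following is a sketch under stated assumptions, not the evaluation strategy of any of the systems mentioned: the helper names (unary, comma_sp, run) are hypothetical, the garbled symbol in Token is read as "comma or underscore", and Python's re.fullmatch is used to check the parts of the document before, inside, and after a span.

```python
import re

DOC = "Carter_from_Plains,_Georgia,_Washington_from_Westmoreland,_Virginia"

def text(d, span):
    """Substring covered by a 1-based half-open span [i, j)."""
    i, j = span
    return d[i - 1:j - 1]

def unary(d, before, inside, after):
    """Spanner of the regex formula [before x{inside} after]."""
    n = len(d)
    return {(i, j)
            for i in range(1, n + 2) for j in range(i, n + 2)
            if re.fullmatch(before, d[:i - 1])
            and re.fullmatch(inside, d[i - 1:j - 1])
            and re.fullmatch(after, d[j - 1:])}

def comma_sp(d):
    """CommaSp(x,y,z) := [.* z{x{.*} ,_ y{.*}} .*] as (x, y, z) triples."""
    out = set()
    for k in range(len(d) - 1):
        if d[k:k + 2] == ",_":                      # comma at 1-based position k+1
            for i in range(1, k + 2):               # x = [i, k+1)
                for j in range(k + 3, len(d) + 2):  # y = [k+3, j)
                    out.add(((i, k + 1), (k + 3, j), (i, j)))
    return out

def run(d):
    """Evaluate the non-recursive program bottom-up and return RETURN(x, z)."""
    token = unary(d, r"|.*_", r"[a-zA-Z]+", r"[,_].*|")
    state = token & unary(d, r".*", r"Georgia|Virginia|Washington", r".*")
    cap1st = token & unary(d, r".*", r"[A-Z].*", r".*")
    loc = {z for (x, y, z) in comma_sp(d) if x in cap1st and y in state}
    # [.* x{.*} _from_ z{.*} .*]: z starts right after "_from_" following x.
    return {(x, z) for x in cap1st for z in loc
            if d[x[1] - 1:x[1] + 5] == "_from_" and z[0] == x[1] + 6}

if __name__ == "__main__":
    for x, z in sorted(run(DOC)):
        print(x, text(DOC, x), z, text(DOC, z))
```

Joins on a shared variable become set intersections (or filters) over spans, which is why the regex formulas play the role of the extensional relations (EDBs).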
25
Spanners as Automata
[Figure: three automata over {0,1}, each shown on the input 1 0 0 1 1 1 0 1. An ordinary NFA; a var-stack automaton, whose extra transitions open a variable (x{, y{) or close the most recently opened one (}), marking the input as y{x{ } }; and a var-set automaton, whose extra transitions open a variable (x{, y{) or close a named variable (}x, }y), marking the input as y{x{ }x }y.]
• In an accepting run, each variable opens and later closes exactly once
⇒ Each accepting run defines an assignment to the variables
• Nondeterministic ⇒ multiple accepting runs ⇒ multiple tuples
Another representation system for spanners
26
Study of Expressive Power
• Spanners definable by regex formulas = spanners definable by var-stack automata
• Spanners definable by var-set automata = spanners definable by NR Datalog w/ regex formulas
[Figure: each class is illustrated by the var-stack automaton, the var-set automaton, the regex formula .*x{.*}_from_z{.*}.* and the NR-Datalog program from the preceding slides]
27
Consequences
• Connections between Datalog+regex
spanners and other language formalisms
– Classic string relations [Berstel 79]
– Graph queries (CRPQs) [Cruz et al. 87]
• Extension with string equality & difference
– Expressiveness / closure properties
• Principles for cleaning inconsistencies
– Follow up work [PODS’14]
– Next…
28
Outline
• Text Analytics in the Big Data Era
• Information Extraction Systems & Formalism
• Foundational Research Challenges
• Conclusions and Outlook
29
Next, highlight 3 lines of foundational research that
were motivated by our work on text analytics:
1. Database inconsistency w/ repair priorities
2. Frequent subgraph mining
3. Update propagation
30
Cleaning IE Inconsistencies
• Extractors may produce inconsistent results
– Data artifacts
– Developer limitations
• Rather than repairing the existing extractors,
common practice is to clean (intermediate) results
– SystemT “consolidators” [Chiticariu et al.10]
– GATE/JAPE “controls” [Cunningham 02]
– Implicit in other rule systems, e.g., WHISK [Soderland 99]
– POSIX regex disambiguation [Fowler 03]
[Figure: the text “33 Martin Luther King Jr. Dr., SE, Atlanta, GA 30303” annotated with the overlapping spans Person1, Person2 and Address1]
31
SystemT Consolidators
[Chiticariu, Krishnamurthy, Li, Raghavan, Reiss, Vaithyanathan, ACL 2010]
Other policies
built in
32
Five GATE/JAPE Controls
Controls: All, Once, First, Appelt, Brill
Document spanner: .* x{dd+} .*
Document: “Sequence 12345 and sequence 12.”
[Screenshots from GATE UI]
33
Cleaning via Prioritized Repairs
• Problem: existing policies are ad-hoc; how to
expose a language for user declaration?
• [Fagin, K, Reiss, Vansummeren 2014]: spanner
formalism for declarative cleaning
• Key: prioritized repairs [Staworko, et al. 12]
• Idea: Extend extraction programs with
– Denial constraints: which facts are in conflict?
– Priority declarations: preference between facts
• Captures SystemT, GATE, WHISK, POSIX, …
• We are now trying to improve our understanding
of prioritized repairs…
34
Prioritized Repairs: Definition
Inconsistent database instance:
– Database: collection of facts
– Denial constraints: which sets of facts cannot co-exist?
– Priority relation: binary “is preferred to” relation
• [Arenas, Bertossi, Chomicki 99]: Inconsistent DB
represents a set of (equally likely) “repairs”
– Then we can ask for the “possible” or “consistent” query answers
• [Staworko, Chomicki, Marcinkowski 12] add priorities:
– Let A and B be two consistent subsets of the database
– Say that A improves B if we can obtain A from B by a
“profitable” exchange of facts (precision later…)
– A repair is a consistent subset that cannot be improved
35
Example
professor  university  city
Monica     ubiobio     Concepción
Monica     carleton    Ottawa
Jorge      uchile      Santiago
Jorge      ubiobio     Santiago
Pablo      uchile      Santiago
Violated constraints (functional dependencies):
• professor → university, city (“key constraint”)
• university → city
[Figure: the same table shown again with the “ordinary” repairs [Arenas et al. 99] highlighted]
Tuple priority ⇒ some repairs can be discarded [Staworko et al.]
36
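The “ordinary” repairs of this example can be enumerated by brute force. The sketch below assumes the usual reading of a repair as a maximal consistent subset; the function names are hypothetical, and the exponential enumeration is only meant for a five-tuple illustration.

```python
from itertools import combinations

FACTS = [
    ("Monica", "ubiobio", "Concepción"),
    ("Monica", "carleton", "Ottawa"),
    ("Jorge", "uchile", "Santiago"),
    ("Jorge", "ubiobio", "Santiago"),
    ("Pablo", "uchile", "Santiago"),
]

def in_conflict(s, t):
    """True if the two facts jointly violate one of the two FDs."""
    # professor -> university, city
    key_violation = s[0] == t[0] and (s[1], s[2]) != (t[1], t[2])
    # university -> city
    city_violation = s[1] == t[1] and s[2] != t[2]
    return key_violation or city_violation

def consistent(facts):
    """A set of facts is consistent if no two of its facts conflict."""
    return not any(in_conflict(s, t) for s, t in combinations(facts, 2))

def repairs(facts):
    """All maximal consistent subsets (exponential; fine for tiny inputs)."""
    subsets = [frozenset(c)
               for k in range(len(facts) + 1)
               for c in combinations(facts, k)
               if consistent(c)]
    return [s for s in subsets if not any(s < t for t in subsets)]

if __name__ == "__main__":
    for r in repairs(FACTS):
        print(sorted(r))
```

Each repair keeps the Pablo tuple, since it conflicts with nothing; priorities between conflicting tuples would then discard some of the remaining repairs.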
Complexity of Testing Improvability
Can a consistent subset be improved?
A improves B if we get A from B by removing tuples & adding tuples, where each removed tuple is less preferred than some added tuple
Theorem:
• In the case of a single functional dependency or two keys per relation, improvability can be tested in polynomial time
• In any other combination of FDs, the problem is NP-complete!
Example relation with two keys:
university  faculty    dean
UChile      Economics  Agosin
Technion    CS         Yavneh
Stanford    Law        Magill
37
Recent work (unpublished) w/ Fagin & Kolaitis
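The improvement relation stated on this slide can be written down directly. This is a sketch of that informal definition only, assuming the priority relation is given as a set of (better, worse) pairs and that both subsets are already known to be consistent; the names improves and preferred are hypothetical.

```python
def improves(a, b, preferred):
    """Does subset a improve subset b?

    a, b: sets of facts, both assumed consistent.
    preferred: set of pairs (better, worse) over facts.
    a improves b if a differs from b and every fact removed from b
    is less preferred than some fact added in a.
    """
    removed, added = b - a, a - b
    if not removed and not added:
        return False  # identical subsets: nothing was exchanged
    return all(any((x, r) in preferred for x in added) for r in removed)

if __name__ == "__main__":
    # Two conflicting facts, t1 preferred to t2.
    preferred = {("t1", "t2")}
    print(improves({"t1"}, {"t2"}, preferred))  # exchange t2 for t1
    print(improves({"t2"}, {"t1"}, preferred))  # the reverse exchange
```

Testing whether a given subset can be improved means searching for such an a among all consistent subsets, which is where the complexity results on this slide come in.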
IE with Recurring Patterns
I want to buy my advisor a gift.
I really want to buy a gift to my advisor.
I want to buy a gift to the secretary and to my advisor.
1. Apply dependency parsing
38
[Zhang, Baldwin, Ho, K, Li, ACL13]: Restoring grammar in social media, sms, etc.
IE with Recurring Patterns
I want to buy my advisor a gift.
I really want to buy a gift to my advisor.
I want to buy a gift to the secretary and to my advisor.
1. Apply dependency parsing
2. Find freq. recurring patterns
[Figure: recurring pattern over the nodes I, want, buy, gift, advisor]
39
[Zhang, Baldwin, Ho, K, Li, ACL13]: Restoring grammar in social media, sms, etc.
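Maximal Frequent Subgraphs
[Figure: input graphs g1, g2, g3, g4 and a threshold of 3; candidate subgraphs are marked as frequent (“Freq.”) or as frequent and maximal (“Freq. Max.”)]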
Complexity Study
• Naturally, there has been a lot of work on this problem
– SPIN [Huan et al. 04], MARGIN [Thomas et al. 10], …
• But little was known about the computational complexity
• Studied: impact of assumptions on comp. complexity
– Graph properties (e.g., trees, treewidth, etc.)
– Label repeatability
– Bounded #results desired
– Bounded threshold
• This work led to novel complexity results and new
methodologies for mining maximal subgraphs
– [K & Kolaitis, ACM PODS’13, ACM TODS]
• Next, some complexity nuggets
41
Complexity Nuggets
• Good news: If labels do not repeat in each input
graph, then there are PTime solutions when
– The threshold is bounded; or
– Graphs are trees & few results are desired
• In general graphs w/o label repetition, you can
find 2 results in PTime
– Bad news: But finding 3rd is NP-hard!
– Bad news: And if labels repeat and graphs are
trees, then finding 2nd is already NP-hard!
• Even for a bounded threshold
42
Improving Dictionaries w/ Feedback
[Figure: IE programs extract company occurrences and address occurrences from text fragments (sentences, tables, rows, …) in Web data, using dictionaries (companies, countries, …); a join pairs them, e.g., (IBM, San Jose), (Apple, Cupertino), (IBM, Armonk), (Yahoo!, Cupertino), …]
Goal: automatically suggest a “good” fix to the IE program, where “good” = small effect on other results
View Updates
• View-update problem: Translate an update on a view to
an update on the base relations
• Deletion propagation as a special case
– Update is delete(a set of view tuples)
• Motivation:
– Classic: database/view maintenance
• DB access only through views, hidden join keys, etc.
– Debugging
• [K&al.12]: deletion propagation for debugging text extractors
– Database causality [Meliou&al.10]
• Intuition: good propagation provides a good explanation of why we
have the tuples to begin with
• [Bertossi, Salimi 14]: “Unifying Causality, Diagnosis,
Repairs and View-Updates in Databases”
44
Example: File Access
UserGroup
user    group
Emma    ai
Emma    db
Olivia  os
Olivia  db
Jacob   ai
GroupFile
group  file
ai     a.txt
ai     b.txt
db     a.txt
db     b.txt
os     a.txt
Access = UserGroup ⋈ GroupFile
Access(u,f) :– UserGroup(u,g), GroupFile(g,f)
Access
user    file
Emma    a.txt
Emma    b.txt
Olivia  a.txt
Olivia  b.txt
Jacob   a.txt
Jacob   b.txt
Delete source rows, s.t. Emma won’t access a.txt.
But, maintain maximum access permissions!
[Cui&Widom01; Buneman&al.02]
45
Example: File Access
Decision variant is NP-complete [Buneman et al. 02]
47
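The file-access example is small enough to solve exhaustively. The sketch below tries every set of source rows to delete and keeps one that removes (Emma, a.txt) from the view while losing as few other view tuples as possible; the function names are hypothetical, and the exhaustive search is only an illustration, since the decision variant is NP-complete in general.

```python
from itertools import combinations

USER_GROUP = {("Emma", "ai"), ("Emma", "db"), ("Olivia", "os"),
              ("Olivia", "db"), ("Jacob", "ai")}
GROUP_FILE = {("ai", "a.txt"), ("ai", "b.txt"), ("db", "a.txt"),
              ("db", "b.txt"), ("os", "a.txt")}

def access(user_group, group_file):
    """Access(u,f) :- UserGroup(u,g), GroupFile(g,f)."""
    return {(u, f) for (u, g) in user_group
            for (g2, f) in group_file if g == g2}

def best_deletion(user_group, group_file, unwanted):
    """Return (side_effect, deleted_rows) with minimal side effect."""
    view = access(user_group, group_file)
    rows = ([("UserGroup", r) for r in sorted(user_group)] +
            [("GroupFile", r) for r in sorted(group_file)])
    best = None
    for k in range(len(rows) + 1):
        for deleted in combinations(rows, k):
            ug = user_group - {r for t, r in deleted if t == "UserGroup"}
            gf = group_file - {r for t, r in deleted if t == "GroupFile"}
            new_view = access(ug, gf)
            if new_view & unwanted:
                continue  # an unwanted view tuple survives
            side_effect = len((view - unwanted) - new_view)
            if best is None or side_effect < best[0]:
                best = (side_effect, deleted)
    return best

if __name__ == "__main__":
    print(best_deletion(USER_GROUP, GROUP_FILE, {("Emma", "a.txt")}))
```

Here a mixed deletion, one row from each source relation, avoids any side effect, whereas deleting from a single relation always removes some other access permission.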
Trichotomy in Complexity
Fix a schema (w/ FDs) and a CQ w/o self joins.
What is the complexity of finding a solution with a minimal side effect?
We have established a precise (easily testable) criterion
that partitions all cases into 3 categories:
1. The problem is solvable in PTime, and even via a
straightforward algorithm [Buneman et al. 2002]
2. The problem is NP-hard, but constant-ratio
approximable in PTime (ILP relaxation)
3. The problem is inapproximable for every ratio
[K, Vondrak, Williams, Woodruff, PODS11, PODS12, TODS12, VLDB14]
48
Outline
• Text Analytics in the Big Data Era
• Information Extraction Systems & Formalism
• Foundational Research Challenges
• Conclusions and Outlook
49
Summary
• Text analytics & IE
• Rule systems for IE
• A formal framework for rules, relating IE to
traditional DB concepts such as Datalog
• Research directions motivated by IE
– Prioritized repairs
– Graph mining
– Update propagation
50
Outlook: DB w/ Deep Text Support
• We need a uniform & elegant data/query model to
combine structured data & text; usefulness for querying
both text and relations
• We need a principled, simple & transparent probability
model + effective quality + practical execution cost
• We need to balance between automation and control:
from full specification by experts to feature generation for
inexperienced users
– Maximally realize the potential of every developer!
– LogicBlox is working on incorporating ML in Datalog!
51
BACKUP SLIDES
52
Room for Both
Statistical
Solution
Rule
System
Feature Engineering
Model Space, Runtime
Cleaning + Post Proc.
Cleaning + Post Proc.
Building blocks
(e.g., dictionaries, NER)
“What doesn’t work: Anything requiring high
precision and full automation”
Feldman & Ungar, KDD’08 tutorial on text mining
53
String DB, Spanners, Interval Algebra
String Databases
– Atomic value: string
– Example tuples: (Kaspersky, Kaspersky), (Intel, Otellini), (IBM, Rometty)
– Join by string conditions (e.g., x is a substring of y)
– Apps: text predicates in DBs [Grahne & al. 99] [Benedikt & al. 03], string manipulation [Bonner & Mecca 98] [Ginsburg and Wang 98]
Spanners
– Atomic value: span (pointing to doc)
– Example tuples: ([10,20), [16,26)), ([32,37), [50,58)), ([105,108), [121,128))
– Join by interval+string conditions (e.g., x a token in y)
– App: IE
Interval Algebra
– Atomic value: interval (no text)
– Example tuples: ([10,20), [16,26)), ([32,37), [50,58)), ([105,108), [121,128))
– Join by interval conditions (e.g., x is a sub-interval of y)
– Apps: temporal reasoning [Allen 83] [Vilain & Kautz 86] [Nebel & Bürckert 95] [Krokhin et al. 03]
54
55
Imp. 1: Connection to Known Concepts
• Connection to Recognizable Relations [Berstel 79]
– These are unions of cross products of regular languages
– THM: The class of regular spanners is closed under
a string-selection predicate iff the predicate is a
recognizable relation
• Connection to CRPQs [Cruz et al. 87]
– Conjunctive Regular Path Queries have been studied as a
query language for labeled graphs
– THM: Regular spanners have the same expressive
power as unions of CRPQs on paths “with marked
endpoints”
• Up to some simple and necessary adaptation between the models
[Figure: path with marked endpoints, labeled S I G M O D]
Imp. 2: Adding String Equality
• Regular Spanners: NR Datalog w/ regex formulas
• Regularstr= Spanners: Regular Spanners + String-equality predicate (+substring-of, prefix-of, …)
Document: “…application from Jane Doe, social 012-345-6789, on Mar 20th… identified as John Doe, 012-345-6789, ask us to…”
NameSSN(x,y) := …
SameSSN(x1,x2) := NameSSN(x1,y1) , NameSSN(x2,y2) , str(y1)=str(y2)
x1                    x2
[117,125) (Jane Doe)  [875,883) (John Doe)
⋮                     ⋮
Same string, different spans
56
Difference with String Equality
• Are regularstr= spanners closed under difference?
– Why should they? Only positive operators are used…
– However, regex formulas (our EDBs) can introduce
“negative” operations (NFAs closed under complement)
• THM: The class of regular spanners is closed under
difference
• PROP: The class of regularstr= spanners is closed
under string-inequality selection
• THM: The class of regularstr= spanners is closed
under string-containment selection, but then, not
under non-string-containment selection!
• COR: The class of regularstr= is not closed under
difference
57
Formal Optimization Problem
Fixed: • Schema S w/ fun. dependencies
• Conjunctive query Q
Input: • Database instance I over S
• Set A⊆ Q(I) of answers to delete
Output: J ⊆ I s.t. Q(J) ∩ A = ∅
Goal: Minimize |(Q(I) – A) – Q(J)|
Side Effect
58
Text Analytics - JCC2014 Kimelfeld

  • 1. Foundational Research Propelled by Text Analytics Benny Kimelfeld LogicBlox
  • 2. Preamble • Myself: – Ph.D. @ HebrewU (DB uncertainty + search) – IBM Almaden (DB theory, IR, Text Analytics) – LogicBlox (ML in DB, Prob. Programming) – Technion IL (Associate Prof., next year) • This talk:  Infrastructure for text analytics + DB theory, formal languages, NLP, data mining, computational complexity, … 2
  • 3. • Text Analytics in the Big Data Era • Information Extraction Systems & Formalism • Foundational Research Challenges • Conclusions and Outlook Outline 3
  • 4. Text Analytics Matters Some important applications are based on the analysis of text-centric data; for example: Semantic Search Semantic understanding & indexing of content to better match user's intent Life-Science Mining Extract knowledge bases from scientific publications e-Commerce Comparison Shopping extracts & compares inventory from online sources CRM / BI Monitor customer’s social-media activity for sentiment & business leads Log Analysis Summarize, visualize and analyze logs produced by machines 4
  • 5. Database Management Systems • Old news: Data management is involved! – Data semantics, query/analysis semantics, storage, query evaluation, indices, consistency, transactions, backup, privacy, recovery, … – From-scratch engineering is highly challenging • Motivation to the concept of a general-purpose Database Management System – Most notably: relational model (pioneered by Edgar F. Codd in 1969) and SQL 5
  • 6. “Big Data” Phenomena • Past: Proprietary data in orgs. (enterprises, governments, …) → Present: Proliferation of publicly open data sources (Web, social, …) • Past: Massive-data analyses incurred high machinery/personnel cost → Present: Business models (cloud, crowd, open source) facilitate analyses • Past: Data structured/controlled by admins, e-forms, software, … → Present: Uncontrolled data from humans’ free text, heterogeneous KBs, … • Past: Analyses by specialized teams of heavily trained experts → Present: Analyses by a wide community featuring a wide range of skills
  • 7. “By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.” “Big data: The next frontier for innovation, competition, and productivity” McKinsey Report, May 2011 We need dev. & management systems to facilitate value extraction from Big Data by a wide range of users / skills 7
  • 8. Core Task: Information Extraction (IE) “Information Extraction (IE) is the name given to any process which selectively structures and combines data which is found, explicitly stated or implied, in one or more texts. The final output of the extraction process varies; in every case, however, it can be transformed so as to populate some type of database.” J. Cowie and Y. Wilks, Handbook of Natural Language Processing, 2000 “Information extraction is the identification, and consequent or concurrent classification and structuring into semantic classes, of specific information found in unstructured data sources, such as natural language text, making the information more suitable for information processing tasks.” M. F. Moens, Information Extraction: Algorithms and Prospects in a Retrieval Context, 2006 In short: data-in-text (unstructured) → data-in-db (structured)
  • 9. Popular Classes of IE Tasks • Named Entity Recognition From September 1936 to July 1938, Turing spent most of his time studying under Church at Princeton University. In June 1938, he obtained his PhD from Princeton. person person organization organization 9
  • 10. Popular Classes of IE Tasks AdvisedBy WorksIn From September 1936 to July 1938, Turing spent most of his time studying under Church at Princeton University. In June 1938, he obtained his PhD from Princeton. • Named Entity Recognition • Relation Extraction 10
  • 11. Popular Classes of IE Tasks From September 1936 to July 1938, Turing spent most of his time studying under Church at Princeton University. In June 1938, he obtained his PhD from Princeton. Graduation Where? Who? • Named Entity Recognition • Relation Extraction • Event Extraction 11
  • 12. Popular Classes of IE Tasks From September 1936 to July 1938, Turing spent most of his time studying under Church at Princeton University. In June 1938, he obtained his PhD from Princeton. Education Start End Graduation When? • Named Entity Recognition • Relation Extraction • Event Extraction • Temporal IE 12
  • 13. Popular Classes of IE Tasks From September 1936 to July 1938, Turing spent most of his time studying under Church at Princeton University. In June 1938, he obtained his PhD from Princeton. SameEntity SameEntity • Named Entity Recognition • Relation Extraction • Event Extraction • Temporal IE • Coreference Resolution 13
  • 14. IE Paradigms: Rules & Statistics • Rules • ML classification • Probabilistic graphical models • Soft logic • [Chiticariu, Li, Reiss, EMNLP’13] surveyed NLP papers (EMNLP, ACL, NAACL, 2003-2012) and 54 industrial vendors (Who’s Who in Text Analytics, 2012) • [Figure: Implementations of Entity Extraction. NLP papers (2003-2012): 3.5% rule-based, 21% hybrid, 75% machine-learning based. Commercial vendors (2013), all vendors: 45% rule-based, 22% hybrid, 33% machine-learning based; large vendors: 67% rule-based, 17% hybrid, 17% machine-learning based.] • “[…] rules are effective, interpretable, and are easy to customize by non-experts to cope with errors.” Gupta & Manning, CoNLL’14 + NLP
  • 15. • Text Analytics in the Big Data Era • Information Extraction Systems & Formalism • Foundational Research Challenges • Conclusions and Outlook Outline 15
  • 16. Xlog: Datalog for IE • Extension of (non-recursive) Datalog • Use case: DBLife (db research kb: dblife.cs.wisc.edu) • Data types: string, document, span – Focus on single-document programs • “Procedural predicates” (p-predicates) are user-defined functions that produce relations over spans – Example: sentence(doc, span) • Query-plan optimization [Shen, Doan, Naughton, Ramakrishnan, VLDB 2007] Kaspersky Lab CEO Eugene Kaspersky said Intel CEO Paul Otellini and the Intel board had no idea what they were in for when the company announced it was acquiring McAfee on August 19, 2010. Same string, different spans Span [42,47) 16
  • 17. Xlog Example “Declarative Information Extraction using Datalog with Embedded Extraction Predicates” [Shen, Doan, Naughton, Ramakrishnan, VLDB 2007] Regex. (string) Unary regex formula Binary regex formula 17
  • 18. Instaread: Datalog + NLP [Hoffmann, 2012] • Datalog syntax – Types: string, span • Built-in collection of p-predicates – Various types of built-in regex formulas (unary and binary) – Linguistic: deep parsing, coreference resolution, named-entity extractor
  • 19. IBM SystemT: SQL for IE • Engine for AQL: SQL-like declarative IE lang. – AQL = Annotation Query Language • SystemT = AQL + Runtime + Dev. Tooling – [Chiticariu et al., ACL 2010]: position SystemT as a high-quality and high-efficiency IE solution – System and IDE demos in ACL 2011, SIGMOD 2011 • Commercial product, high academic presence – Integration on public financial records [Hernández et al., EDBT’13, Balakrishnan et al. SIGMOD’10], NER [Chiticariu et al. EMNLP’10, ACL’10, Nagesh et al. EMNLP’12, Roy et al. SIGMOD’13], IR [Zhu et al. WWW’10, K et al. SIGIR’12, CIKM’12], sentiment analysis [Hu et al., Interact’13], social media [Sindhwani et al., IBM Journal 2011] 19
  • 20. SystemT’s AQL Example [Chiticariu, Krishnamurthy, Li, Raghavan, Reiss, Vaithyanathan, ACL 2010] regex + join w/ previous views projection union Cleaning Unary regex formulas 20
  • 21. Formal Framework • Repeated concept: Extend a relational query language with text transducers (p-predicates, usually regex formulas) • Research challenge: theoretical underpinnings of this combined document/relation model • Expressive power – Query-plan optimization: Can we rewrite an operator via “easier” building blocks? – System extensions: Can we express a new operation using existing ones, or prove impossibility? • Next: a formal framework – With Fagin, Reiss, Vansummeren, PODS’13, JACM 21
  • 22. Terminology • Document: Kaspersky Lab CEO Eugene Kaspersky said Intel CEO Paul Otellini and the Intel board had no idea what they were in for when the company announced it was acquiring McAfee on August 19, 2010. • Span: e.g., [52,65) • Relation over spans from the document:
      Company | CEO | CompanyCEO
      [1,14) (Kaspersky Lab) | [19,36) (Eugene Kaspersky) | [1,36)
      [42,47) (Intel) | [52,65) (Paul Otellini) | [42,65)
  • 23. Document Spanners • Document Spanner: a function that maps every doc. (string) into a relation over the doc.’s spans • Document d: Kaspersky Lab CEO Eugene Kaspersky said Intel CEO Paul Otellini and the Intel board had no idea what they were in for when the company announced it was acquiring McAfee on August 19, 2010. • Relation over the spans of d:
      x | y | z
      [1,14) | [30,36) | [1,36)
      [42,47) | [52,65) | [42,65)
      [102,110) | [115,125) | [102,125)
    More formally: • Finite alphabet Σ of symbols • A spanner maps each doc. d ∈ Σ* into a relation over the spans [i,j) of d • The relation has a fixed signature (set of attributes) − The attributes come from an infinite domain of variables x, y, z, …
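    The definition above can be illustrated with a small Python sketch. It is only an approximation under stated assumptions: the helper names `regex_spanner` and `text_of` are hypothetical, named groups stand in for span variables, and `re.finditer` returns only non-overlapping leftmost matches, whereas a spanner in the formal sense returns a tuple for every possible match.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # half-open [i, j) with 1-based positions, as on the slides


def regex_spanner(pattern: str, variables: List[str]) -> Callable[[str], List[Tuple[Span, ...]]]:
    """Turn a regex with named groups into a function from documents to span relations."""
    compiled = re.compile(pattern)

    def spanner(doc: str) -> List[Tuple[Span, ...]]:
        rows = []
        # finditer yields only non-overlapping, leftmost matches, so this
        # under-approximates a spanner that returns every possible match.
        for m in compiled.finditer(doc):
            rows.append(tuple((m.start(v) + 1, m.end(v) + 1) for v in variables))
        return rows

    return spanner


def text_of(doc: str, span: Span) -> str:
    """Return the substring that a 1-based half-open span points to."""
    i, j = span
    return doc[i - 1:j - 1]


doc = "McAfee on August 19, 2010."
year = regex_spanner(r"(?P<x>[0-9]{4})", ["x"])  # one variable x: a four-digit run
relation = year(doc)
print(relation)
print([text_of(doc, row[0]) for row in relation])
```

    The relation holds spans (pairs of positions), not strings; the text is recovered only by looking the span up in the document, which is why the same string at two positions yields two different spans.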
  • 24. Spanners as Regex Formulas (a representation system for spanners) • Regular expression (ordinary regex) with embedded span variables • Examples: – .* x{\d\d\d\d} .* – .* in w{Alabama | Alaska | Arizona | …} .* – (.* z{[A-Z][a-z]*, y{[A-Z][a-z]*}} .*) | … • Restriction: each “evaluation” (parse tree) assigns one span to each variable (see [Fagin et al., PODS’13])
  • 25. Spanners as Datalog w/ Regex (another representation system for spanners) • Non-recursive Datalog (NR-Datalog) • Operate over a document (not a relational db) • EDBs = Spanners!
      Token(x) := [ (ε | .*_) x{[a-zA-Z]+} ( ((,|_) .*) | ε) ]
      State(x) := Token(x) , [.* x{Georgia|Virginia|Washington}.*]
      Cap1st(x) := Token(x) , [.* x{[A-Z].*}.*]
      CommaSp(x,y,z) := [.* z{x{.*} ,_ y{.*}}.*]
      Loc(z) := CommaSp(x,y,z) , Cap1st(x) , State(y)
      RETURN(x,z) := Cap1st(x) , [.*x{.*}_from_z{.*}.*] , Loc(z)    (query goal)
    Document: Carter_from_Plains,_Georgia,_Washington_from_Westmoreland,_Virginia
    Result:
      x | z
      [1,7) Carter | [13,28) Plains,_Georgia
      [30,40) Washington | [46,69) Westmoreland,_Virginia
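    The program on this slide can be traced with a minimal Python sketch. It is not an NR-Datalog engine: each rule is hand-translated into a set comprehension, the helpers `text_of` and `between` are hypothetical, and the underscore of the slide is written as an ordinary space.

```python
import re

DOC = "Carter from Plains, Georgia, Washington from Westmoreland, Virginia"


def text_of(span):
    """Substring for a 1-based half-open span [i, j)."""
    i, j = span
    return DOC[i - 1:j - 1]


def between(left, right):
    """Text strictly between the end of `left` and the start of `right`."""
    return DOC[left[1] - 1:right[0] - 1]


def token():
    """Token(x): a letter run preceded by start/space and followed by comma/space/end."""
    out = set()
    for m in re.finditer(r"[a-zA-Z]+", DOC):
        i, j = m.start(), m.end()
        starts_ok = i == 0 or DOC[i - 1] == " "
        ends_ok = j == len(DOC) or DOC[j] in ", "
        if starts_ok and ends_ok:
            out.add((i + 1, j + 1))
    return out


TOKEN = token()
# State(x) := Token(x), x spells one of the listed states
STATE = {x for x in TOKEN if text_of(x) in {"Georgia", "Virginia", "Washington"}}
# Cap1st(x) := Token(x), x starts with a capital letter
CAP1ST = {x for x in TOKEN if re.match(r"[A-Z]", text_of(x))}
# Loc(z) := CommaSp(x,y,z), Cap1st(x), State(y): z spans x, then ", ", then y
LOC = {(x[0], y[1]) for x in CAP1ST for y in STATE if between(x, y) == ", "}
# RETURN(x,z) := Cap1st(x), x followed by " from " and then z, Loc(z)
RETURN = {(x, z) for x in CAP1ST for z in LOC if between(x, z) == " from "}

for x, z in sorted(RETURN):
    print(x, text_of(x), z, text_of(z))
```

    The sketch yields the two tuples for Carter and Washington. It reports the last end position as 68, since the transcribed document has 67 characters, while the slide lists 69. Note also that Loc contains the span of "Georgia, Washington", which is filtered out only by the final join.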
  • 26. Spanners as Automata (another representation system for spanners) • [Figure: an ordinary NFA over {0,1}; a var-stack automaton, whose extra transitions open a variable (x{, y{) or close the most recently opened variable (}); a var-set automaton, whose extra transitions open a variable (x{, y{) or close a named variable (}x, }y).] • In an accepting run, each variable opens and later closes exactly once ⇒ Each accepting run defines an assignment to the variables • Nondeterministic ⇒ multiple accepting runs ⇒ multiple tuples
  • 27. Study of Expressive Power • Spanners definable by regex formulas = Spanners definable by var-stack automata • Spanners definable by var-set automata = Spanners definable by NR Datalog w/ regex formulas
  • 28. Consequences • Connections between Datalog+regex spanners and other language formalisms – Classic string relations [Berstel 79] – Graph queries (CRPQs) [Cruz et al. 87] • Extension with string equality & difference – Expressiveness / closure properties • Principles for cleaning inconsistencies – Follow up work [PODS’14] – Next… 28
  • 29. • Text Analytics in the Big Data Era • Information Extraction Systems & Formalism • Foundational Research Challenges • Conclusions and Outlook Outline 29
  • 30. Next, highlight 3 lines of foundational research that were motivated by our work on text analytics: 1. Database inconsistency w/ repair priorities 2. Frequent subgraph mining 3. Update propagation 30
  • 31. Cleaning IE Inconsistencies • Extractors may produce inconsistent results – Data artifacts – Developer limitations • Rather than repairing the existing extractors, common practice is to clean (intermediate) results – SystemT “consolidators” [Chiticariu et al. 10] – GATE/JAPE “controls” [Cunningham 02] – Implicit in other rule systems, e.g., WHISK [Soderland 99] – POSIX regex disambiguation [Fowler 03] • Example text with overlapping annotations Person1, Person2 and Address1: Martin Luther King Jr. Dr., SE, Atlanta, GA 30303
  • 32. SystemT Consolidators [Chiticariu, Krishnamurthy, Li, Raghavan, Reiss, Vaithyanathan, ACL 2010] Other policies built in 32
  • 33. Five GATE/JAPE Controls: All, Once, First, Appelt, Brill • Document spanner: .* x{\d\d+} .* • Document: “Sequence 12345 and sequence 12.” • (Screenshots from GATE UI)
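    To see why a cleaning policy is needed here, the following Python sketch enumerates every span the digit spanner could assign on the example document, then applies one simple policy. The policy shown (drop any span strictly contained in another) is only a stand-in chosen for illustration; it is not claimed to reproduce the exact semantics of any of the five JAPE controls, and the function names are hypothetical.

```python
import re

DOC = "Sequence 12345 and sequence 12."


def digit_spans(doc):
    """All 1-based half-open spans whose text is a run of two or more digits."""
    n = len(doc)
    return [
        (i + 1, j + 1)
        for i in range(n)
        for j in range(i + 2, n + 1)
        if re.fullmatch(r"[0-9]{2,}", doc[i:j])
    ]


def keep_maximal(spans):
    """Illustrative policy: discard spans strictly contained in another span."""
    def strictly_contains(outer, inner):
        return outer != inner and outer[0] <= inner[0] and inner[1] <= outer[1]

    return [s for s in spans if not any(strictly_contains(t, s) for t in spans)]


all_spans = digit_spans(DOC)
cleaned = keep_maximal(all_spans)
print(len(all_spans), all_spans)
print(cleaned)
```

    Without cleaning, the five-digit run alone contributes ten tuples (every sub-run of length at least two), plus one for the final "12"; the containment policy keeps one span per run.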
  • 34. Cleaning via Prioritized Repairs • Problem: existing policies are ad-hoc; how to expose a language for user declaration? • [Fagin, K, Reiss, Vansummeren 2014]: spanner formalism for declarative cleaning • Key: prioritized repairs [Staworko, et al. 12] • Idea: Extend extraction programs with – Denial constraints: which facts are in conflict? – Priority declarations: preference between facts • Captures SystemT, GATE, WHISK, POSIX, … • We are now trying to improve our understanding of prioritized repairs… 34
  • 35. Prioritized Repairs: Definition • Inconsistent database instance = Database (collection of facts) + Denial Constraints (which sets of facts cannot co-exist?) • Priority Relation: binary “is preferred to” relation • [Arenas, Bertossi, Chomicki 99]: Inconsistent DB represents a set of (equally likely) “repairs” ⇒ Then we can ask for the “possible” or “consistent” query answers • [Staworko, Chomicki, Marcinkowski 12] add priorities: – Let A and B be two consistent subsets of the database – Say that A improves B if we can obtain A from B by a “profitable” exchange of facts (precision later…) – A repair is a consistent subset that cannot be improved
  • 36. Example
      professor | university | city
      Monica | ubiobio | Concepción
      Monica | carleton | Ottawa
      Jorge | uchile | Santiago
      Jorge | ubiobio | Santiago
      Pablo | uchile | Santiago
    Violated constraints (functional dependencies): • professor → university, city (“key constraint”) • university → city • “Ordinary” repairs [Arenas et al. 99] • Tuple priority ⇒ some repairs can be discarded [Staworko et al.] • A improves B if we get A from B by removing tuples & adding tuples, such that for each removed tuple, some added tuple is preferred to it
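    The example can be worked through with a brute-force Python sketch. The facts and the two functional dependencies come from the slide; the fact numbering, the function names, and above all the priority relation passed in the second call are hypothetical, since the transcript does not say which tuples are preferred. "Improves" follows the wording on the slide: every removed fact must be dominated by some added fact.

```python
from itertools import combinations

# Facts from the slide, numbered 1..5 for reference (the numbering is ours).
FACTS = {
    1: ("Monica", "ubiobio", "Concepción"),
    2: ("Monica", "carleton", "Ottawa"),
    3: ("Jorge", "uchile", "Santiago"),
    4: ("Jorge", "ubiobio", "Santiago"),
    5: ("Pablo", "uchile", "Santiago"),
}


def conflict(f, g):
    """True if f and g violate professor -> university, city or university -> city."""
    (p1, u1, c1), (p2, u2, c2) = f, g
    return (p1 == p2 and (u1, c1) != (u2, c2)) or (u1 == u2 and c1 != c2)


def consistent_subsets():
    ids = sorted(FACTS)
    out = []
    for k in range(len(ids) + 1):
        for combo in combinations(ids, k):
            if all(not conflict(FACTS[a], FACTS[b]) for a, b in combinations(combo, 2)):
                out.append(frozenset(combo))
    return out


def improves(a, b, prefers):
    """a improves b: a != b and each fact removed from b is dominated by an added fact."""
    return a != b and all(any((f, g) in prefers for f in a - b) for g in b - a)


def repairs(prefers):
    """Consistent subsets that no other consistent subset improves."""
    subsets = consistent_subsets()
    return [b for b in subsets if not any(improves(a, b, prefers) for a in subsets)]


ordinary = repairs(set())                 # no priorities: maximal consistent subsets
prioritized = repairs({(1, 2), (3, 4)})   # hypothetical: fact 1 over 2, fact 3 over 4
print([sorted(r) for r in ordinary])
print([sorted(r) for r in prioritized])
```

    With an empty priority relation the definition collapses to subset-maximality, giving three ordinary repairs; under the assumed priorities only one of them survives.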
  • 37. Complexity of Testing Improvability: can a consistent subset be improved? Theorem: • In the case of a single functional dependency or two keys per relation, improvability can be tested in polynomial time • In any other combination of FDs, the problem is NP-complete! • Example of two keys: relation (university, faculty, dean) with tuples (UChile, Economics, Agosin), (Technion, CS, Yavneh), (Stanford, Law, Magill) • Recent work (unpublished) w/ Fagin & Kolaitis
  • 38. IE with Recurring Patterns I want to buy my advisor a gift. I really want to buy a gift to my advisor. I want to buy a gift to the secretary and to my advisor. 1. Apply dependency parsing 38 [Zhang, Baldwin, Ho, K, Li, ACL13]: Restoring grammar in social media, sms, etc.
  • 39. IE with Recurring Patterns I want to buy my advisor a gift. I really want to buy a gift to my advisor. I want to buy a gift to the secretary and to my advisor. I want buy gift advisor 1. Apply dependency parsing 2. Find freq. recurring patterns 39 [Zhang, Baldwin, Ho, K, Li, ACL13]: Restoring grammar in social media, sms, etc.
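  • 40. Maximal Frequent Subgraphs [Figure: input graphs g1, g2, g3, g4 with frequency threshold = 3; candidate patterns are labeled frequent (Freq.) or maximal frequent (Max.).]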
  • 41. Complexity Study • Naturally, there has been a lot of work on this problem – SPIN [Huan et al. 04], MARGIN [Thomas et al. 10], … • But little was known about the computational complexity • Studied: impact of assumptions on comp. complexity – Graph properties (e.g., trees, treewidth, etc.) – Label repeatability – Bounded #results desired – Bounded threshold • This work led to novel complexity results and new methodologies for mining maximal subgraphs – [K & Kolaitis, ACM PODS’13, ACM TODS] • Next, some complexity nuggets
  • 42. Complexity Nuggets • Good news: If labels do not repeat in each input graph, then there are PTime solutions when – The threshold is bounded; or – Graphs are trees & few results are desired • In general graphs w/o label repetition, you can find 2 results in PTime – Bad news: But finding 3rd is NP-hard! – Bad news: And if labels repeat and graphs are trees, then finding 2nd is already NP-hard! • Even for a bounded threshold 42
  • 43. Improving Dictionaries w/ Feedback [Figure: IE programs run over Web data (text fragments: sentences, tables, rows, …) with dictionaries of companies, countries, …; they produce company occurrences and address occurrences, which are joined into pairs such as IBM , San Jose; IBM , Armonk; Apple , Cupertino; Yahoo! , Cupertino; Goo…] • Auto. suggest a “good” fix to the IE program • “good” = small effect on other results
  • 44. View Updates • View-update problem: Translate an update on a view to an update on the base relations • Deletion propagation as a special case – Update is delete(a set of view tuples) • Motivation: – Classic: database/view maintenance • DB access only through views, hidden join keys, etc. – Debugging • [K&al.12]: deletion propagation for debugging text extractors – Database causality [Meliou&al.10] • Intuition: good propagation provides a good explanation of why we have the tuples to begin with • [Bertossi, Salimi 14]: “Unifying Causality, Diagnosis, Repairs and View-Updates in Databases” 44
  • 45. Example: File Access Access(u,f) :– UserGroup(u,g), GroupFile(g,f) • GroupFile (group, file): (ai, a.txt), (ai, b.txt), (db, a.txt), (db, b.txt), (os, a.txt) • UserGroup (user, group): (Emma, ai), (Emma, db), (Olivia, os), (Olivia, db), (Jacob, ai) • Access (user, file) = UserGroup ⋈ GroupFile: (Emma, a.txt), (Emma, b.txt), (Olivia, a.txt), (Olivia, b.txt), (Jacob, a.txt), (Jacob, b.txt) • Delete source rows, s.t. Emma won’t access a.txt. But, maintain maximum access permissions! [Cui&Widom01; Buneman&al.02]
  • 46. Example: File Access Access(u,f) :– UserGroup(u,g), GroupFile(g,f) • GroupFile (group, file): (ai, a.txt), (ai, b.txt), (db, a.txt), (db, b.txt), (os, a.txt) • UserGroup (user, group): (Emma, ai), (Emma, db), (Olivia, os), (Olivia, db), (Jacob, ai) • Access (user, file) = UserGroup ⋈ GroupFile: (Emma, a.txt), (Emma, b.txt), (Olivia, a.txt), (Olivia, b.txt), (Jacob, a.txt), (Jacob, b.txt) • Delete source rows, s.t. Emma won’t access a.txt. But, maintain maximum access permissions! [Cui&Widom01; Buneman&al.02]
  • 47. Example: File Access Access(u,f) :– UserGroup(u,g), GroupFile(g,f) • GroupFile (group, file): (ai, a.txt), (ai, b.txt), (db, a.txt), (db, b.txt), (os, a.txt) • UserGroup (user, group): (Emma, ai), (Emma, db), (Olivia, os), (Olivia, db), (Jacob, ai) • Access (user, file) = UserGroup ⋈ GroupFile: (Emma, a.txt), (Emma, b.txt), (Olivia, a.txt), (Olivia, b.txt), (Jacob, a.txt), (Jacob, b.txt) • Delete source rows, s.t. Emma won’t access a.txt. But, maintain maximum access permissions! [Cui&Widom01; Buneman&al.02] • Decision variant is NP-complete [Buneman et al. 02]
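    For an instance this small, the deletion-propagation task can be solved by exhaustive search. The Python sketch below is a brute-force illustration only (exponential in the number of source rows, in line with the hardness result), not one of the algorithms from the cited papers; `access` and `best_deletion` are hypothetical names.

```python
from itertools import combinations

USER_GROUP = [("Emma", "ai"), ("Emma", "db"), ("Olivia", "os"),
              ("Olivia", "db"), ("Jacob", "ai")]
GROUP_FILE = [("ai", "a.txt"), ("ai", "b.txt"), ("db", "a.txt"),
              ("db", "b.txt"), ("os", "a.txt")]


def access(user_group, group_file):
    """Access(u,f) :- UserGroup(u,g), GroupFile(g,f)."""
    return {(u, f) for (u, g) in user_group for (g2, f) in group_file if g == g2}


def best_deletion(targets):
    """Try every set of source rows; keep the one with the fewest lost non-target answers."""
    rows = [("UserGroup", r) for r in USER_GROUP] + [("GroupFile", r) for r in GROUP_FILE]
    original = access(USER_GROUP, GROUP_FILE)
    best = None
    for k in range(len(rows) + 1):  # smaller deletion sets are tried first
        for deleted in combinations(rows, k):
            ug = [r for r in USER_GROUP if ("UserGroup", r) not in deleted]
            gf = [r for r in GROUP_FILE if ("GroupFile", r) not in deleted]
            view = access(ug, gf)
            if view & targets:
                continue  # some target answer survives: not a solution
            side_effect = (original - targets) - view
            if best is None or len(side_effect) < len(best[1]):
                best = (set(deleted), side_effect)
    return best


deleted, side_effect = best_deletion({("Emma", "a.txt")})
print(deleted)
print(side_effect)
```

    On this instance the search finds a side-effect-free solution: removing Emma from group ai and a.txt from group db blocks both derivations of (Emma, a.txt), while every other answer keeps at least one derivation. Deleting both of Emma's memberships, or both a.txt rows, would each lose one extra answer.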
  • 48. Trichotomy in Complexity • Fix a schema (w/ FDs) and a CQ w/o self joins. What is the complexity of finding a solution with a minimal side effect? • We have established a precise (easily testable) criterion that partitions all cases into 3 categories: 1. The problem is solvable in PTime, and even via a straightforward algorithm [Buneman et al. 02] 2. The problem is NP-hard, but constant-ratio approximable in PTime (ILP relaxation) 3. The problem is inapproximable for every ratio • [K, Vondrak, Williams, Woodruff, PODS11, PODS12, TODS12, VLDB14]
  • 49. • Text Analytics in the Big Data Era • Information Extraction Systems & Formalism • Foundational Research Challenges • Conclusions and Outlook Outline 49
  • 50. Summary • Text analytics & IE • Rule systems for IE • A formal framework for rules, relating IE to traditional DB concepts such as Datalog • Research directions motivated by IE – Prioritized repairs – Graph mining – Update propagation 50
  • 51. Outlook: DB w/ Deep Text Support • We need a uniform & elegant data/query model to combine structured data & text; usefulness for querying both text and relations • We need a principled, simple & transparent probability model + effective quality + practical execution cost • We need to balance between automation and control: from full specification by experts to feature generation for nonexperienced – Maximally realize the potential of every developer! – LogicBlox is working on incorporating ML in Datalog! 51
  • 53. Room for Both Statistical Solution Rule System Feature Engineering Model Space, Runtime Cleaning + Post Proc. Cleaning + Post Proc. Building blocks (e.g., dictionaries, NER) “What doesn’t work: Anything requiring high precision and full automation” Feldman & Ungar, KDD’08 tutorial on text mining 53
  • 54. String DB, Spanners, Interval Algebra
      String Databases | Spanners | Interval Algebra
      Atomic value: string | Atomic value: span (pointing to doc) | Atomic value: interval (no text)
      Kaspersky, Kaspersky, Intel, Otellini, IBM, Rometty | [10,20), [16,26), [32,37), [50,58), [105,108), [121,128) | [10,20), [16,26), [32,37), [50,58), [105,108), [121,128)
      Join by string conditions (e.g., x is a substring of y) | Join by interval+string conditions (e.g., x a token in y) | Join by interval conditions (e.g., x is a sub-interval of y)
      Apps: text predicates in DBs [Grahne & al. 99] [Benedikt & al. 03], string manipulation [Bonner & Mecca 98] [Ginsburg and Wang 98] | App: IE | Apps: temporal reasoning [Allen 83] [Vilain & Kautz 86] [Nebel & Bürckert 95] [Krokhin et al. 03]
  • 55. Imp. 1: Connection to Known Concepts • Connection to Recognizable Relations [Berstel 79] – These are unions of cross products of regular languages – THM: The class of regular spanners is closed under a string-selection predicate iff the predicate is a recognizable relation • Connection to CRPQs [Cruz et al. 87] – Conjunctive Regular Path Queries have been studied as a query language for labeled graphs – THM: Regular spanners have the same expressive power as unions of CRPQs on paths “with marked endpoints” [Figure: a path labeled S I G M O D with marked endpoints] • Up to some simple and necessary adaptation between the models
  • 56. Imp. 2: Adding String Equality NR Datalog w/ regex formulas Regular Spanners Regularstr= Spanners + String-equality predicate (+substring-of, prefix-of, …) …application from Jane Doe, social 012-345-6789, on Mar 20th… identified as John Doe, 012-345-6789, ask us to… x1 x2 [117,125) (Jane Doe) [875,883) (John Doe) ⋮ ⋮ NameSSN(x,y) := … SameSSN(x1,x2) := NameSSN(x1,y1) , NameSSN(x2,y2) , str(y1)=str(y2) Same string, different spans 56
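    The SameSSN rule joins two spans because they spell the same string, even though they sit at different positions. A minimal Python sketch of that selection follows. The body of NameSSN is elided on the slide, so the regex used here is purely an assumption made for illustration, as are the shortened sample document and the function names.

```python
import re

DOC = ("application from Jane Doe, social 012-345-6789, on Mar 20th; "
       "identified as John Doe, 012-345-6789, ask us to")

# Hypothetical stand-in for the elided NameSSN rule: a two-word capitalized
# name, a comma, an optional "social", then a 3-3-4 digit identifier.
NAME_SSN = re.compile(
    r"(?P<x>[A-Z][a-z]+ [A-Z][a-z]+),(?: social)? (?P<y>[0-9]{3}-[0-9]{3}-[0-9]{4})"
)


def text_of(span):
    """Substring for a 1-based half-open span [i, j)."""
    i, j = span
    return DOC[i - 1:j - 1]


def name_ssn():
    """NameSSN(x, y) as a list of span pairs."""
    return [((m.start("x") + 1, m.end("x") + 1), (m.start("y") + 1, m.end("y") + 1))
            for m in NAME_SSN.finditer(DOC)]


def same_ssn():
    """SameSSN(x1,x2) := NameSSN(x1,y1), NameSSN(x2,y2), str(y1)=str(y2)."""
    rows = name_ssn()
    return [(x1, x2) for (x1, y1) in rows for (x2, y2) in rows
            if text_of(y1) == text_of(y2)]  # compare strings, not positions


pairs = {(text_of(x1), text_of(x2)) for x1, x2 in same_ssn()}
print(pairs)
```

    Comparing the y spans directly would never match across the two mentions, because their positions differ; comparing `text_of(y1)` with `text_of(y2)` is what the string-equality predicate adds. As written, the rule also relates each name to itself.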
  • 57. Difference with String Equality • Are regularstr= spanners closed under difference? – Why should they? Only positive operators are used… – However, regex formulas (our EDBs) can introduce “negative” operations (NFAs closed under complement) • THM: The class of regular spanners is closed under difference • PROP: The class of regularstr= spanners is closed under string-inequality selection • THM: The class of regularstr= spanners is closed under string-containment selection, but then, not under non-string-containment selection! • COR: The class of regularstr= is not closed under difference 57
  • 58. Formal Optimization Problem Fixed: • Schema S w/ fun. dependencies • Conjunctive query Q Input: • Database instance I over S • Set A⊆ Q(I) of answers to delete Output: J ⊆ I s.t. Q(J) ∩ A = ∅ Goal: Minimize |(Q(I) – A) – Q(J)| Side Effect 58