Text Analytics - JCC2014 Kimelfeld

Foundational Research
Propelled by Text Analytics
Benny Kimelfeld
LogicBlox

Preamble
• Myself:
– Ph.D. @ HebrewU (DB uncertainty + search)
– IBM Almaden (DB theory, IR, Text Analytics)
– LogicBlox (ML in DB, Prob. Programming)
– Technion IL (Associate Prof., next year)
• This talk:
 Infrastructure for text analytics
+ DB theory, formal languages, NLP, data mining,
computational complexity, …

Outline
• Text Analytics in the Big Data Era
• Information Extraction Systems & Formalism
• Foundational Research Challenges
• Conclusions and Outlook

Text Analytics Matters
Some important applications are based on the
analysis of text-centric data; for example:
Semantic Search
Semantic understanding & indexing of
content to better match user's intent
Life-Science Mining
Extract knowledge bases from
scientific publications
e-Commerce
Comparison Shopping extracts &
compares inventory from online sources
CRM / BI
Monitor customer’s social-media activity
for sentiment & business leads
Log Analysis
Summarize, visualize and analyze logs
produced by machines

Database Management Systems
• Old news: Data management is involved!
– Data semantics, query/analysis semantics, storage,
query evaluation, indices, consistency, transactions,
backup, privacy, recovery, …
– From-scratch engineering is highly challenging
• Motivation for the concept of a general-purpose
Database Management System
– Most notably: relational model (pioneered by Edgar F.
Codd in 1969) and SQL

“Big Data” Phenomena
Past → Present:
• Proprietary data in orgs. (enterprises, governments, …) → Proliferation of publicly open data sources (Web, social, …)
• Massive-data analyses incurred high machinery/personnel cost → Business models (cloud, crowd, open source) facilitate analyses
• Data structured/controlled by admins, e-forms, software, … → Uncontrolled data from humans’ free text, heterogeneous KBs, …
• Analyses by specialized teams of heavily trained experts → Analyses by a wide community featuring a wide range of skills

“By 2018, the United States alone could face a shortage of 140,000
to 190,000 people with deep analytical skills as well as 1.5 million
managers and analysts with the know-how to use the analysis of
big data to make effective decisions.”
“Big data: The next frontier for innovation, competition, and productivity”
McKinsey Report, May 2011
We need dev. & management systems to
facilitate value extraction from Big Data
by a wide range of users / skills

Core Task: Information Extraction (IE)
“Information Extraction (IE) is the name given to any process
which selectively structures and combines data which is found,
explicitly stated or implied, in one or more texts. The final
output of the extraction process varies; in every case, however, it
can be transformed so as to populate some type of database.”
J. Cowie and Y. Wilks., Handbook of
Natural Language Processing, 2000
“Information extraction is the identification, and consequent or concurrent
classification and structuring into semantic classes, of specific
information found in unstructured data sources, such as natural language
text, making the information more suitable for information processing tasks.”
M. F. Moens, Information Extraction: Algorithms
and Prospects in a Retrieval Context, 2006
In short: data-in-text (unstructured) → data-in-db (structured)

Popular Classes of IE Tasks
• Named Entity Recognition
From September 1936 to July 1938,
Turing spent most of his time studying
under Church at Princeton University.
In June 1938, he obtained his PhD
from Princeton.
Labels: Turing (person), Church (person), Princeton University (organization), Princeton (organization)

• Relation Extraction
[Figure: the same passage annotated with the relations AdvisedBy and WorksIn]

• Event Extraction
[Figure: the same passage annotated with a Graduation event and its roles (Who? Where?)]

• Temporal IE
[Figure: the same passage annotated with an Education interval (Start, End) and a Graduation event (When?)]

• Coreference Resolution
[Figure: the same passage with SameEntity links between mentions]

IE Paradigms: Rules & Statistics
[Figure: “Implementations of Entity Extraction”. NLP papers (2003-2012): 3.5% rule-based, 21% hybrid, 75% machine-learning based. Commercial vendors (2013), all vendors: 45% rule-based, 22% hybrid, 33% machine-learning based; large vendors: 67% rule-based, 17% hybrid, 17% machine-learning based]
• Rules
• ML classification
• Probabilistic graphical models
• Soft logic
[Chiticariu, Li, Reiss, EMNLP’13]
• EMNLP, ACL, NAACL, 2003-
2012
• 54 industrial vendors (Who’s
Who in Text Analytics, 2012)
“[…] rules are effective,
interpretable, and are easy
to customize by non-experts
to cope with errors.”
Gupta & Manning, CONLL’14
+
NLP


Xlog: Datalog for IE
• Extension of (non-recursive) Datalog
• Use case: DBLife (db research kb: dblife.cs.wisc.edu)
• Data types: string, document, span
– Focus on single-document programs
• “Procedural predicates” (p-predicates) are user-defined
functions that produce relations over spans
– Example: sentence(doc, span)
• Query-plan optimization
[Shen, Doan, Naughton, Ramakrishnan, VLDB 2007]
Kaspersky Lab CEO Eugene Kaspersky said Intel CEO Paul
Otellini and the Intel board had no idea what they were in for when
the company announced it was acquiring McAfee on August 19,
2010.
Same string, different spans
Span [42,47)
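The span notion above can be sketched in a few lines of Python. This is a minimal illustration, assuming 1-based, half-open spans [i,j) as on the slide; the names `Span`, `span_text` and `occurrences` are hypothetical and not part of Xlog. Offsets are computed from the string as typed here, so they need not coincide with the figures on the slides.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # 1-based index of the first character
    end: int    # 1-based index just past the last character


def span_text(doc: str, span: Span) -> str:
    """Return the substring of doc that the span points to."""
    return doc[span.start - 1:span.end - 1]


def occurrences(doc: str, needle: str) -> list[Span]:
    """Return the span of every occurrence of needle in doc."""
    spans = []
    pos = doc.find(needle)
    while pos != -1:
        spans.append(Span(pos + 1, pos + 1 + len(needle)))
        pos = doc.find(needle, pos + 1)
    return spans


doc = "Kaspersky Lab CEO Eugene Kaspersky said Intel CEO Paul Otellini"
first, second = occurrences(doc, "Kaspersky")
# Same string, different spans.
print(first, second, span_text(doc, first) == span_text(doc, second))
```

Two spans are equal only if their offsets are equal, which is why a span is a richer value than the string it covers.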

Xlog Example
“Declarative Information Extraction using Datalog with Embedded Extraction Predicates”
[Shen, Doan, Naughton, Ramakrishnan, VLDB 2007]
[Figure: Xlog program listing, with callouts marking a regex (string), a unary regex formula, and a binary regex formula]

Instaread: Datalog + NLP
• Datalog syntax
– Types: string, span
• Built-in collection of p-predicates
– Various types of built-in regex formulas
– Linguistic: deep parsing, coreference resolution, named-entity extractor
[Figure: Instaread program listing, with callouts marking unary and binary regex formulas]
[Hoffmann, 2012]

IBM SystemT: SQL for IE
• Engine for AQL: SQL-like declarative IE lang.
– AQL = Annotation Query Language
• SystemT = AQL + Runtime + Dev. Tooling
– [Chiticariu et al., ACL 2010]: position SystemT as a
high-quality and high-efficiency IE solution
– System and IDE demos in ACL 2011, SIGMOD 2011
• Commercial product, high academic presence
– Integration on public financial records [Hernández et al., EDBT’13,
Balakrishnan et al. SIGMOD’10], NER [Chiticariu et al. EMNLP’10,
ACL’10, Nagesh et al. EMNLP’12, Roy et al. SIGMOD’13], IR [Zhu et
al. WWW’10, K et al. SIGIR’12, CIKM’12], sentiment analysis [Hu et
al., Interact’13], social media [Sindhwani et al., IBM Journal 2011]

SystemT’s AQL Example
[Chiticariu, Krishnamurthy, Li, Raghavan, Reiss, Vaithyanathan, ACL 2010]
[Figure: AQL program listing, with callouts marking unary regex formulas, regex + join w/ previous views, projection, union, and cleaning]

Formal Framework
• Repeated concept: Extend a relational query
language with text transducers (p-predicates,
usually regex formulas)
• Research challenge: theoretical underpinnings
of this combined document/relation model
• Expressive power
– Query-plan optimization: Can we rewrite an operator via “easier”
building blocks?
– System extensions: Can we express a new operation using
existing ones, or prove impossibility?
• Next: a formal framework
– With Fagin, Reiss, Vansummeren, PODS’13, JACM

Terminology
Kaspersky Lab CEO Eugene Kaspersky said Intel CEO Paul
Otellini and the Intel board had no idea what they were in for when
the company announced it was acquiring McAfee on August 19,
2010.
Document: the sentence above; e.g., the span [52,65) covers “Paul Otellini”
Relation over spans from the document:
Company CEO CompanyCEO
[1,14) (Kaspersky Lab) [19,36) (Eugene Kaspersky) [1,36)
[42,47) (Intel) [52,65) (Paul Otellini) [42,65)

Document Spanners
Document d Relation over the spans of d
Kaspersky Lab CEO Eugene
Kaspersky said Intel CEO Paul Otellini
and the Intel board had no idea what
they were in for when the company
announced it was acquiring McAfee
on August 19, 2010.
x y z
[1,14) [30,36) [1,36)
[42,47) [52,65) [42,65)
[102,110) [115,125) [102,125)
Document Spanner: a function that maps every
doc. (string) into a relation over the doc.’s spans
More formally:
• Finite alphabet Σ of symbols
• A spanner maps each doc. d ∈ Σ* into a relation over the spans [i,j) of d
• The relation has a fixed signature (set of attributes)
− The attributes come from an infinite domain of variables x, y, z, …

Spanners as Regex Formulas
• Regular expression with embedded variables
• Examples (an ordinary regex with embedded span variables x, w, y, z):
 .* x{dddd} .*
 .* in w{Alabama | Alaska | Arizona | …} .*
 (.* z{[A-Z][a-z]*, y{[A-Z][a-z]*}} .*) | …
• Restriction: each “evaluation” (parse tree) assigns
one span to each variable (see [Fagin et al., PODS’13])
• Regex formulas are a representation system for spanners
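A regex formula returns one tuple per evaluation, so overlapping matches all appear in the result. The sketch below emulates the first example, `.* x{dddd} .*`, with Python's `re` module, assuming that `d` stands for a single digit and that spans are 1-based and half-open. The function name `four_digit_spanner` is hypothetical.

```python
import re


def four_digit_spanner(doc):
    """Emulate .* x{dddd} .* : every span of four consecutive digits."""
    digits = re.compile(r"[0-9]{4}")
    relation = set()
    # Try every start offset: a spanner returns all evaluations,
    # whereas re.finditer would skip overlapping matches.
    for i in range(len(doc)):
        m = digits.match(doc, i)
        if m:
            relation.add((i + 1, m.end() + 1))
    return relation


print(sorted(four_digit_spanner("in 12345")))  # [(4, 8), (5, 9)]
```

On "12345" the spanner yields two tuples (for 1234 and 2345), which is the kind of overlap that later motivates cleaning policies.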

Spanners as Datalog w/ Regex
• Non-recursive Datalog (NR-Datalog)
• Operate over a document (not a relational db)
Token(x) := [ (ε | .*_) x{[a-zA-Z]+} ( ((,|_) .*) | ε) ]
State(x) := Token(x) , [.* x{Georgia|Virginia|Washington}.*]
Cap1st(x) := Token(x) , [.* x{[A-Z].*}.*]
CommaSp(x,y,z) := [.* z{x{.*} ,_ y{.*}}.*]
Loc(z) := CommaSp(x,y,z) , Cap1st(x) , State(y)
RETURN(x,z) := Cap1st(x) , [.*x{.*}_from_z{.*}.*] , Loc(z)
• EDBs = Spanners!
• RETURN is the query goal
Document: Carter_from_Plains,_Georgia,_Washington_from_Westmoreland,_Virginia
Result:
x z
[1,7) (Carter) [13,28) (Plains,_Georgia)
[30,40) (Washington) [46,69) (Westmoreland,_Virginia)
Another representation system for spanners
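The program on this slide can be emulated directly in Python by computing each predicate as a set of spans and joining the sets. This is a sketch under assumptions: spans are 1-based and half-open, `_` is an ordinary character, and the regex formulas are hand-translated rather than compiled (the helper names are hypothetical). Offsets are computed from the document string as typed here and may differ by one from the figures on the slide.

```python
import re

DOC = "Carter_from_Plains,_Georgia,_Washington_from_Westmoreland,_Virginia"
STATES = {"Georgia", "Virginia", "Washington"}


def span_text(doc, span):
    start, end = span
    return doc[start - 1:end - 1]


def token(doc):
    # Token(x): a letter run preceded by "_" or the start of the document,
    # and followed by ",", "_" or the end of the document.
    pattern = re.compile(r"(?<![^_])[a-zA-Z]+(?![^,_])")
    return {(m.start() + 1, m.end() + 1) for m in pattern.finditer(doc)}


def state(doc):
    return {s for s in token(doc) if span_text(doc, s) in STATES}


def cap1st(doc):
    return {s for s in token(doc) if "A" <= span_text(doc, s)[0] <= "Z"}


def loc(doc):
    # Loc(z) := CommaSp(x,y,z), Cap1st(x), State(y):
    # z starts where x starts, ends where y ends, with ",_" in between.
    return {(x[0], y[1]) for x in cap1st(doc) for y in state(doc)
            if doc[x[1] - 1:y[0] - 1] == ",_"}


def query(doc):
    # RETURN(x,z) := Cap1st(x), [.*x{.*}_from_z{.*}.*], Loc(z)
    return {(x, z) for x in cap1st(doc) for z in loc(doc)
            if doc[x[1] - 1:z[0] - 1] == "_from_"}


for x, z in sorted(query(DOC)):
    print(x, span_text(DOC, x), z, span_text(DOC, z))
```

Note that Loc also contains the span of "Georgia,_Washington", since Georgia is capitalized and Washington is a state; it is filtered out of the result only because no capitalized token followed by "_from_" precedes it.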

Spanners as Automata
[Figure: three automata over the alphabet {0,1}, each shown reading the input 1 0 0 1 1 1 0 1: an ordinary NFA; a var-stack automaton, whose close transition “}” closes the most recently opened variable (e.g., y{ x{ } }); and a var-set automaton, whose close transitions “}x” and “}y” name the variable to close (e.g., y{ x{ }x }y)]
• In an accepting run, each variable opens and later closes exactly once
⇒ Each accepting run defines an assignment to the variables
• Nondeterministic ⇒ multiple accepting runs ⇒ multiple tuples
• Another representation system for spanners
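To make the run semantics concrete, here is a small simulator for a var-set automaton. The automaton itself is hypothetical (it is not the one drawn on the slide): it skips any prefix over {0,1}, opens x, reads only 1s, closes x, and skips the rest, so it extracts every span consisting solely of 1s, including empty spans. Spans are assumed to be 1-based and half-open.

```python
def run_vset(doc, transitions, start, finals, variables):
    """Enumerate the tuples defined by all accepting runs."""
    results = set()

    def step(state, pos, opened, closed):
        # Accept when the input is consumed and every variable is closed.
        if pos == len(doc) and state in finals and len(closed) == len(variables):
            results.add(tuple((opened[v], closed[v]) for v in variables))
        for src, kind, arg, dst in transitions:
            if src != state:
                continue
            if kind == "char" and pos < len(doc) and doc[pos] in arg:
                step(dst, pos + 1, opened, closed)
            elif kind == "open" and arg not in opened:
                step(dst, pos, {**opened, arg: pos + 1}, closed)
            elif kind == "close" and arg in opened and arg not in closed:
                step(dst, pos, opened, {**closed, arg: pos + 1})

    step(start, 0, {}, {})
    return results


# Hypothetical automaton: (source state, kind, argument, target state).
TRANSITIONS = [
    ("q0", "char", "01", "q0"),
    ("q0", "open", "x", "q1"),
    ("q1", "char", "1", "q1"),
    ("q1", "close", "x", "q2"),
    ("q2", "char", "01", "q2"),
]

print(sorted(run_vset("101", TRANSITIONS, "q0", {"q2"}, ["x"])))
```

Each accepting run contributes one tuple, so the nondeterministic choice of where to open and close x is exactly what produces multiple tuples.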

Study of Expressive Power
Spanners definable by:
• regex formulas = var-stack automata
• var-set automata = NR Datalog w/ regex formulas
[Figure: the var-stack and var-set automata and the NR-Datalog program from the previous slides]

Consequences
• Connections between Datalog+regex
spanners and other language formalisms
– Classic string relations [Berstel 79]
– Graph queries (CRPQs) [Cruz et al. 87]
• Extension with string equality & difference
– Expressiveness / closure properties
• Principles for cleaning inconsistencies
– Follow up work [PODS’14]
– Next…


Next, highlight 3 lines of foundational research that
were motivated by our work on text analytics:
1. Database inconsistency w/ repair priorities
2. Frequent subgraph mining
3. Update propagation

Cleaning IE Inconsistencies
• Extractors may produce inconsistent results
– Data artifacts
– Developer limitations
• Rather than repairing the existing extractors,
common practice is to clean (intermediate) results
– SystemT “consolidators” [Chiticariu et al.10]
– GATE/JAPE “controls” [Cunningham 02]
– Implicit in other rule systems, e.g., WHISK [Soderland 99]
– POSIX regex disambiguation [Fowler 03]
Example: “33 Martin Luther King Jr. Dr., SE, Atlanta, GA 30303” receives the overlapping annotations Person1, Person2, and Address1

SystemT Consolidators
[Chiticariu, Krishnamurthy, Li, Raghavan, Reiss, Vaithyanathan, ACL 2010]
[Figure: AQL consolidation example; other policies are built in]

Five GATE/JAPE Controls
• All
• Once
• First
• Appelt
• Brill
Document spanner: .* x{dd+} .*
Document: Sequence 12345 and sequence 12.
[Screenshots from GATE UI]
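The need for such controls shows up already on this tiny example. The sketch below evaluates the spanner `.* x{dd+} .*` (assuming `d` means a digit) and then applies one simple cleaning policy: drop every span that is contained in another span. This policy is only an illustration of the idea; it is not claimed to reproduce any specific GATE/JAPE control or SystemT consolidator, and the function names are hypothetical.

```python
def digit_spans(doc):
    """All spans of two or more consecutive digits (1-based, half-open)."""
    spans = set()
    for i in range(len(doc)):
        j = i
        while j < len(doc) and doc[j] in "0123456789":
            j += 1
            if j - i >= 2:
                spans.add((i + 1, j + 1))
    return spans


def drop_contained(spans):
    """Keep only spans that are not contained in a different span."""
    return {s for s in spans
            if not any(o != s and o[0] <= s[0] and s[1] <= o[1] for o in spans)}


doc = "Sequence 12345 and sequence 12."
raw = digit_spans(doc)
clean = drop_contained(raw)
print(len(raw), [doc[a - 1:b - 1] for a, b in sorted(clean)])
```

The raw result holds every digit run of length at least two inside 12345 plus the final 12; after cleaning only the two maximal runs remain.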

Cleaning via Prioritized Repairs
• Problem: existing policies are ad-hoc; how to
expose a language for user declaration?
• [Fagin, K, Reiss, Vansummeren 2014]: spanner
formalism for declarative cleaning
• Key: prioritized repairs [Staworko, et al. 12]
• Idea: Extend extraction programs with
– Denial constraints: which facts are in conflict?
– Priority declarations: preference between facts
• Captures SystemT, GATE, WHISK, POSIX, …
• We are now trying to improve our understanding
of prioritized repairs…

Prioritized Repairs: Definition
Inconsistent database instance:
• Database: collection of facts
• Denial constraints: which sets of facts cannot co-exist?
• Priority relation: binary “is preferred to” relation
• [Arenas, Bertossi, Chomicki 99]: Inconsistent DB
represents a set of (equally likely) “repairs”
– Then we can ask for the “possible” or “consistent” query answers
• [Staworko, Chomicki, Marcinkowski 12] add priorities:
– Let A and B be two consistent subsets of the database
– Say that A improves B if we can obtain A from B by a
“profitable” exchange of facts (precision later…)
– A repair is a consistent subset that cannot be improved

Example
professor university city
Monica ubiobio Concepción
Monica carleton Ottawa
Jorge uchile Santiago
Jorge ubiobio Santiago
Pablo uchile Santiago
Violated constraints (functional dependencies):
• professor → university, city (“key constraint”)
• university → city
“Ordinary” repairs [Arenas et al. 99]
Tuple priority ⇒ some repairs can be discarded [Staworko et al.]
A improves B if we get A from B by removing tuples & adding tuples, where each removed tuple is less preferred than some added tuple
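The example can be worked through by brute force. The sketch below enumerates all consistent subsets of the five tuples and keeps those that no other consistent subset improves, reading "improves" as stated above (every removed tuple is dominated by some added tuple). The priority pairs passed in the second call are hypothetical, chosen only to show repairs being discarded; the slide does not specify them.

```python
from itertools import combinations

T1 = ("Monica", "ubiobio", "Concepción")
T2 = ("Monica", "carleton", "Ottawa")
T3 = ("Jorge", "uchile", "Santiago")
T4 = ("Jorge", "ubiobio", "Santiago")
T5 = ("Pablo", "uchile", "Santiago")
ROWS = [T1, T2, T3, T4, T5]


def conflict(a, b):
    # professor → university, city   and   university → city
    return (a[0] == b[0] and a[1:] != b[1:]) or (a[1] == b[1] and a[2] != b[2])


def consistent(rows):
    return not any(conflict(a, b) for a, b in combinations(rows, 2))


def improves(a, b, prefers):
    # a improves b: a differs from b, and each tuple dropped from b
    # is less preferred than some tuple that a adds.
    return a != b and all(any((x, y) in prefers for x in a - b) for y in b - a)


def repairs(rows, prefers):
    candidates = [frozenset(c) for k in range(len(rows) + 1)
                  for c in combinations(rows, k) if consistent(c)]
    return [b for b in candidates
            if not any(improves(a, b, prefers) for a in candidates)]


print(len(repairs(ROWS, set())))                 # ordinary repairs
print(repairs(ROWS, {(T1, T2), (T3, T4)}))       # with hypothetical priorities
```

With no priorities the definition collapses to maximal consistent subsets, of which there are three here; with the two assumed preferences only one repair survives.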

Complexity of Testing Improvability
Can a consistent subset be improved?
Theorem:
• In the case of a single functional dependency or two keys per relation, improvability can be tested in polynomial time
• In any other combination of FDs, the problem is NP-complete!
Example (two keys):
university faculty dean
UChile Economics Agosin
Technion CS Yavneh
Stanford Law Magill
Recent work (unpublished) w/ Fagin & Kolaitis

IE with Recurring Patterns
I want to buy my advisor a gift.
I really want to buy a gift to my advisor.
I want to buy a gift to the secretary and to my advisor.
1. Apply dependency parsing
2. Find freq. recurring patterns
[Figure: dependency trees of the three sentences, sharing a recurring pattern over the words I, want, buy, gift, advisor]
[Zhang, Baldwin, Ho, K, Li, ACL13]: Restoring grammar in social media, sms, etc.

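Maximal Frequent Subgraphs
[Figure: input graphs g1, g2, g3, g4 with frequency threshold = 3; candidate subgraphs are marked “Freq.” (frequent) or “Freq. Max.” (frequent and maximal)]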
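The definitions can be illustrated by brute force in the simplest setting, where node labels do not repeat within a graph. A graph is then just a set of labeled edges, and a pattern occurs in a graph exactly when its edges are a subset of the graph's edges. The input graphs below are hypothetical, and the exhaustive enumeration is for illustration only; it is not one of the mining algorithms cited in this talk.

```python
from itertools import combinations


def is_connected(edges):
    """Check that a non-empty collection of edges forms a connected graph."""
    edges = list(edges)
    reached = set(edges[0])
    pending = edges[1:]
    progress = True
    while pending and progress:
        progress = False
        for edge in list(pending):
            if reached & set(edge):
                reached.update(edge)
                pending.remove(edge)
                progress = True
    return not pending


def maximal_frequent(graphs, threshold):
    """Connected patterns that occur in >= threshold graphs and have no
    frequent proper super-pattern (labels assumed unique per graph)."""
    all_edges = sorted(set().union(*graphs))
    frequent = [frozenset(c)
                for r in range(1, len(all_edges) + 1)
                for c in combinations(all_edges, r)
                if is_connected(c) and sum(set(c) <= g for g in graphs) >= threshold]
    return [p for p in frequent if not any(p < q for q in frequent)]


# Hypothetical input graphs over node labels a, b, c, d.
GRAPHS = [
    {("a", "b"), ("b", "c")},
    {("a", "b"), ("b", "c"), ("c", "d")},
    {("a", "b"), ("b", "d"), ("c", "d")},
    {("b", "c"), ("c", "d")},
]

for pattern in maximal_frequent(GRAPHS, 2):
    print(sorted(pattern))
```

With threshold 2 the two maximal patterns are the paths a-b-c and b-c-d; raising the threshold to 3 leaves only single edges.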

Complexity Study
• Naturally, there has been a lot of work on this problem
– SPIN [Huan et al. 04], MARGIN [Thomas et al. 10], …
• But little was known about the computational complexity
• Studied: impact of assumptions on comp. complexity
– Graph properties (e.g., trees, treewidth, etc.)
– Label repeatability
– Bounded #results desired
– Bounded threshold
• This work led to novel complexity results and new
methodologies for mining maximal subgraphs
– [K & Kolaitis, ACM PODS’13, ACM TODS]
• Next, some complexity nuggets 

Complexity Nuggets
• Good news: If labels do not repeat in each input
graph, then there are PTime solutions when
– The threshold is bounded; or
– Graphs are trees & few results are desired
• In general graphs w/o label repetition, you can
find 2 results in PTime
– Bad news: But finding 3rd is NP-hard!
– Bad news: And if labels repeat and graphs are
trees, then finding 2nd is already NP-hard!
• Even for a bounded threshold

Improving Dictionaries w/ Feedback
[Figure: IE programs extract company occurrences (using dictionaries of companies, countries, …) and address occurrences from text fragments (sentences, tables, rows, …) of Web data; their join yields pairs such as (IBM, San Jose), (Apple, Cupertino), (IBM, Armonk), (Yahoo!, Cupertino), …]
• Goal: auto. suggest a “good” fix to the IE program
• “good” = small effect on other results

View Updates
• View-update problem: Translate an update on a view to
an update on the base relations
• Deletion propagation as a special case
– Update is delete(a set of view tuples)
• Motivation:
– Classic: database/view maintenance
• DB access only through views, hidden join keys, etc.
– Debugging
• [K&al.12]: deletion propagation for debugging text extractors
– Database causality [Meliou&al.10]
• Intuition: good propagation provides a good explanation of why we
have the tuples to begin with
• [Bertossi, Salimi 14]: “Unifying Causality, Diagnosis,
Repairs and View-Updates in Databases”

Example: File Access
Access(u,f) :– UserGroup(u,g), GroupFile(g,f)
(Access = UserGroup ⋈ GroupFile)
UserGroup
user group
Emma ai
Emma db
Olivia os
Olivia db
Jacob ai
GroupFile
group file
ai a.txt
ai b.txt
db a.txt
db b.txt
os a.txt
Access
user file
Emma a.txt
Emma b.txt
Olivia a.txt
Olivia b.txt
Jacob a.txt
Jacob b.txt
Delete source rows, s.t. Emma won’t access a.txt.
But, maintain maximum access permissions!
[Cui&Widom01; Buneman&al.02]

Decision variant is NP-complete [Buneman et al. 02]
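On an instance this small, the file-access example can be solved by exhaustive search, which also shows what "side effect" means. The sketch below tries every set of source rows to delete, discards those that leave (Emma, a.txt) in the view, and keeps a deletion that loses the fewest other view tuples. The function names are hypothetical, and the exponential search is only for illustration.

```python
from itertools import combinations

USER_GROUP = {("Emma", "ai"), ("Emma", "db"), ("Olivia", "os"),
              ("Olivia", "db"), ("Jacob", "ai")}
GROUP_FILE = {("ai", "a.txt"), ("ai", "b.txt"), ("db", "a.txt"),
              ("db", "b.txt"), ("os", "a.txt")}


def access(user_group, group_file):
    # Access(u,f) :- UserGroup(u,g), GroupFile(g,f)
    return {(u, f) for (u, g) in user_group for (g2, f) in group_file if g == g2}


def best_deletion(user_group, group_file, unwanted):
    """Return (side effect, deleted source rows) with a minimal side effect."""
    full = access(user_group, group_file)
    sources = ([("UserGroup", r) for r in sorted(user_group)]
               + [("GroupFile", r) for r in sorted(group_file)])
    best = None
    for k in range(len(sources) + 1):
        for deleted in combinations(sources, k):
            ug = user_group - {r for t, r in deleted if t == "UserGroup"}
            gf = group_file - {r for t, r in deleted if t == "GroupFile"}
            view = access(ug, gf)
            if view & unwanted:
                continue  # an unwanted answer survives
            side_effect = len((full - unwanted) - view)
            if best is None or side_effect < best[0]:
                best = (side_effect, set(deleted))
    return best


side_effect, deleted = best_deletion(USER_GROUP, GROUP_FILE, {("Emma", "a.txt")})
print(side_effect, sorted(deleted))
```

Here a side-effect-free solution exists: remove Emma from group ai and remove a.txt from group db, so that Emma keeps b.txt while Olivia and Jacob keep everything.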

Trichotomy in Complexity
Fix a schema (w/ FDs) and a CQ w/o self joins
What is the complexity of finding a solution with a minimal side effect?
We have established a precise (easily testable) criterion
that partitions all cases into 3 categories:
1. The problem is solvable in PTime, and even via a
straightforward algorithm [Buneman et al. 2001]
2. The problem is NP-hard, but constant-ratio
approximable in PTime (ILP relaxation)
3. The problem is inapproximable for every ratio
[K, Vondrak, Williams, Woodruff, PODS11, PODS12, TODS12, VLDB14]


Summary
• Text analytics & IE
• Rule systems for IE
• A formal framework for rules, relating IE to
traditional DB concepts such as Datalog
• Research directions motivated by IE
– Prioritized repairs
– Graph mining
– Update propagation

Outlook: DB w/ Deep Text Support
• We need a uniform & elegant data/query model to
combine structured data & text; usefulness for querying
both text and relations
• We need a principled, simple & transparent probability
model + effective quality + practical execution cost
• We need to balance between automation and control:
from full specification by experts to feature generation for
nonexperienced
– Maximally realize the potential of every developer!
– LogicBlox is working on incorporating ML in Datalog!

Room for Both
Statistical Solution
Rule System
Feature Engineering
Model Space, Runtime
Cleaning + Post Proc.
Cleaning + Post Proc.
Building blocks
(e.g., dictionaries, NER)
“What doesn’t work: Anything requiring high
precision and full automation”
Feldman & Ungar, KDD’08 tutorial on text mining

String DB, Spanners, Interval Algebra
String Databases
• Example relation: (Kaspersky, Kaspersky), (Intel, Otellini), (IBM, Rometty)
• Atomic value: string
• Join by string conditions (e.g., x is a substring of y)
• Apps: text predicates in DBs [Grahne & al. 99] [Benedikt & al. 03], string manipulation [Bonner & Mecca 98] [Ginsburg and Wang 98]
Spanners
• Example relation: ([10,20), [16,26)), ([32,37), [50,58)), ([105,108), [121,128))
• Atomic value: span (pointing to doc)
• Join by interval+string conditions (e.g., x is a token in y)
• App: IE
Interval Algebra
• Example relation: ([10,20), [16,26)), ([32,37), [50,58)), ([105,108), [121,128))
• Atomic value: interval (no text)
• Join by interval conditions (e.g., x is a sub-interval of y)
• Apps: temporal reasoning [Allen 83] [Vilain & Kautz 86] [Nebel & Bürckert 95] [Krokhin et al. 03]

Imp. 1: Connection to Known Concepts
• Connection to Recognizable Relations [Berstel 79]
– These are unions of cross products of regular languages
– THM: The class of regular spanners is closed under
a string-selection predicate iff the predicate is a
recognizable relation
• Connection to CRPQs [Cruz et al. 87]
– Conjunctive Regular Path Queries have been studied as a
query language for labeled graphs
– THM: Regular spanners have the same expressive
power as unions of CRPQs on paths “with marked
endpoints”
• Up to some simple and necessary adaptation between the models
[Figure: a path with marked endpoints, labeled S I G M O D]

Imp. 2: Adding String Equality
• NR Datalog w/ regex formulas = Regular Spanners
• + String-equality predicate (+substring-of, prefix-of, …) = Regularstr= Spanners
Example document: “…application from Jane Doe, social 012-345-6789, on Mar 20th… identified as John Doe, 012-345-6789, ask us to…”
NameSSN(x,y) := …
SameSSN(x1,x2) := NameSSN(x1,y1) , NameSSN(x2,y2) , str(y1)=str(y2)
Result (same string, different spans):
x1 x2
[117,125) (Jane Doe) [875,883) (John Doe)
⋮ ⋮
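String equality compares the text under two spans rather than the spans themselves. The sketch below joins name/number pairs on equal number strings, following the SameSSN rule. The body of NameSSN is elided on the slide, so the regular expression used here is purely an assumption for illustration, as is the shortened document; spans are 1-based and half-open.

```python
import re

# Hypothetical stand-in for the elided NameSSN(x,y) extractor.
NAME_SSN = re.compile(
    r"([A-Z][a-z]+ [A-Z][a-z]+),(?: social)? ([0-9]{3}-[0-9]{3}-[0-9]{4})")


def span_text(doc, span):
    start, end = span
    return doc[start - 1:end - 1]


def name_ssn(doc):
    return {((m.start(1) + 1, m.end(1) + 1), (m.start(2) + 1, m.end(2) + 1))
            for m in NAME_SSN.finditer(doc)}


def same_ssn(doc):
    # SameSSN(x1,x2) := NameSSN(x1,y1), NameSSN(x2,y2), str(y1)=str(y2)
    facts = name_ssn(doc)
    return {(x1, x2) for x1, y1 in facts for x2, y2 in facts
            if span_text(doc, y1) == span_text(doc, y2)}


doc = ("application from Jane Doe, social 012-345-6789, on Mar 20th; "
       "identified as John Doe, 012-345-6789, ask us to")
for x1, x2 in sorted(same_ssn(doc)):
    print(x1, span_text(doc, x1), x2, span_text(doc, x2))
```

As written, the rule also returns the pairs where x1 and x2 are the same span; an application would typically filter those out.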

Difference with String Equality
• Are regularstr= spanners closed under difference?
– Why should they? Only positive operators are used…
– However, regex formulas (our EDBs) can introduce
“negative” operations (NFAs closed under complement)
• THM: The class of regular spanners is closed under
difference
• PROP: The class of regularstr= spanners is closed
under string-inequality selection
• THM: The class of regularstr= spanners is closed
under string-containment selection, but then, not
under non-string-containment selection!
• COR: The class of regularstr= spanners is not closed under
difference

Formal Optimization Problem
Fixed:
• Schema S w/ fun. dependencies
• Conjunctive query Q
Input:
• Database instance I over S
• Set A ⊆ Q(I) of answers to delete
Output: J ⊆ I s.t. Q(J) ∩ A = ∅
Goal: Minimize the side effect |(Q(I) – A) – Q(J)|
