SystemT: Declarative Information Extraction

Scalable Natural Language Processing (SNaP)
IBM Research | Almaden
Yunyao Li

Text analytics is the key to unlocking big data value
• Customer 360º View
– Social Media Analytics, CRM Analytics
• Security & Privacy
– Email Analytics, Data Redaction, Digital Piracy
• Operations Analysis
– Machine Data Analytics
• Financial Analytics
– Systemic risk analysis, Event monitoring
• Healthcare Analytics
– Patient care, drug discovery
and many more …

Text Analytics vs Information Extraction
[Figure labels: Text Analytics; Information Extraction; Clustering; Classification; Regression; Core operations such as POS tagging, deep parsing, sentence detection, etc.]
Information extraction: A key enabling technology for text analytics
• Input: Raw text and features from low-level natural language processing operations
• Output: Structured data suitable for general purpose analysis software

Information Extraction All Over
Application areas: Customer 360º, Security & Privacy, Operations Analysis, Financial Analytics
Sources: Social Media, Email, Machine Data, Financial Statements, Medical Records, News, CRM, Patents
Examples (raw text → extracted record):
• "… hotel close to Seaworld in Orlando? Planning to go this coming weekend…" → Product:Hotel Location:Orlando
• "my social security # is 400-05-4356" → Type:SSN Value:400054356
• "Storage Module 2 Drive 1 fault" → Event:DriveFail Loc:mod2.Slot1
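As a rough illustration of the text-to-record step shown in the SSN example, the following Java sketch turns a matching span into the structured form from the slide. The class name, method name and the regular expression are assumptions made for this illustration; they are not SystemT's rules or API.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: a hypothetical one-rule extractor, not part of SystemT. */
final class SsnExtractionSketch {

    // Assumed surface form: three digits, two digits, four digits, separated by hyphens.
    private static final Pattern SSN =
        Pattern.compile("\\b(\\d{3})-(\\d{2})-(\\d{4})\\b");

    /** Returns a record such as "Type:SSN Value:400054356", or empty if nothing matches. */
    static Optional<String> extract(String text) {
        Matcher m = SSN.matcher(text);
        if (!m.find()) {
            return Optional.empty();
        }
        // Normalize the value by dropping the hyphens, as in the slide's output.
        return Optional.of("Type:SSN Value:" + m.group(1) + m.group(2) + m.group(3));
    }
}
```

Calling `SsnExtractionSketch.extract("my social security # is 400-05-4356")` yields the record shown on the slide. A production extractor would need context rules to avoid false positives on other hyphenated numbers.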

Extraction Example: Fact Extraction (Tables)
Singapore 2012 Annual Report (136 pages PDF)
• Identify note breaking down Operating expenses line item, and extract opex components
• Identify line item for Operating expenses from Income statement (financial table in pdf document)

Extraction Example: Sentiment Analysis
Mcdonalds mcnuggets are fake as shit but they so delicious.
We should do something cool like go to Z (kidding).
Makin chicken fries at home bc everyone sucks!
You are never too old for Disney movies.
Social Media
Different domain → Different ball game!
Bank X got me ****ed up today!
Customer Reviews
Not a pleasant client experience. Please fix ASAP.
I'm still hearing from clients that Merrill's website is better.
X... fixing something that wasn't broken
Analyst Research Reports
Intel's 2013 capex is elevated at 23% of sales, above average of 16%
IBM announced 4Q2012 earnings of $5.13 per share, compared with 4Q2011 earnings of $4.62 per share, an increase of 11 percent
We continue to rate shares of MSFT neutral.
Sell EUR/CHF at market for a decline to 1.31000…
FHLMC reported $4.4bn net loss and requested $6bn in capital from Treasury.
Equity
Fixed Income
Currency

Enterprise Requirements
• Accuracy
– Garbage-in garbage-out: Usefulness of analytics results is tied to the quality of extraction
• Scalability
– Large data volumes, often orders of magnitude larger than classical NLP corpora
– Social Media: Twitter alone has 400M+ messages / day; 1TB+ per day
– Financial Data: SEC alone has 20M+ filings, several TBs of data, with documents ranging from a few KBs to a few MBs
– Machine Data: One application server under moderate load at medium logging level → 1GB of logs per day
• Transparency
– Every customer’s data and problems are unique in some way
– Need to quickly implement new business logic
• Usability
– Building an accurate IE system is labor-intensive
– Professional programmers are expensive
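To make the scalability figures above more concrete, the daily volumes can be converted into sustained per-second rates. The sketch below is simple arithmetic only; the class and method names are hypothetical, and 1TB is taken as 10^12 bytes, which is an assumption.

```java
/** Illustrative only: converts a per-day volume into an average per-second rate. */
final class FeedRateSketch {

    static final long SECONDS_PER_DAY = 24L * 60 * 60; // 86,400

    /** Average rate per second for a given daily total. */
    static double perSecond(double perDay) {
        return perDay / SECONDS_PER_DAY;
    }
}
```

Under these assumptions, `FeedRateSketch.perSecond(400e6)` gives roughly 4,630 messages per second, and `FeedRateSketch.perSecond(1e12)` gives roughly 11.6 million bytes per second, both as averages that ignore peak load.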

SystemT's 5-Layer Architecture
• Tooling + UI: Eclipse tooling (editor, workflow analysis, pattern discovery, regex learner, profiler, …)
• Generic Libraries: NE libraries, Sentiment, informal-text normalizer, …
• Extraction Language: AQL
• Compiler + Optimizer: Cost-based optimizer
• Core Operations and Execution Engine: 20+ languages currently supported; scalable, embeddable runtime

package com.ibm.avatar.algebra.util.sentence;
import java.io.BufferedWriter;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.regex.Matcher;
public class SentenceChunker
{
private Matcher sentenceEndingMatcher = null;
public static BufferedWriter sentenceBufferedWriter = null;
private HashSet<String> abbreviations = new HashSet<String> ();
public SentenceChunker ()
{
}
/** Constructor that takes in the abbreviations directly. */
public SentenceChunker (String[] abbreviations)
{
// Generate the abbreviations directly.
for (String abbr : abbreviations) {
this.abbreviations.add (abbr);
}
}
/**
* @param doc the document text to be analyzed
* @return true if the document contains at least one sentence boundary
*/
public boolean containsSentenceBoundary (String doc)
{
String origDoc = doc;
/*
* Based on getSentenceOffsetArrayList()
*/
// String origDoc = doc;
// int dotpos, quepos, exclpos, newlinepos;
int boundary;
int currentOffset = 0;
do {
/* Get the next tentative boundary for the sentenceString */
setDocumentForObtainingBoundaries (doc);
boundary = getNextCandidateBoundary ();
if (boundary != -1) {
String candidate = doc.substring (0, boundary + 1);
String remainder = doc.substring (boundary + 1);
/*
* Looks at the last character of the String. If this last
* character is part of an abbreviation (as detected by
* REGEX) then the sentenceString is not a fullSentence and
* "false" is returned
*/
// while (!(isFullSentence(candidate) &&
// doesNotBeginWithCaps(remainder))) {
while (!(doesNotBeginWithPunctuation (remainder)
&& isFullSentence (candidate))) {
int nextBoundary = getNextCandidateBoundary ();
if (nextBoundary == -1) {
break;
}
boundary = nextBoundary;
candidate = doc.substring (0, boundary + 1);
remainder = doc.substring (boundary + 1);
}
if (candidate.length () > 0) {
// sentences.addElement(candidate.trim().replaceAll("\n", " "));
// sentenceArrayList.add(new Integer(currentOffset + boundary
// + 1));
// currentOffset += boundary + 1;
// Found a sentence boundary. If the boundary is the last
// character in the string, we don't consider it to be
// contained within the string.
int baseOffset = currentOffset + boundary + 1;
if (baseOffset < origDoc.length ()) {
// System.err.printf("Sentence ends at %d of %d\n",
// baseOffset, origDoc.length());
return true;
}
else {
return false;
}
}
// origDoc.substring(0,currentOffset));
// doc = doc.substring(boundary + 1);
doc = remainder;
}
}
while (boundary != -1);
// If we get here, didn't find any boundaries.
return false;
}
public ArrayList<Integer> getSentenceOffsetArrayList (String doc)
{
ArrayList<Integer> sentenceArrayList = new ArrayList<Integer> ();
// String origDoc = doc;
// int dotpos, quepos, exclpos, newlinepos;
int boundary;
int currentOffset = 0;
sentenceArrayList.add (new Integer (0));
do {
setDocumentForObtainingBoundaries (doc);
boundary = getNextCandidateBoundary ();
if (boundary != -1) {
String candidate = doc.substring (0, boundary + 1);
String remainder = doc.substring (boundary + 1);
/*
* Looks at the last character of the String. If this last character
* is part of an abbreviation (as detected by REGEX) then the
* sentenceString is not a fullSentence and "false" is returned
*/
// while (!(isFullSentence(candidate) &&
// doesNotBeginWithCaps(remainder))) {
while (!(doesNotBeginWithPunctuation (remainder) &&
isFullSentence (candidate))) {
int nextBoundary = getNextCandidateBoundary ();
if (nextBoundary == -1) {
break;
}
boundary = nextBoundary;
candidate = doc.substring (0, boundary + 1);
remainder = doc.substring (boundary + 1);
}
if (candidate.length () > 0) {
sentenceArrayList.add (new Integer (currentOffset + boundary + 1));
currentOffset += boundary + 1;
}
// origDoc.substring(0,currentOffset));
doc = remainder;
}
}
while (boundary != -1);
if (doc.length () > 0) {
sentenceArrayList.add (new Integer (currentOffset + doc.length ()));
}
sentenceArrayList.trimToSize ();
return sentenceArrayList;
}
private void setDocumentForObtainingBoundaries (String doc)
{
sentenceEndingMatcher = SentenceConstants.
sentenceEndingPattern.matcher (doc);
}
private int getNextCandidateBoundary ()
{
if (sentenceEndingMatcher.find ()) {
return sentenceEndingMatcher.start ();
}
else
return -1;
}
private boolean doesNotBeginWithPunctuation (String remainder)
{
Matcher m = SentenceConstants.punctuationPattern.matcher (remainder);
return (!m.find ());
}
private String getLastWord (String cand)
{
Matcher lastWordMatcher = SentenceConstants.lastWordPattern.matcher (cand);
if (lastWordMatcher.find ()) {
return lastWordMatcher.group ();
}
else {
return "";
}
}
/*
* Looks at the last character of the String. If this last character is
* part of an abbreviation (as detected by REGEX)
* then the sentenceString is not a fullSentence and "false" is returned
*/
private boolean isFullSentence (String cand)
{
// cand = cand.replaceAll("\n", " "); cand = " " + cand;
Matcher validSentenceBoundaryMatcher =
SentenceConstants.validSentenceBoundaryPattern.matcher (cand);
if (validSentenceBoundaryMatcher.find ()) return true;
Matcher abbrevMatcher = SentenceConstants.abbrevPattern.matcher (cand);
if (abbrevMatcher.find ()) {
return false; // Means it ends with an abbreviation
}
else {
// Check if the last word of the sentenceString has an entry in the
// abbreviations dictionary (like Mr etc.)
String lastword = getLastWord (cand);
if (abbreviations.contains (lastword)) { return false; }
}
return true;
}
}
Java Implementation of Sentence Boundary Detection
Easy to Develop & Maintain
create dictionary AbbrevDict from file
'abbreviation.dict';
create view SentenceBoundary as
select R.match as boundary
from ( extract regex /(([.?!]+\s)|(\n\s*\n))/
on D.text as match from Document D ) R
where
Not(ContainsDict('AbbrevDict',
CombineSpans(LeftContextTok(R.match, 1),R.match)));
Equivalent AQL Implementation
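To show what the AQL rule computes, here is a minimal Java sketch of the same logic: find candidate boundaries with a regular expression, then drop those whose left-context token plus the match forms a known abbreviation. It is only an approximation. The class and method names are hypothetical, the "last run of letters" is a simplified stand-in for `LeftContextTok`, and an exact set lookup stands in for `ContainsDict`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: approximates the SentenceBoundary view with java.util.regex. */
final class SentenceBoundarySketch {

    // Candidate boundary: a run of . ? ! followed by whitespace, or a blank line.
    private static final Pattern CANDIDATE =
        Pattern.compile("([.?!]+\\s)|(\\n\\s*\\n)");

    // Simplified stand-in for LeftContextTok(R.match, 1): letters right before the match.
    private static final Pattern LAST_WORD = Pattern.compile("[A-Za-z]+$");

    /** Returns the start offsets of candidates that do not complete an abbreviation. */
    static List<Integer> boundaries(String text, Set<String> abbreviations) {
        List<Integer> result = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            Matcher w = LAST_WORD.matcher(text.substring(0, m.start()));
            String leftToken = w.find() ? w.group() : "";
            // Stand-in for CombineSpans + ContainsDict: e.g. "Mr" + ". " becomes "Mr."
            String combined = (leftToken + m.group()).trim();
            if (!abbreviations.contains(combined)) {
                result.add(m.start());
            }
        }
        return result;
    }
}
```

With an abbreviation set containing only `"Mr."`, the text `"Mr. Smith left. He was late."` produces a single boundary, at the period after "left". The period after "Mr" is filtered out, and the final period is never a candidate because no whitespace follows it.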

Scalability via Optimization
[Chart comparing a state-of-the-art industrial system, a state-of-the-art open-source system, and SystemT; chart labels: 10x, 50x]
SystemT needs less than 10% of the cores to keep up with Twitter's daily feed without drop in quality

SystemT in IBM Products
Two groups of products:
• Products fully exposing the AQL language and engine
• Products utilizing SystemT for pre-determined extraction tasks
Products shown: InfoSphere BigInsights; InfoSphere Streams; Social Media Analytics; eDiscovery Analyzer; Optim Data Lifecycle Management Solutions; SmartCloud Analytics: Log Analysis; Watson Discovery Advisor
Installed on 400+ client sites

Drug Development Pipeline
• Time to develop a drug: 12-15 years
• Avg cost to develop a drug: 1.2 billion
• Laboratory (50,000+ compounds) → Pre-Clinical (250 compounds) → Clinical (5 compounds) → FDA approval (1 compound)
• Phases: 0-I, II, III, IV
"…Toxicity and Serious Adverse Events in Late Stage Drug Development are the Major Causes of Drug Failure"
Intervention should happen at critical transition bottlenecks between stages (most likely to impact outcome)
Adapted from PhRMA (Pharmaceutical Research and Manufacturers of America) 2013 profile

Case Study: Drug Discovery
Laboratory → Pre-clinical → Clinical → FDA Approval → Bedside
Objective: enable data-driven decision making through more efficient clinical trial design, data analytics and drug success / failure predictions

Extract known facts about a drug
• Structure
• Indication
• Mode of action / target
• Effect level
• Side Effect
• …
Structured and unstructured data sources
A 360° view of drugs allows comparison between drugs and therefore helps predict drug behavior

Information Extraction Example
Unstructured data contain chemical compound information in many different forms
• As text: chemical names found in the text of documents
• As bitmap images: pictures of chemicals found in the document images
• Patents also have (manually created) Chemical Complex Work Units (CWU's)
Chemical nomenclature can be daunting

From unstructured to structured content
Sources: internal reports; public Medline articles with limited structured info
Example of extracted structure:
POPULATION
gender FEMALE,MALE
species RAT
strain Sprague Dawley
precondition pregnant
Renders content searchable and avoids redundant and expensive tests
Building block for automation of the discovery and decision process
• Handle variants in entity reference
• Disambiguate based on context, etc…
• Extract relationships
• Assemble complex concepts
Combine rule-based extraction with Machine Learning
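As a rough illustration of handling variants and assembling a POPULATION record like the one above, the Java sketch below maps surface variants to normalized field values. Everything here is hypothetical: the class name, the tiny variant lists, and the input sentence are assumptions for the example, and real extraction would also need the context-based disambiguation and machine learning mentioned on the slide.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;
import java.util.regex.Pattern;

/** Illustrative only: tiny hand-written variant lists stand in for real extraction rules. */
final class PopulationSketch {

    /** True if any of the '|'-separated alternatives occurs as a whole word or phrase. */
    private static boolean mentions(String lowerText, String alternatives) {
        return Pattern.compile("\\b(?:" + alternatives + ")\\b").matcher(lowerText).find();
    }

    /** Assembles normalized fields from one sentence; absent fields are simply omitted. */
    static Map<String, String> extract(String sentence) {
        String s = sentence.toLowerCase(Locale.ROOT);
        Map<String, String> record = new LinkedHashMap<>();

        // Word boundaries keep "male" from matching inside "female".
        StringJoiner gender = new StringJoiner(",");
        if (mentions(s, "female|females")) gender.add("FEMALE");
        if (mentions(s, "male|males")) gender.add("MALE");
        if (gender.length() > 0) record.put("gender", gender.toString());

        if (mentions(s, "rat|rats")) record.put("species", "RAT");
        if (mentions(s, "sprague[- ]dawley")) record.put("strain", "Sprague Dawley");
        if (mentions(s, "pregnant|gravid")) record.put("precondition", "pregnant");
        return record;
    }
}
```

For a made-up input such as "Pregnant female and male Sprague-Dawley rats were dosed daily.", the sketch produces the four fields of the POPULATION example.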

1. Extract mentions
2. Assemble mentions into information
• Study: Study name, Size, Population description, Time interval, Intervention, Phase, Type, Authorship
• Authorship: Author, Author affiliation, Funding org
• Intervention: Compound, Target, Admin route, Dosage, Type (preventive / therapeutic)
• Population: Species, Subspecies, Gender, Age, Pre-existing conditions, Physical attributes, Geographic Location
• Efficacy Results: In-life Observations, Hematology, Clinical chem, Comparative Macroscopic Results, Comparative Microscopic Results

Example Use Case: Drug Development
Preliminary Results: Suggest New p53 Kinases
1. Quantify the "literature distances" between proteins
2. Cluster proteins by their literature distance: known p53 kinases cluster (green) and suggest others that may also phosphorylate p53.
Bag of Words Model
Legend: GREEN – p53 Kinases; RED/ORANGE/YELLOW – Predicted Targets
Proteins labeled in the figure: BMPR2, CHEK2
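The slide does not say how the "literature distance" is computed from the bag-of-words model. The Java sketch below shows one common option, cosine distance between word-count vectors, purely as an assumption. The class and method names are hypothetical, and the texts passed in would be whatever literature is associated with each protein.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative only: cosine distance over word counts is an assumed choice of metric. */
final class LiteratureDistanceSketch {

    /** Counts lower-cased alphanumeric tokens in the text. */
    static Map<String, Integer> bagOfWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns 1 - cosine similarity; 1.0 if either bag is empty. */
    static double cosineDistance(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 1.0;
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Pairwise distances computed this way could then be fed to any standard clustering method for step 2. Texts that share no words are at distance 1, and identical texts are at distance 0 up to floating-point rounding.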

Summary
• SystemT
– Declarative IE system based on an algebraic framework
– Expressivity and performance advantages over grammar-based IE systems
– Text-specific optimizations
– Key enabling technology for Big Data analytics applications
• Ongoing Work
– Tooling support for non-programmers
– Improved optimization strategies
– New operators for advanced features (e.g. co-reference resolution, text normalization)

Thank you!
• For more information…
– Visit our website: https://ibm.biz/BdF4GQ
– Get a free copy:
• IBM BigInsights Quick start Edition http://www-01.ibm.com/software/data/infosphere/biginsights/quick-start/
• IBM Academic Initiative: http://www-03.ibm.com/ibm/university/academic/pub/page/academic_initiative.
InfoSphere BigInsights 2.1.1 (and Streams 3.0) are freely available for academic purposes.
– Take free online classes: BigDataUniversity.com
– Watch YouTube videos:
• IBM InfoSphere BigInsights Text Analytics Channel: http://www.youtube.com/playlist?list=PL2C66C6E455AD0216
• IBM Big Data Channel: http://www.youtube.com/user/ibmbigdata
• IBM InfoSphere Streams Channel: http://www.youtube.com/playlist?list=PLCF04A48C22F34B19
– Contact me
• yunyaoli@us.ibm.com
SystemT Team:
Laura Chiticariu, Marina Danilevsky, Howard Ho, Rajasekar Krishnamurthy, Yunyao Li
Sriram Raghavan, Frederick Reiss, Shivakumar Vaithyanathan, Huaiyu Zhu
