Neural NLP Models of Information Extraction
Presenter: Pankaj Gupta | PhD with Prof. Hinrich Schütze | Research Scientist
University of Munich (LMU) | Siemens AG, Munich Germany
Venue: Google AI, New York City | 25 Mar, 2019
About Me: Affiliations (timeline)
2006-10: Bachelors (B.Tech-IT)
2009: Bachelor Internship
2010-13: Senior Software Developer
2013-15: Masters (MSc-CS) & Research Assistant; Working Student & Master Thesis
2015: Starting PhD
2016-17: PhD Research Intern (4 months)
2017-Now: Research Scientist - NLP/ML
2019: PhD Submission
Master Thesis: Deep Learning Methods for the Extraction of Relations in Natural Language Text
PhD Thesis Title (tentative): Neural Models of Information Extraction from Natural Language Text
Reach me: https://sites.google.com/view/gupta-pankaj/
About Me: Research
Neural Relation Extraction:
➢ Intra- and Inter-sentential RE
➢ Joint Entity & RE
➢ Weakly-supervised Bootstrapping RE
➢ Interpretable RE
Neural Topic Modeling:
➢ Autoregressive TMs
➢ Word Embeddings Aware TM
➢ Language Structure Aware TM (textTOvec)
➢ Multi-view Transfer Learning in TM
➢ Interpretable topics
Cross-cutting themes: Interpretability (explaining RNN predictions), Transfer Learning, Lifelong Learning
Outline
Two Tracks:
Track 1/2: Relation Extraction
Neural Relation Extraction Within and Across Sentence Boundaries
Pankaj Gupta, Subburam Rajaram, Hinrich Schütze, Thomas Runkler. In AAAI-2019.
Track 2/2: Topic Modeling & Representation Learning (briefly)
Document Informed Neural Autoregressive Topic Models with Distributional Prior
Pankaj Gupta, Yatin Chaudhary, Florian Buettner, Hinrich Schütze. In AAAI-2019.
textTOvec: Deep Contextualized Neural Autoregressive Topic Models of Language with Distributed Compositional Prior
Pankaj Gupta, Yatin Chaudhary, Florian Buettner, Hinrich Schütze. To appear in ICLR-2019.
Neural Relation Extraction Within and Across
Sentence Boundaries
Introduction: Relation Extraction spanning sentence boundaries
Proposed Methods
➢ Inter-sentential Dependency-Based Neural Networks (iDepNN)
→ Inter-sentential Shortest Dependency Path (iDepNN-SDP)
→ Inter-sentential Augmented Dependency Path (iDepNN-ADP)
Evaluation and Analysis
➢ State-of-the-art comparison
➢ Error analysis
Introduction: Relation Extraction (RE)
Binary Relation Extraction (RE):
- Identify the semantic relationship between a pair of nominals or entities e1 and e2 in a given text snippet S.
Example: Paul Allen has started a company and named [Vern Raburn]e1 its [president]e2 .
relation: per-post(e1,e2)
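To make the task interface concrete, here is a minimal sketch of binary RE as multi-class classification over a snippet S and a marked entity pair; the encoder and method names below are illustrative assumptions, not the system presented in this talk.

```python
from typing import Tuple

def extract_relation(model, S: str,
                     e1: Tuple[int, int], e2: Tuple[int, int]) -> str:
    """Sketch: return a relation label such as 'per-post' (or 'no_relation')
    for the entity pair marked by the character spans e1 and e2 in S.
    `model.encode` / `model.predict` are hypothetical placeholders for any
    snippet encoder and relation classifier."""
    features = model.encode(S, e1, e2)   # encode snippet + entity positions
    return model.predict(features)       # label from a fixed relation inventory
```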
Need for Relation Extraction (RE)
→ A large part of the world's information is expressed in free text, e.g., in web pages, blogs, social media, etc.
→ Need automatic systems that extract the relevant information into a structured KB
Information Extraction:
• Entity Extraction: detect entities such as person, organization, location, product, technology, sensor, etc.
• Relation Extraction: detect the relation between the given entities or nominals
• Structure the unstructured text
• Knowledge Graph Construction
• Used in web search, retrieval, Q&A, etc.
End-to-End Knowledge Base Population:
[Diagram: Text Documents → IE Engine → Knowledge Graph, with entities (e.g., Sensor) linked by relations (e.g., Competitor-of)]
Introduction: Relation Extraction (RE)
Relation Extraction, based on the location of entities:
→ intra-sentential (entities within the sentence boundary): most prior works
→ inter-sentential (entities across sentence boundaries): this work
Intra-sentential example:
Paul Allen has started a company and named [Vern Raburn]e1 its [president]e2 .
relation: per-post(e1,e2)
Inter-sentential example:
Paul Allen has started a company and named [Vern Raburn]e1 its president. The company, to be called [Paul Allen Group]e2 will be based in Bellevue, Washington.
relation: per-org(e1,e2)
Intra-sentential systems MISS such cross-sentence relationships, which impacts system performance and leads to POOR RECALL.
Goal: capture relationships between entities at a distance, across sentence boundaries.
Challenges in Inter-sentential Relation Extraction (RE)
Example (inter-sentential; entities across sentence boundaries):
Paul Allen has started a company and named [Vern Raburn]e1 its president. The company will coordinate the overall strategy for the group of high-tech companies that Mr. Allen owns or holds a significant stake in, will be based in Bellevue, Washington and called [Paul Allen Group]e2 .
relation: per-org(e1,e2)
Challenge: NOISY text in relationships spanning sentence boundaries → POOR PRECISION
Need: a robust system that tackles false positives in inter-sentential RE.
Motivation: Dependency Based Relation Extraction
1. Dependency parse trees are effective in extracting relationships
→ but limited to single sentences, i.e., intra-sentential relationships
2. The Shortest Dependency Path (SDP) between entities in parse trees is effective in RE
→ but limited to single sentences, and it ignores additional information relevant to relation identification
3. The Augmented Dependency Path (ADP) models relationships precisely
→ but limited to single sentences
4. Tree-RNNs are effective in modeling relations via recursive compositionality
→ but limited to single sentences
→ This work exploits these properties in inter-sentential RE via a unified neural framework: a bi-RNN modeling the SDP and a RecNN modeling the ADP.
Motivation: Dependency Based Relation Extraction (Example)
Sentences and their dependency graphs:
Paul Allen has started a company and named [Vern Raburn]e1 its president. The company, to be called [Paul Allen Group]e2 will be based in Bellevue, Washington.
→ Shortest Dependency Path (SDP) from the root of each sentence to entity e1 and to entity e2
→ Inter-sentential Shortest Dependency Path (iSDP) across the sentence boundary: connect the roots of adjacent sentences by a NEXTS link
→ Dependency subtrees hanging off the iSDP carry additional context for the relation (see the sketch below).
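The iSDP construction can be sketched as a small graph operation; the input format, token ids and use of networkx below are assumptions for illustration, while the NEXTS link between adjacent roots follows the slide.

```python
import networkx as nx

def build_isdp(sent_edges, roots, e1, e2):
    """Sketch: inter-sentential shortest dependency path. Each sentence's
    dependency tree is a list of (head, dependent) pairs over globally
    unique token ids; adjacent sentence roots are linked by a NEXTS edge,
    and the iSDP is the shortest path between the two entity tokens."""
    g = nx.Graph()
    for edges in sent_edges:                  # intra-sentential dependencies
        g.add_edges_from(edges)
    for r1, r2 in zip(roots, roots[1:]):      # NEXTS links across sentences
        g.add_edge(r1, r2, label="NEXTS")
    return nx.shortest_path(g, source=e1, target=e2)

# Toy usage: two 3-token sentences (ids 0-2 and 3-5), roots 1 and 4,
# entities at tokens 0 and 5.
path = build_isdp([[(1, 0), (1, 2)], [(4, 3), (4, 5)]], roots=[1, 4], e1=0, e2=5)
print(path)   # [0, 1, 4, 5] -- crosses the sentence boundary via NEXTS
```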
Contribution
Propose a novel neural approach for inter-sentential relation extraction:
1. A neural architecture based on dependency parse trees, named the inter-sentential Dependency-based Neural Network (iDepNN)
2. A unified neural framework of a bidirectional RNN (biRNN) and a Recursive NN (RecNN)
3. Extract relations within and across sentence boundaries by modeling:
➔ the shortest dependency path (SDP) using the biRNN
➔ the augmented dependency path (ADP) using the RecNN
Benefits:
1. Precisely extract relationships within and across sentence boundaries
2. A better balance of precision and recall, with an improved F1 score
Proposed Approach: Neural Intra- and Inter-sentential RE
Proposed Approach: Intra- and Inter-sentential RE
Inter-sentential Dependency-based Neural Network variants: iDepNN-SDP and iDepNN-ADP
1. iDepNN-SDP: model the inter-sentential shortest dependency path (iSDP) with a bidirectional RNN
2. Model the inter-sentential dependency subtrees with a recursive NN, computing an embedding for each subtree
1+2 = iDepNN-ADP: model the inter-sentential augmented dependency path
→ offers a precise structure
→ offers additional information for classifying the relation
A sketch of this wiring follows.
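A minimal PyTorch sketch, with assumed dimensions (not the authors' released code): the biRNN encodes word vectors along the iSDP, the RecNN composes each dependency subtree into an embedding, and in the ADP variant the subtree embeddings augment the path-word inputs.

```python
import torch
import torch.nn as nn

class IDepNNSketch(nn.Module):
    """Illustrative sketch of iDepNN-SDP/-ADP (assumed dims and wiring)."""
    def __init__(self, emb_dim=100, hid_dim=100, n_rel=5):
        super().__init__()
        self.birnn = nn.RNN(emb_dim, hid_dim, bidirectional=True,
                            batch_first=True)
        self.compose = nn.Linear(2 * emb_dim, emb_dim)  # RecNN composition
        self.clf = nn.Linear(2 * hid_dim, n_rel)

    def subtree_embedding(self, node_emb, child_embs):
        # RecNN: fold the children of a subtree node into its embedding
        h = node_emb
        for c in child_embs:
            h = torch.tanh(self.compose(torch.cat([h, c], dim=-1)))
        return h

    def forward(self, path_embs, subtree_embs=None):
        # path_embs: (1, path_len, emb_dim) word vectors along the iSDP;
        # iDepNN-ADP additionally adds each path word's subtree embedding
        if subtree_embs is not None:
            path_embs = path_embs + subtree_embs
        out, _ = self.birnn(path_embs)      # biRNN over the (augmented) path
        return self.clf(out[:, -1])         # relation scores
```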
Evaluation and Analysis
Evaluation and Analysis: Datasets
➢ Evaluate on four datasets from the medical and news domains
[Table: count of intra- and inter-sentential relationships in the datasets]
Results discussed in this talk: the BioNLP ST 2016 dataset.
Lives_In relation → two arguments: the bacterium and the location,
where location → a Habitat (e.g., microbial ecology such as hosts, environment, food, etc.) or a Geographical entity (e.g., geographical and organization places)
Data: http://2016.bionlp-st.org/tasks/bb2
Evaluation and Analysis: Baselines
→ SVM, graphLSTM, i-biRNN and i-biLSTM
graphLSTM: Peng et al., 2017. Cross-Sentence N-ary Relation Extraction with Graph LSTMs.
Results (Precision / Recall / F1)
Sentence range k:
k = 0 ➔ intra-sentential
k > 0 ➔ inter-sentential
Results (Precision / Recall / F1): Intra-sentential Training
iDepNN-ADP is more precise in inter-sentential RE than both SVM and graphLSTM.
iDepNN-ADP outperforms both SVM and graphLSTM in F1 for inter-sentential RE, due to a better balance of precision and recall.
Results (Precision / Recall / F1): Inter-sentential Training
iDepNN-ADP outperforms both SVM and graphLSTM in precision and F1 for inter-sentential RE.
Results (Precision / Recall / F1): Ensemble
Ensemble with thresholding on the prediction probability:
Ensemble scores at various thresholds (p: output probability; pr: the count of predictions) -- see the sketch below.
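A sketch of such thresholding; averaging the members' probabilities is an assumption about the combination rule, while the thresholding itself follows the slide.

```python
import numpy as np

def ensemble_with_threshold(member_probs, threshold):
    """Sketch: average the ensemble members' output probabilities p, predict
    the argmax class, and keep only predictions whose probability exceeds the
    threshold. pr, the count of retained predictions, shrinks as the
    threshold grows, trading recall for precision."""
    p = np.mean(member_probs, axis=0)     # (n_examples, n_classes)
    pred = p.argmax(axis=1)
    keep = p.max(axis=1) >= threshold     # confidence filter
    return pred, keep, int(keep.sum())    # predictions, mask, pr
```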
Official Scores: State-of-the-art Comparison
Official results on the test set: comparison with the published systems in the BioNLP ST 2016.
Error Analysis: BioNLP ST 2016 dataset
iDepNN-ADP produces few false positives, compared to both SVM and graphLSTM.
[Figure: false positives produced by iDepNN-ADP, SVM and graphLSTM]
Key Takeaways
➢ Propose a novel neural approach, iDepNN, for inter-sentential relation extraction
➢ Precisely extract relations within and across sentence boundaries by modeling:
➔ the shortest dependency path (SDP) using a biRNN, i.e., iDepNN-SDP
➔ the augmented dependency path (ADP) using a RecNN, i.e., iDepNN-ADP
➢ Demonstrate a better balance of precision and recall, with an improved F1 score
➢ Evaluate on 4 datasets from the news and medical domains
➢ Achieve a gain of 5.2% (0.587 vs 0.558) in F1 over the winning team (out of 11 teams) in the BioNLP Shared Task (ST) 2016
Code and Data: https://github.com/pgcool/Cross-sentence-Relation-Extraction-iDepNN
Outline
Track 1/2: Relation Extraction
Neural Relation Extraction Within and Across Sentence Boundaries
Pankaj Gupta, Subburam Rajaram, Hinrich Schütze, Thomas Runkler. In AAAI-2019.
Active Research in Information Extraction:
→ Neural Models of Lifelong Learning for Information Extraction
→ Weakly-supervised Neural Bootstrapping for Relation Extraction
(PhD Student: Mr. Usama Yaseen)
Research Outline
Neural Relation Extraction:
➢ Intra- and Inter-sentential RE
➢ Joint Entity & RE
➢ Weakly-supervised Bootstrapping RE
➢ Interpretable RE
Neural Topic Modeling:
➢ Autoregressive TMs
➢ Word Embeddings Aware TM
➢ Language Structure Aware TM (textTOvec)
➢ Multi-view Transfer Learning in TM
➢ Interpretable topics
Cross-cutting themes: Interpretability (explaining RNN predictions), Transfer Learning, Lifelong Learning
Outline (Brief Introduction)
Track 2/2: Topic Modeling & Representation Learning
Document Informed Neural Autoregressive Topic Models with Distributional Prior
Pankaj Gupta, Yatin Chaudhary, Florian Buettner, Hinrich Schütze. In AAAI-2019.
TL;DR → Improved topic modeling with full document contexts and pre-trained word embeddings
textTOvec: Deep Contextualized Neural Autoregressive Topic Models of Language with Distributed Compositional Prior
Pankaj Gupta, Yatin Chaudhary, Florian Buettner, Hinrich Schütze. To appear in ICLR-2019.
TL;DR → Improved topic modeling with language structures (e.g., word ordering, local syntax and semantic information); a composite model of a neural topic model and a neural language model
Multi-view and Multi-source Transfers in Neural Topic Modeling
Pankaj Gupta, Yatin Chaudhary, Hinrich Schütze. Under review.
TL;DR → Improved topic modeling with knowledge transfer via local as well as global semantics
Topic Modeling
➢ Statistical modeling that examines how words co-occur across a collection of documents, and
➢ automatically discovers coherent groups of words (i.e., themes or topics) that best explain the corpus
➢ Each document is a mixture of topics, and each topic is a weighted collection of words
Source: http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf
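For a concrete, much simpler illustration of this document-topic and topic-word structure, here is a classical LDA run with gensim on a toy corpus; the neural models in this talk replace LDA, so this is only to show the two kinds of output.

```python
from gensim import corpora, models

docs = [["shares", "price", "fall", "market"],
        ["profits", "rises", "earnings", "shares"],
        ["bacterium", "habitat", "host", "environment"]]
dct = corpora.Dictionary(docs)
bow = [dct.doc2bow(d) for d in docs]          # bag-of-words per document

lda = models.LdaModel(bow, num_topics=2, id2word=dct, random_state=0)
print(lda.print_topics())   # each topic: a weighted group of words
print(lda[bow[0]])          # each document: a mixture over the topics
```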
Document Informed Neural Autoregressive Topic Models with Distributional Prior (AAAI-19)
Need for distributional semantics / prior knowledge:
➢ "Lack of context" in short-text documents, e.g., headlines, tweets, etc.
➢ "Lack of context" in a corpus of few documents
→ Few word co-occurrences → difficult to learn good representations → incoherent topics
Example topics for 'trading':
Topic1 (incoherent): price, wall, china, fall, shares
Topic2 (coherent): shares, price, profits, rises, earnings
TO THE RESCUE: use external/additional information, e.g., WORD EMBEDDINGS (they encode semantic and syntactic relatedness of words in a vector space)
→ Two documents of the same topic class ('trading') may have no word overlap under a 1-hot encoding; embeddings bridge this gap.
Document Informed Neural Autoregressive Topic Models with Distributional Prior (AAAI-19)
Baseline model vs. proposed model:
➢ Introduce a weighted aggregation of pre-trained word embeddings at each autoregressive step k (mixture weight)
➢ E: pre-trained embedding matrix (e.g., GloVe), used as a fixed prior
➢ Generate topics informed by embeddings
➢ Learn a complementary textual representation
A minimal sketch of the autoregressive step is shown below.
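A minimal numpy sketch of this step, using assumed notation (W: learned matrix; E: fixed pre-trained embedding matrix, e.g., GloVe; lam: the mixture weight): at step k, the hidden state aggregates the W and lam*E columns of all preceding words.

```python
import numpy as np

def hidden_states(doc, W, E, c, lam=1.0):
    """doc: word ids v_1..v_n; W, E: (hid, vocab); c: (hid,) bias.
    Returns h_1..h_n, where
    h_k = tanh(c + sum_{i<k} (W[:, v_i] + lam * E[:, v_i]))
    conditions the prediction p(v_k | v_<k)."""
    acc = np.asarray(c, dtype=float).copy()
    H = [np.tanh(acc)]                        # h_1: no preceding words
    for v in doc[:-1]:                        # fold each word into later states
        acc = acc + W[:, v] + lam * E[:, v]   # fixed embedding prior E
        H.append(np.tanh(acc))
    return H
```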
Document Informed Neural Autoregressive Topic Models with Distributional Prior (AAAI-19)
Evaluation: applicability to information retrieval
→ IR-precision on short-text datasets: precision at different retrieval fractions (higher is better)
Document Informed Neural Autoregressive Topic Models with Distributional Prior (AAAI-19)
Take-aways of this work:
➢ Leverage full contextual information in a neural autoregressive topic model
➢ Introduce distributional priors via pre-trained word embeddings
➢ Gains, on average over 15 datasets: 5.2% (404 vs 426) in perplexity, 2.8% (.74 vs .72) in topic coherence, 11.1% (.60 vs .54) in precision at retrieval fraction 0.02, and 5.2% (.664 vs .631) in F1 for text categorization
➢ Learn better word/document representations for short and long texts
Try it out: the code and data are available at https://github.com/pgcool/iDocNADEe
@PankajGupta262
Local vs Global Semantics
Language models have a LOCAL view (semantics):
→ A vector-space representation for each word, based on local word-collocation patterns
→ Built from word-word co-occurrence, limited to a context window (e.g., word2vec) or a sentence (e.g., ELMo)
→ Information beyond the limited context is not exposed
→ Good at capturing local syntactic and semantic information
→ Difficulties in capturing long-range dependencies
Topic models have a GLOBAL view (semantics):
→ Built from document-word occurrences (i.e., words are similar if they appear similarly across documents)
→ Access to document-level context, not limited to a local window; each topic is learned by leveraging statistical information across documents
→ Good at capturing thematic structures and long-range dependencies in a document collection
→ But: no language structure (e.g., word ordering, local syntactic and semantic information)
→ Difficulties in capturing short-range dependencies
Example (source text → sense/topic):
"Market falls into bear territory" → 'trading'
"Bear falls into market territory" → 'trading'
Same unigram statistics, but different meanings: language structure helps in determining the actual meaning!
textTOvec: Deep Contextualized Neural Autoregressive Topic Models of Language with Distributed Compositional Prior (ICLR-19)
Improving topic modeling for short-text and long-text documents via contextualized features and external knowledge:
1. Incorporate language structure into topic models
→ account for word ordering and latent syntactic and semantic features
→ improve word and document representations, including polysemy
2. Incorporate external knowledge for each word
→ use distributional semantics, i.e., word embeddings
→ improve document representations and topics
textTOvec (ICLR-19): contextualized Document Neural Autoregressive Distribution Estimator (ctx-DocNADE), with pre-trained word embeddings (ctx-DocNADEe)
Advantages of composite modeling:
→ Introduces language structure into neural autoregressive topic models via an LSTM-LM: word ordering, language concepts and long-range dependencies
→ The probability of each word is a function of global and local contexts, modeled via DocNADE and the LSTM-LM, respectively
→ Learns complementary semantics by combining joint word and latent-topic learning in a unified neural autoregressive framework
A minimal sketch of the composite step follows.
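A minimal PyTorch sketch of this mixing, with assumed dimensions and wiring (not the released textTOvec code): a DocNADE-style running sum provides the global view, the LSTM-LM hidden state the local view, and both are combined at each autoregressive step.

```python
import torch
import torch.nn as nn

class CtxDocNADESketch(nn.Module):
    """Illustrative composite of a neural topic model and an LSTM-LM."""
    def __init__(self, vocab, emb_dim=100, hid_dim=100, lam=0.5):
        super().__init__()
        self.W = nn.Embedding(vocab, hid_dim)    # topic-matrix columns W[:, v]
        self.inp = nn.Embedding(vocab, emb_dim)  # LM input embeddings
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab)
        self.lam = lam                            # global/local mixture weight

    def forward(self, doc):                       # doc: (1, n) word ids
        global_view = torch.cumsum(self.W(doc), dim=1)  # bag-of-words sums
        local_view, _ = self.lstm(self.inp(doc))        # LSTM-LM states o_k
        h = torch.tanh(global_view + self.lam * local_view)
        return self.out(h)     # position k scores the next word v_{k+1}
```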
Code: https://github.com/pgcool/textTOvec
Outline: Topic Modeling
Track 2/2: Topic Modeling & Representation Learning
Document Informed Neural Autoregressive Topic Models with Distributional Prior
Pankaj Gupta, Yatin Chaudhary, Florian Buettner, Hinrich Schütze. In AAAI-2019.
textTOvec: Deep Contextualized Neural Autoregressive Topic Models of Language with Distributed Compositional Prior
Pankaj Gupta, Yatin Chaudhary, Florian Buettner, Hinrich Schütze. To appear in ICLR-2019.
Active Research in Topic Modeling & Representation Learning:
→ Multi-view and Multi-source Transfers in Neural Topic Modeling
Pankaj Gupta, Yatin Chaudhary, Hinrich Schütze. Under review.
→ Improving Language Models with Global Semantics via Neural Composite Networks
→ Lifelong Neural Topic Learning
(PhD Student: Mr. Yatin Chaudhary)
Summary & Thanks !!
Neural Topic ModelingNeural Relation Extraction
Interpretability
➢ Intra- and Inter-sentential RE
➢ Joint Entity & RE
➢ Weakly-supervised
Bootstrapping RE
➢ Autoregressive TMs
➢ Word Embeddings Aware TM
➢ Language Structure Aware TM (textTOvec)
➢ Multi-view Transfer Learning in TM
Interpretable RE
Interpretable
topics
Transfer Learning
Lifelong Learning
➢ Explaining RNN predictions
ReachMe / Talks: https://sites.google.com/view/gupta-pankaj/

More Related Content

PDF
textTOvec: DEEP CONTEXTUALIZED NEURAL AUTOREGRESSIVE TOPIC MODELS OF LANGUAGE...
PDF
Neural Relation ExtractionWithin and Across Sentence Boundaries
PDF
Deep Learning for Information Extraction in Natural Language Text
PDF
Lecture 07: Representation and Distributional Learning by Pankaj Gupta
PDF
Document Informed Neural Autoregressive Topic Models with Distributional Prior
PDF
Lecture 05: Recurrent Neural Networks / Deep Learning by Pankaj Gupta
PDF
IRJET- Visual Information Narrator using Neural Network
PPTX
Text Data Mining
textTOvec: DEEP CONTEXTUALIZED NEURAL AUTOREGRESSIVE TOPIC MODELS OF LANGUAGE...
Neural Relation ExtractionWithin and Across Sentence Boundaries
Deep Learning for Information Extraction in Natural Language Text
Lecture 07: Representation and Distributional Learning by Pankaj Gupta
Document Informed Neural Autoregressive Topic Models with Distributional Prior
Lecture 05: Recurrent Neural Networks / Deep Learning by Pankaj Gupta
IRJET- Visual Information Narrator using Neural Network
Text Data Mining

Similar to Neural NLP Models of Information Extraction (20)

PPTX
Information Extraction from Text, presented @ Deloitte
PPT
5-Information Extraction (IE) and Machine Translation (MT).ppt
PDF
Latent Relational Model for Relation Extraction
PPTX
Fun with Text - Managing Text Analytics
PPTX
Knowledge acquisition using automated techniques
PDF
EXTRACTING ARABIC RELATIONS FROM THE WEB
PDF
Knowledge extraction in Web media: at the frontier of NLP, Machine Learning a...
PPTX
Text analysis-semantic-search
PDF
Introduction to Natural Language Processing
PPTX
Natural Language Processing Advancements By Deep Learning - A Survey
PDF
Learning to Extract Relations for Protein Annotation
PPTX
PhD Research Proposal - Qualifying Exam
PDF
Text Analytics - JCC2014 Kimelfeld
PDF
A03730108
PPT
Download
PPT
Download
PDF
Novel Database-Centric Framework for Incremental Information Extraction
PDF
AI-SDV 2022: Creation and updating of large Knowledge Graphs through NLP Anal...
PDF
D017422528
Information Extraction from Text, presented @ Deloitte
5-Information Extraction (IE) and Machine Translation (MT).ppt
Latent Relational Model for Relation Extraction
Fun with Text - Managing Text Analytics
Knowledge acquisition using automated techniques
EXTRACTING ARABIC RELATIONS FROM THE WEB
Knowledge extraction in Web media: at the frontier of NLP, Machine Learning a...
Text analysis-semantic-search
Introduction to Natural Language Processing
Natural Language Processing Advancements By Deep Learning - A Survey
Learning to Extract Relations for Protein Annotation
PhD Research Proposal - Qualifying Exam
Text Analytics - JCC2014 Kimelfeld
A03730108
Download
Download
Novel Database-Centric Framework for Incremental Information Extraction
AI-SDV 2022: Creation and updating of large Knowledge Graphs through NLP Anal...
D017422528
Ad

More from Pankaj Gupta, PhD (8)

PDF
textTOvec: Deep Contextualized Neural Autoregressive Topic Models of Language...
PDF
Poster: Neural Relation ExtractionWithin and Across Sentence Boundaries
PDF
Poster: Document Informed Neural Autoregressive Topic Models with Distributio...
PDF
Pankaj Gupta CV / Resume
PDF
LISA: Explaining RNN Judgments via Layer-wIse Semantic Accumulation and Examp...
PDF
Replicated Siamese LSTM in Ticketing System for Similarity Learning and Retri...
PDF
Joint Bootstrapping Machines for High Confidence Relation Extraction
PDF
RNN-RSM (Topics over Time) | NAACL2018 conference talk
textTOvec: Deep Contextualized Neural Autoregressive Topic Models of Language...
Poster: Neural Relation ExtractionWithin and Across Sentence Boundaries
Poster: Document Informed Neural Autoregressive Topic Models with Distributio...
Pankaj Gupta CV / Resume
LISA: Explaining RNN Judgments via Layer-wIse Semantic Accumulation and Examp...
Replicated Siamese LSTM in Ticketing System for Similarity Learning and Retri...
Joint Bootstrapping Machines for High Confidence Relation Extraction
RNN-RSM (Topics over Time) | NAACL2018 conference talk
Ad

Recently uploaded (20)

PPTX
New ISO 27001_2022 standard and the changes
PPTX
IMPACT OF LANDSLIDE.....................
PDF
OneRead_20250728_1808.pdfhdhddhshahwhwwjjaaja
PPTX
A Complete Guide to Streamlining Business Processes
PDF
Introduction to Data Science and Data Analysis
PDF
Data Engineering Interview Questions & Answers Cloud Data Stacks (AWS, Azure,...
PDF
REAL ILLUMINATI AGENT IN KAMPALA UGANDA CALL ON+256765750853/0705037305
PDF
Introduction to the R Programming Language
PPTX
AI Strategy room jwfjksfksfjsjsjsjsjfsjfsj
PPTX
IBA_Chapter_11_Slides_Final_Accessible.pptx
PPTX
Introduction to Inferential Statistics.pptx
PDF
annual-report-2024-2025 original latest.
PPTX
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
PDF
Transcultural that can help you someday.
PDF
Jean-Georges Perrin - Spark in Action, Second Edition (2020, Manning Publicat...
PDF
Data Engineering Interview Questions & Answers Batch Processing (Spark, Hadoo...
PPTX
importance of Data-Visualization-in-Data-Science. for mba studnts
PDF
How to run a consulting project- client discovery
PPTX
CYBER SECURITY the Next Warefare Tactics
PPT
ISS -ESG Data flows What is ESG and HowHow
New ISO 27001_2022 standard and the changes
IMPACT OF LANDSLIDE.....................
OneRead_20250728_1808.pdfhdhddhshahwhwwjjaaja
A Complete Guide to Streamlining Business Processes
Introduction to Data Science and Data Analysis
Data Engineering Interview Questions & Answers Cloud Data Stacks (AWS, Azure,...
REAL ILLUMINATI AGENT IN KAMPALA UGANDA CALL ON+256765750853/0705037305
Introduction to the R Programming Language
AI Strategy room jwfjksfksfjsjsjsjsjfsjfsj
IBA_Chapter_11_Slides_Final_Accessible.pptx
Introduction to Inferential Statistics.pptx
annual-report-2024-2025 original latest.
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
Transcultural that can help you someday.
Jean-Georges Perrin - Spark in Action, Second Edition (2020, Manning Publicat...
Data Engineering Interview Questions & Answers Batch Processing (Spark, Hadoo...
importance of Data-Visualization-in-Data-Science. for mba studnts
How to run a consulting project- client discovery
CYBER SECURITY the Next Warefare Tactics
ISS -ESG Data flows What is ESG and HowHow

Neural NLP Models of Information Extraction

  • 1. Unrestricted © Siemens AG 2017 Neural NLP Models of Information Extraction Presenter: Pankaj Gupta | PhD with Prof. Hinrich Schütze | Research Scientist University of Munich (LMU) | Siemens AG, Munich Germany Venue: Google AI, New York City | 25 Mar, 2019
  • 2. Unrestricted © Siemens AG 2019 January 2019Page 2 Machine Intelligence / Siemens AI Lab About Me: Affiliations time Bachelors (B.Tech-IT) 2006-10 2010-13 Senior Software Developer Masters (MSc-CS) & Research Assistant Bachelor Internship 2009 2013-15 Working Student & Master Thesis 2013-15 Starting PhD 2015 PhD Research Intern (4 months) 2016-17 Research Scientist - NLP/ML 2017-Now PhD Submission 2019 Master Thesis: Deep Learning Methods for the Extraction of Relations in Natural Language Text PhD Thesis Title (tentative): Neural Models of Information Extraction from Natural Language Text Reach me: https://sites.google.com/view/gupta-pankaj/
  • 3. Unrestricted © Siemens AG 2019 January 2019Page 3 Machine Intelligence / Siemens AI Lab About Me: Research Neural Topic ModelingNeural Relation Extraction Interpretability ➢ Intra- and Inter-sentential RE ➢ Joint Entity & RE ➢ Weakly-supervised Bootstrapping RE ➢ Autoregressive TMs ➢ Word Embeddings Aware TM ➢ Language Structure Aware TM (textTOvec) ➢ Multi-view Transfer Learning in TM Interpretable RE Interpretable topics Transfer Learning Lifelong Learning ➢ Explaining RNN predictions
  • 4. Unrestricted © Siemens AG 2019 January 2019Page 4 Machine Intelligence / Siemens AI Lab Outline Two Tracks: 1/2 Track: Relation Extraction Neural Relation Extraction Within and Across Sentence Boundaries Pankaj Gupta, Subburam Rajaram, Hinrich Schütze, Thomas Runkler. In AAAI-2019. 2/2 Track: Topic Modeling & Representation Learning (Briefly) Document Informed Neural Autoregressive Topic Models with Distributional Prior Pankaj Gupta, Yatin Chaudhary, Florian Buettner, Hinrich Schütze. In AAAI-2019 textTOvec: Deep Contextualized Neural Autoregressive Topic Models of Language with Distributed Compositional Prior Pankaj Gupta, Yatin Chaudhary, Florian Buettner, Hinrich Schütze. To appear in ICLR-2019.
  • 5. Unrestricted © Siemens AG 2019 January 2019Page 5 Machine Intelligence / Siemens AI Lab Neural Relation Extraction Within and Across Sentence Boundaries Introduction: Relation Extraction spanning sentence boundaries Proposed Methods ➢ Inter-sentential Dependency-Based Neural Networks (iDepNN) → Inter-sentential Shortest Dependency Path (iDepNN-SDP) → Inter-sentential Augmented Dependency Path (iDepNN-ADP) Evaluation and Analysis ➢ State-of-the-art comparison ➢ Error analysis
  • 9. Unrestricted © Siemens AG 2019 January 2019Page 9 Machine Intelligence / Siemens AI Lab Introduction: Relation Extraction (RE) Binary Relation Extraction (RE): identify the semantic relationship between a pair of nominals or entities e1 and e2 in a given text snippet S
  • 10. Unrestricted © Siemens AG 2019 January 2019Page 10 Machine Intelligence / Siemens AI Lab Introduction: Relation Extraction (RE) Binary Relation Extraction (RE): identify the semantic relationship between a pair of nominals or entities e1 and e2 in a given text snippet S Paul Allen has started a company and named [Vern Raburn]e1 its [president]e2. relation: per-post(e1,e2)
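Slides 9-10 define the task; as a reading aid, here is a minimal sketch of how one such labeled instance could be represented in code. The class and field names are illustrative assumptions, not the paper's data format.

```python
# Hypothetical container for a binary RE training instance, mirroring the
# per-post example above; all names here are illustrative.
from dataclasses import dataclass

@dataclass
class REInstance:
    snippet: str   # text spanning one or more sentences
    e1: str        # first entity mention
    e2: str        # second entity mention
    label: str     # relation label, e.g., "per-post(e1,e2)"

ex = REInstance(
    "Paul Allen has started a company and named [Vern Raburn]e1 its [president]e2.",
    "Vern Raburn", "president", "per-post(e1,e2)")
print(ex.label)
```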
  • 11. Unrestricted © Siemens AG 2019 January 2019Page 11 Machine Intelligence / Siemens AI Lab Need for Relation Extraction (RE) → A large part of information is expressed in free text, e.g., in web pages, blogs, social media, etc. → Need for automatic systems to extract the relevant information in the form of a structured KB • Entity Extraction • Relation Extraction • Structure the unstructured text • Knowledge Graph Construction • In web search, retrieval, Q&A, etc. Information Extraction Entity Extraction: detect entities such as person, organization, location, product, technology, sensor, etc. Relation Extraction: detect the relation between the given entities or nominals End-to-End Knowledge Base Population [Figure: Text Documents → IE Engine → Knowledge Graph, with Sensor entities linked by Competitor-of]
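The KB population described on slide 11 ultimately yields relation triples; a toy sketch of that output form with made-up data (this is not the Siemens IE engine):

```python
# Hypothetical KB-population output: relation triples indexed into a graph.
triples = [
    ("Vern Raburn", "per-post", "president"),
    ("Vern Raburn", "per-org", "Paul Allen Group"),
]
knowledge_graph = {}
for head, rel, tail in triples:
    knowledge_graph.setdefault(head, []).append((rel, tail))  # adjacency list
print(knowledge_graph["Vern Raburn"])
```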
  • 12. Unrestricted © Siemens AG 2019 January 2019Page 12 Machine Intelligence / Siemens AI Lab Introduction: Relation Extraction (RE) Relation Extraction (Based on location of entities) Binary Relation Extraction (RE): identify the semantic relationship between a pair of nominals or entities e1 and e2 in a given text snippet S
  • 13. Unrestricted © Siemens AG 2019 January 2019Page 13 Machine Intelligence / Siemens AI Lab Introduction: Relation Extraction (RE) intra-sentential (entities within sentence boundary) most prior works Binary Relation Extraction (RE): identify the semantic relationship between a pair of nominals or entities e1 and e2 in a given text snippet S Relation Extraction (Based on location of entities)
  • 14. Unrestricted © Siemens AG 2019 January 2019Page 14 Machine Intelligence / Siemens AI Lab Introduction: Relation Extraction (RE) intra-sentential (entities within sentence boundary) most prior works This work Binary Relation Extraction (RE): identify the semantic relationship between a pair of nominals or entities e1 and e2 in a given text snippet S Relation Extraction (Based on location of entities) inter-sentential (entities across sentence boundaries)
  • 15. Unrestricted © Siemens AG 2019 January 2019Page 15 Machine Intelligence / Siemens AI Lab Introduction: Relation Extraction (RE) intra-sentential (entities within sentence boundary) most prior works This work Binary Relation Extraction (RE): identify the semantic relationship between a pair of nominals or entities e1 and e2 in a given text snippet S Relation Extraction (Based on location of entities) Paul Allen has started a company and named [Vern Raburn]e1 its [president]e2. relation: per-post(e1,e2) Example inter-sentential (entities across sentence boundaries)
  • 16. Unrestricted © Siemens AG 2019 January 2019Page 16 Machine Intelligence / Siemens AI Lab Introduction: Relation Extraction (RE) intra-sentential (entities within sentence boundary) most prior works This work Binary Relation Extraction (RE): identify the semantic relationship between a pair of nominals or entities e1 and e2 in a given text snippet S Relation Extraction (Based on location of entities) Paul Allen has started a company and named [Vern Raburn]e1 its president. The company, to be called [Paul Allen Group]e2 will be based in Bellevue, Washington. relation: per-org(e1,e2) inter-sentential (entities across sentence boundaries) Example
  • 17. Unrestricted © Siemens AG 2019 January 2019Page 17 Machine Intelligence / Siemens AI Lab Introduction: Relation Extraction (RE) intra-sentential (entities within sentence boundary) most prior works This work Binary Relation Extraction (RE): identify the semantic relationship between a pair of nominals or entities e1 and e2 in a given text snippet S Relation Extraction (Based on location of entities) Paul Allen has started a company and named [Vern Raburn]e1 its president. The company, to be called [Paul Allen Group]e2 will be based in Bellevue, Washington. relation: ?? inter-sentential (entities across sentence boundaries) MISSED relationships impact system performance, leading to POOR RECALL
  • 18. Unrestricted © Siemens AG 2019 January 2019Page 18 Machine Intelligence / Siemens AI Lab Introduction: Relation Extraction (RE) intra-sentential (entities within sentence boundary) most prior works This work Binary Relation Extraction (RE): identify the semantic relationship between a pair of nominals or entities e1 and e2 in a given text snippet S Relation Extraction (Based on location of entities) Paul Allen has started a company and named [Vern Raburn]e1 its president. The company, to be called [Paul Allen Group]e2 will be based in Bellevue, Washington. relation: per-org(e1,e2) inter-sentential (entities across sentence boundaries) Capture relationships between entities at a distance, across sentence boundaries
  • 19. Unrestricted © Siemens AG 2019 January 2019Page 19 Machine Intelligence / Siemens AI Lab Challenges in Inter-sentential Relation Extraction (RE) This work Paul Allen has started a company and named [Vern Raburn]e1 its president. The company will coordinate the overall strategy for the group of high-tech companies that Mr. Allen owns or holds a significant stake in, will be based in Bellevue, Washington and called [Paul Allen Group]e2. relation: per-org(e1,e2) inter-sentential (entities across sentence boundaries)
  • 20. Unrestricted © Siemens AG 2019 January 2019Page 20 Machine Intelligence / Siemens AI Lab Challenges in Inter-sentential Relation Extraction (RE) This work Paul Allen has started a company and named [Vern Raburn]e1 its president. The company will coordinate the overall strategy for the group of high-tech companies that Mr. Allen owns or holds a significant stake in, will be based in Bellevue, Washington and called [Paul Allen Group]e2. relation: per-org(e1,e2) inter-sentential (entities across sentence boundaries) NOISY text in relationships spanning sentence boundaries: POOR PRECISION
  • 21. Unrestricted © Siemens AG 2019 January 2019Page 21 Machine Intelligence / Siemens AI Lab Challenges in Inter-sentential Relation Extraction (RE) This work Paul Allen has started a company and named [Vern Raburn]e1 its president. The company will coordinate the overall strategy for the group of high-tech companies that Mr. Allen owns or holds a significant stake in, will be based in Bellevue, Washington and called [Paul Allen Group]e2. relation: per-org(e1,e2) inter-sentential (entities across sentence boundaries) NOISY text in relationships spanning sentence boundaries: POOR PRECISION Need: a robust system to tackle false positives in inter-sentential RE
  • 22. Unrestricted © Siemens AG 2019 January 2019Page 22 Machine Intelligence / Siemens AI Lab Motivation: Dependency Based Relation Extraction 1. Dependency Parse trees effective in extracting relationships limited to single sentences, i.e., intra-sentential relationships 2. Shortest Dependency Path (SDP) between entities in parse trees effective in RE limited to single sentences, i.e., intra-sentential relationships ignore additional information relevant in relation identification 4. Tree-RNNs effective in modeling relations via recursive compositionality 3. Augmented Dependency Path (ADP) precisely models relationships limited to single sentences, i.e., intra-sentential relationships limited to single sentences, i.e., intra-sentential relationships
  • 24. Unrestricted © Siemens AG 2019 January 2019Page 24 Machine Intelligence / Siemens AI Lab Motivation: Dependency Based Relation Extraction 1. Dependency Parse trees effective in extracting relationships limited to single sentences, i.e., intra-sentential relationships 2. Shortest Dependency Path (SDP) between entities in parse trees effective in RE limited to single sentences, i.e., intra-sentential relationships 4. Tree-RNNs effective in modeling relations via recursive compositionality 3. Augmented Dependency Path (ADP) precisely models relationships limited to single sentences, i.e., intra-sentential relationships limited to single sentences, i.e., intra-sentential relationships ignore additional information relevant in relation identification
  • 25. Unrestricted © Siemens AG 2019 January 2019Page 25 Machine Intelligence / Siemens AI Lab Motivation: Dependency Based Relation Extraction Sentences and their dependency graphs Paul Allen has started a company and named [Vern Raburn]e1 its president. The company, to be called [Paul Allen Group]e2 will be based in Bellevue, Washington. Shortest Dependency Path (SDP) from root to entity e1
  • 26. Unrestricted © Siemens AG 2019 January 2019Page 26 Machine Intelligence / Siemens AI Lab Motivation: Dependency Based Relation Extraction Sentences and their dependency graphs Paul Allen has started a company and named [Vern Raburn]e1 its president. The company, to be called [Paul Allen Group]e2 will be based in Bellevue, Washington. Shortest Dependency Path (SDP) from root to entity e2
  • 27. Unrestricted © Siemens AG 2019 January 2019Page 27 Machine Intelligence / Siemens AI Lab Motivation: Dependency Based Relation Extraction Sentences and their dependency graphs Inter-sentential Shortest Dependency Path (iSDP) across the sentence boundary. Paul Allen has started a company and named [Vern Raburn]e1 its president. The company, to be called [Paul Allen Group]e2 will be based in Bellevue, Washington. iSDP → Connection between the roots of adjacent sentences by NEXTS
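One way to realize the iSDP of slide 27: merge the per-sentence dependency trees into a single graph, add a NEXTS edge between the roots of adjacent sentences, and take the shortest path between the two entity heads. A minimal sketch, assuming spaCy (with en_core_web_sm installed) and networkx rather than the paper's actual tooling, and a naive exact-match entity lookup:

```python
import networkx as nx
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the model is installed

def inter_sentential_sdp(text, e1, e2):
    doc = nlp(text)
    g = nx.Graph()
    for tok in doc:                      # dependency edges within each sentence
        for child in tok.children:
            g.add_edge(tok.i, child.i)
    roots = [sent.root.i for sent in doc.sents]
    for a, b in zip(roots, roots[1:]):   # NEXTS edge between adjacent roots
        g.add_edge(a, b)
    i1 = next(t.i for t in doc if t.text == e1)  # naive single-token lookup
    i2 = next(t.i for t in doc if t.text == e2)
    return [doc[i].text for i in nx.shortest_path(g, i1, i2)]

print(inter_sentential_sdp(
    "Paul Allen has started a company and named Raburn its president. "
    "The company will be based in Bellevue.", "Raburn", "Bellevue"))
```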
  • 28. Unrestricted © Siemens AG 2019 January 2019Page 28 Machine Intelligence / Siemens AI Lab Motivation: Dependency Based Relation Extraction 1. Dependency Parse trees effective in extracting relationships limited to single sentences, i.e., intra-sentential relationships 2. Shortest Dependency Path (SDP) between entities in parse trees effective in RE limited to single sentences, i.e., intra-sentential relationships ignore additional information relevant in relation identification 4. Tree-RNNs effective in modeling relations via recursive compositionality 3. Augmented Dependency Path (ADP) precisely models relationships limited to single sentences, i.e., intra-sentential relationships limited to single sentences, i.e., intra-sentential relationships
  • 29. Unrestricted © Siemens AG 2019 January 2019Page 29 Machine Intelligence / Siemens AI Lab Motivation: Dependency Based Relation Extraction Sentences and their dependency graphs Inter-sentential Shortest Dependency Path (iSDP) across the sentence boundary. Paul Allen has started a company and named [Vern Raburn]e1 its president. The company, to be called [Paul Allen Group]e2 will be based in Bellevue, Washington. subtree | iSDP → Connection between the roots of adjacent sentences by NEXTS
  • 30. Unrestricted © Siemens AG 2019 January 2019Page 30 Machine Intelligence / Siemens AI Lab Motivation: Dependency Based Relation Extraction 1. Dependency Parse trees effective in extracting relationships limited to single sentences, i.e., intra-sentential relationships 2. Shortest Dependency Path (SDP) between entities in parse trees effective in RE limited to single sentences, i.e., intra-sentential relationships ignore additional information relevant in relation identification 3. Augmented Dependency Path (ADP) precisely models relationships limited to single sentences, i.e., intra-sentential relationships limited to single sentences, i.e., intra-sentential relationships 4. Tree-RNNs effective in modeling relations via recursive compositionality
  • 31. Unrestricted © Siemens AG 2019 January 2019Page 31 Machine Intelligence / Siemens AI Lab Motivation: Dependency Based Relation Extraction Sentences and their dependency graphs Inter-sentential Shortest Dependency Path (iSDP) across the sentence boundary. Paul Allen has started a company and named [Vern Raburn]e1 its president. The company, to be called [Paul Allen Group]e2 will be based in Bellevue, Washington. subtree | iSDP → Connection between the roots of adjacent sentences by NEXTS
  • 36. Unrestricted © Siemens AG 2019 January 2019Page 36 Machine Intelligence / Siemens AI Lab Motivation: Dependency Based Relation Extraction 1. Dependency Parse trees effective in extracting relationships limited to single sentences, i.e., intra-sentential relationships 2. Shortest Dependency Path (SDP) between entities in parse trees effective in RE limited to single sentences, i.e., intra-sentential relationships ignore additional information relevant in relation identification 3. Augmented Dependency Path (ADP) precisely models relationships limited to single sentences, i.e., intra-sentential relationships 4. Tree-RNNs effective in modeling relations via recursive compositionality limited to single sentences, i.e., intra-sentential relationships
  • 37. Unrestricted © Siemens AG 2019 January 2019Page 37 Machine Intelligence / Siemens AI Lab Motivation: Dependency Based Relation Extraction 1. Dependency Parse trees effective in extracting relationships limited to single sentences, i.e., intra-sentential relationships 2. Shortest Dependency Path (SDP) between entities in parse trees effective in RE limited to single sentences, i.e., intra-sentential relationships ignore additional information relevant in relation identification 3. Augmented Dependency Path (ADP) precisely models relationships limited to single sentences, i.e., intra-sentential relationships 4. Tree-RNNs effective in modeling relations via recursive compositionality limited to single sentences, i.e., intra-sentential relationships Exploit these properties in inter-sentential RE via a unified neural framework of: → bi-RNN modeling SDP → RecNN modeling ADP
  • 38. Unrestricted © Siemens AG 2019 January 2019Page 38 Machine Intelligence / Siemens AI Lab Contribution Propose a novel neural approach for Inter-sentential Relation Extraction
  • 39. Unrestricted © Siemens AG 2019 January 2019Page 39 Machine Intelligence / Siemens AI Lab Contribution 1. Neural architecture based on dependency parse trees ➔ named as inter-sentential Dependency-based Neural Network (iDepNN) 2. Unified neural framework of a bidirectional RNN (biRNN) and Recursive NN (RecNN) 3. Extract relations within and across sentence boundaries by modeling: ➔ shortest dependency path (SDP) using biRNN ➔ augmented dependency path (ADP) using RecNN Propose a novel neural approach for Inter-sentential Relation Extraction Contribution
  • 42. Unrestricted © Siemens AG 2019 January 2019Page 42 Machine Intelligence / Siemens AI Lab Contribution 1. Neural architecture based on dependency parse trees ➔ named as inter-sentential Dependency-based Neural Network (iDepNN) 2. Unified neural framework of a bidirectional RNN (biRNN) and Recursive NN (RecNN) 3. Extract relations within and across sentence boundaries by modeling: ➔ shortest dependency path (SDP) using biRNN ➔ augmented dependency path (ADP) using RecNN Propose a novel neural approach for Inter-sentential Relation Extraction Contribution 1. precisely extract relationships within and across sentence boundaries 2. show a better balance in precision and recall with an improved F1 score Benefits
  • 43. Unrestricted © Siemens AG 2019 January 2019Page 43 Machine Intelligence / Siemens AI Lab Proposed Approach: Neural Intra- and inter-sentential RE
  • 44. Unrestricted © Siemens AG 2019 January 2019Page 44 Machine Intelligence / Siemens AI Lab Proposed Approach: Intra- and inter-sentential RE Inter-sentential Dependency-based Neural Network variants: iDepNN-SDP and iDepNN-ADP 1. Modeling Inter-sentential Shortest Dependency Path 2. Modeling Inter-sentential Dependency Subtrees 1+2: Modeling Inter-sentential Augmented Dependency Path
  • 45. Unrestricted © Siemens AG 2019 January 2019Page 45 Machine Intelligence / Siemens AI Lab Proposed Approach: Intra- and inter-sentential RE Inter-sentential Dependency-based Neural Network variants: iDepNN-SDP and iDepNN-ADP
  • 46. Unrestricted © Siemens AG 2019 January 2019Page 46 Machine Intelligence / Siemens AI Lab Proposed Approach: Intra- and inter-sentential RE Inter-sentential Dependency-based Neural Network variants: iDepNN-SDP 1. Modeling Inter-sentential Shortest Dependency Path
  • 48. Unrestricted © Siemens AG 2019 January 2019Page 48 Machine Intelligence / Siemens AI Lab Proposed Approach: Intra- and inter-sentential RE Inter-sentential Dependency-based Neural Network variants: iDepNN-ADP subtree 2. Modeling Inter-sentential Dependency Subtrees
  • 49. Unrestricted © Siemens AG 2019 January 2019Page 49 Machine Intelligence / Siemens AI Lab Proposed Approach: Intra- and inter-sentential RE Inter-sentential Dependency-based Neural Network variants: iDepNN-ADP Compute subtree embedding
  • 50. Unrestricted © Siemens AG 2019 January 2019Page 50 Machine Intelligence / Siemens AI Lab Proposed Approach: Intra- and inter-sentential RE Inter-sentential Dependency-based Neural Network variants: iDepNN-ADP 1+2: Modeling Inter-sentential Augmented Dependency Path → Offers precise structure → Offers additional information in classifying the relation
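A compact sketch of how the two variants combine, as slides 44-50 describe: a recursive composition produces subtree embeddings (the ADP part), which augment the word embeddings fed to a bidirectional RNN over the shortest path (the SDP part). Written in PyTorch; the dimensions, the tanh composition, and using the final hidden state for classification are assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class SubtreeComposer(nn.Module):
    """Recursive NN: compose a subtree embedding bottom-up from its children."""
    def __init__(self, dim):
        super().__init__()
        self.w_word = nn.Linear(dim, dim)
        self.w_child = nn.Linear(dim, dim, bias=False)

    def forward(self, emb, tree, node):
        # tree: dict mapping a token id to the ids of its dependents
        child_sum = torch.zeros(self.w_word.out_features)
        for child in tree.get(node, []):
            child_sum = child_sum + self.forward(emb, tree, child)
        return torch.tanh(self.w_word(emb(torch.tensor(node)))
                          + self.w_child(child_sum))

class IDepNNADP(nn.Module):
    """biRNN over the (inter-sentential) SDP; each step sees word + subtree embs."""
    def __init__(self, vocab, dim, hidden, n_rel):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rec = SubtreeComposer(dim)
        self.birnn = nn.RNN(2 * dim, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_rel)

    def forward(self, sdp, tree):
        # sdp: token ids along the shortest dependency path between e1 and e2
        feats = [torch.cat([self.emb(torch.tensor(t)),
                            self.rec(self.emb, tree, t)]) for t in sdp]
        h, _ = self.birnn(torch.stack(feats).unsqueeze(0))  # (1, len, 2*hidden)
        return self.out(h[:, -1])                           # relation scores

model = IDepNNADP(vocab=100, dim=8, hidden=16, n_rel=3)
print(model(sdp=[5, 2, 9], tree={2: [7, 8], 9: [1]}).shape)  # (1, 3)
```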
  • 51. Unrestricted © Siemens AG 2019 January 2019Page 51 Machine Intelligence / Siemens AI Lab Evaluation and Analysis
  • 52. Unrestricted © Siemens AG 2019 January 2019Page 52 Machine Intelligence / Siemens AI Lab Evaluation and Analysis: Datasets Datasets ➢ evaluate on four datasets from medical and news domains Count of intra- and inter-sentential relationships in datasets
  • 57. Unrestricted © Siemens AG 2019 January 2019Page 57 Machine Intelligence / Siemens AI Lab Evaluation and Analysis: Datasets Datasets ➢ evaluate on four datasets from medical and news domains Count of intra- and inter-sentential relationships in datasets Result discussed in this talk Lives_In → two arguments: the bacterium and the location, where the location is a Habitat (e.g., microbial ecology such as hosts, environment, food, etc.) or a Geographical entity (e.g., geographical and organizational places) Data: http://2016.bionlp-st.org/tasks/bb2
  • 58. Unrestricted © Siemens AG 2019 January 2019Page 58 Machine Intelligence / Siemens AI Lab Evaluation and Analysis: Datasets + Baselines Datasets ➢ evaluate on four datasets from medical and news domains Baselines: → SVM, graphLSTM, i-biRNN and i-biLSTM Count of intra- and inter-sentential relationships in datasets Result discussed in this talk graphLSTMs: Peng et al., 2017. Cross-Sentence N-ary Relation Extraction with Graph LSTMs.
  • 59. Unrestricted © Siemens AG 2019 January 2019Page 59 Machine Intelligence / Siemens AI Lab Results (Precision / Recall / F1) Sentence range k: k = 0 ➔ Intra-sentential; k > 0 ➔ Inter-sentential
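The sentence range k is just the number of sentence boundaries between the two entities; a small helper for bucketing candidates accordingly, with assumed input conventions:

```python
# Sketch: bucket relation candidates by sentence range k, where k = 0 means
# intra-sentential and k > 0 inter-sentential; inputs are sentence indices.
def sentence_range(sent_index_e1, sent_index_e2):
    return abs(sent_index_e1 - sent_index_e2)

def bucket(candidates):
    # candidates: list of (sent_index_e1, sent_index_e2, label) tuples
    buckets = {}
    for s1, s2, label in candidates:
        buckets.setdefault(sentence_range(s1, s2), []).append(label)
    return buckets

print(bucket([(0, 0, "Lives_In"), (0, 1, "Lives_In"), (2, 0, "NONE")]))
```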
  • 60. Unrestricted © Siemens AG 2019 January 2019Page 60 Machine Intelligence / Siemens AI Lab Results (Precision / Recall / F1): Intra-sentential Training iDepNN-ADP is more precise in inter-sentential RE than both SVM and graphLSTM
  • 61. Unrestricted © Siemens AG 2019 January 2019Page 61 Machine Intelligence / Siemens AI Lab Results (Precision / Recall / F1): Intra-sentential Training iDepNN-ADP outperforms both SVM and graphLSTM in terms of F1 in inter-sentential RE due to a better balance in precision and recall
  • 62. Unrestricted © Siemens AG 2019 January 2019Page 62 Machine Intelligence / Siemens AI Lab Results (Precision / Recall / F1): Inter-sentential Training iDepNN-ADP outperforms both SVM and graphLSTM in terms of P and F1 in inter-sentential RE
  • 63. Unrestricted © Siemens AG 2019 January 2019Page 63 Machine Intelligence / Siemens AI Lab Results (Precision / Recall / F1): Inter-sentential Training
  • 64. Unrestricted © Siemens AG 2019 January 2019Page 64 Machine Intelligence / Siemens AI Lab Results (Precision / Recall / F1): Ensemble
  • 65. Unrestricted © Siemens AG 2019 January 2019Page 65 Machine Intelligence / Siemens AI Lab Ensemble with Thresholding on Prediction Probability Ensemble scores at various thresholds. p: output probability; pr: the count of predictions.
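A hedged sketch of the thresholded ensemble: a relation prediction is kept only when at least pr ensemble members assign it probability >= p. The voting scheme shown is an assumption about how p and pr interact; the paper defines the exact rule:

```python
# Keep a relation label only if >= pr members predict it with probability >= p.
from collections import Counter

def ensemble_predict(member_probs, p=0.85, pr=2):
    # member_probs: one dict {relation_label: probability} per ensemble member
    votes = Counter(label for probs in member_probs
                    for label, prob in probs.items() if prob >= p)
    label, count = votes.most_common(1)[0] if votes else (None, 0)
    return label if count >= pr else None

print(ensemble_predict([{"Lives_In": 0.90}, {"Lives_In": 0.88}, {"NONE": 0.70}]))
```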
  • 66. Unrestricted © Siemens AG 2019 January 2019Page 66 Machine Intelligence / Siemens AI Lab Official Scores: State-of-the-art Comparison Ensemble scores at various thresholds. p: output probability; pr: the count of predictions. Official results on the test set: comparison with the published systems in the BioNLP ST 2016.
  • 67. Unrestricted © Siemens AG 2019 January 2019Page 67 Machine Intelligence / Siemens AI Lab Error Analysis: BioNLP ST 2016 dataset
  • 69. Unrestricted © Siemens AG 2019 January 2019Page 69 Machine Intelligence / Siemens AI Lab Error Analysis: BioNLP ST 2016 dataset Fewer false positives in iDepNN-ADP than in both SVM and graphLSTM [Figure panels: iDepNN-ADP | SVM | graphLSTM]
  • 70. Unrestricted © Siemens AG 2019 January 2019Page 70 Machine Intelligence / Siemens AI Lab Key Takeaways ➢ Propose a novel neural approach iDepNN for Inter-sentential Relation Extraction ➢ Precisely extract relations within and across sentence boundaries by modeling: ➔ shortest dependency path (SDP) using biRNN, i.e., iDepNN-SDP ➔ augmented dependency path (ADP) using RecNN, i.e., iDepNN-ADP ➢ Demonstrate a better balance in precision and recall with an improved F1 score ➢ Evaluate on 4 datasets from news and medical domains ➢ Achieve a gain of 5.2% (0.587 vs 0.558) in F1 over the winning team (out of 11 teams) in BioNLP Shared Task (ST) 2016 Code and Data: https://github.com/pgcool/Cross-sentence-Relation-Extraction-iDepNN
  • 71. Unrestricted © Siemens AG 2019 January 2019Page 71 Machine Intelligence / Siemens AI Lab Outline 1/2 Tracks: Relation Extraction Neural Relation Extraction Within and Across Sentence Boundaries Pankaj Gupta, Subburam Rajaram, Hinrich Schütze, Thomas Runkler. In AAAI-2019. Active Research in Information Extraction: → Neural Models of Lifelong Learning for Information Extraction → Weakly-supervised Neural Bootstrapping for Relation Extraction (PhD Student: Mr. Usama Yaseen)
  • 72. Unrestricted © Siemens AG 2019 January 2019Page 72 Machine Intelligence / Siemens AI Lab Research Outline Neural Topic Modeling | Neural Relation Extraction Interpretability ➢ Intra- and Inter-sentential RE ➢ Joint Entity & RE ➢ Weakly-supervised Bootstrapping RE ➢ Autoregressive TMs ➢ Word Embeddings Aware TM ➢ Language Structure Aware TM (textTOvec) ➢ Multi-view Transfer Learning in TM Interpretable RE Interpretable topics Transfer Learning Lifelong Learning ➢ Explaining RNN predictions
  • 74. Unrestricted © Siemens AG 2019 January 2019Page 74 Machine Intelligence / Siemens AI Lab Outline (Brief Introduction) 2/2 Tracks: Topic Modeling & Representation Learning Document Informed Neural Autoregressive Topic Models with Distributional Prior Pankaj Gupta, Yatin Chaudhary, Florian Buettner, Hinrich Schütze. In AAAI-2019 textTOvec: Deep Contextualized Neural Autoregressive Topic Models of Language with Distributed Compositional Prior Pankaj Gupta, Yatin Chaudhary, Florian Buettner, Hinrich Schütze. To appear in ICLR-2019. Multi-view and Multi-source Transfers in Neural Topic Modeling Pankaj Gupta, Yatin Chaudhary, Hinrich Schütze. Under review.
  • 75. Unrestricted © Siemens AG 2019 January 2019Page 75 Machine Intelligence / Siemens AI Lab Outline (Brief Introduction) 2/2 Tracks: Topic Modeling & Representation Learning Document Informed Neural Autoregressive Topic Models with Distributional Prior Pankaj Gupta, Yatin Chaudhary, Florian Buettner, Hinrich Schütze. In AAAI-2019 TL;DR → Improved Topic Modeling with full-contexts and pre-trained word embeddings textTOvec: Deep Contextualized Neural Autoregressive Topic Models of Language with Distributed Compositional Prior Pankaj Gupta, Yatin Chaudhary, Florian Buettner, Hinrich Schütze. To appear in ICLR-2019. TL;DR → Improved Topic modeling with language structures (e.g., word ordering, local syntax and semantic information); Composite Model of a neural topic and neural language model Multi-view and Multi-source Transfers in Neural Topic Modeling Pankaj Gupta, Yatin Chaudhary, Hinrich Schütze. Under review. TL;DR → Improved Topic modeling with knowledge transfer via local as well as global semantics
  • 77. Unrestricted © Siemens AG 2019 January 2019Page 77 Machine Intelligence / Siemens AI Lab Outline (Brief Introduction) 2/2 Tracks: Topic Modeling & Representation Learning Document Informed Neural Autoregressive Topic Models with Distributional Prior Pankaj Gupta, Yatin Chaudhary, Florian Buettner, Hinrich Schütze. In AAAI-2019 TL;DR → Improved Topic Modeling with context-awareness and pre-trained word embeddings textTOvec: Deep Contextualized Neural Autoregressive Topic Models of Language with Distributed Compositional Prior Pankaj Gupta, Yatin Chaudhary, Florian Buettner, Hinrich Schütze. To appear in ICLR-2019. TL;DR → Improved Topic modeling with language structures (e.g., word ordering, local syntax and semantic information); Composite Model of a neural topic and neural language model Multi-view and Multi-source Transfers in Neural Topic Modeling Pankaj Gupta, Yatin Chaudhary, Hinrich Schütze. Under review. TL;DR → Improved Topic modeling with knowledge transfer via local and global semantics
  • 78. Unrestricted © Siemens AG 2019 January 2019Page 78 Machine Intelligence / Siemens AI Lab Topic Modeling ➢ statistical modeling that examines how words co-occur across a collection of documents, and ➢ automatically discovers coherent groups of words (i.e., themes or topics) that best explain the corpus ➢ Each document is composed of a mixture of topics, and each topic is composed of a collection of words Source: http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf
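To ground the definition on slide 78, a toy run of a classical topic model; gensim's LDA is used purely for illustration here (the talk's own models are neural, and gensim is an assumption, not part of this work):

```python
# Toy topic discovery: two tiny "themes" should separate into two topics.
from gensim import corpora, models

docs = [["shares", "price", "profits", "earnings"],
        ["bacterium", "habitat", "host", "environment"],
        ["price", "shares", "rises", "earnings"]]
dic = corpora.Dictionary(docs)
lda = models.LdaModel([dic.doc2bow(d) for d in docs], num_topics=2,
                      id2word=dic, passes=10, random_state=0)
for topic in lda.print_topics():
    print(topic)
```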
  • 79. Unrestricted © Siemens AG 2019 January 2019Page 79 Machine Intelligence / Siemens AI Lab Outline (Brief Introduction) 2/2 Tracks: Topic Modeling & Representation Learning Document Informed Neural Autoregressive Topic Models with Distributional Prior Pankaj Gupta, Yatin Chaudhary, Florian Buettner, Hinrich Schütze. In AAAI-2019 TL;DR → Improved Topic Modeling with full-contexts and pre-trained word embeddings textTOvec: Deep Contextualized Neural Autoregressive Topic Models of Language with Distributed Compositional Prior Pankaj Gupta, Yatin Chaudhary, Florian Buettner, Hinrich Schütze. To appear in ICLR-2019. TL;DR → Improved Topic modeling with language structures (e.g., word ordering, local syntax and semantic information); Composite Model of a neural topic and neural language model Multi-view and Multi-source Transfers in Neural Topic Modeling Pankaj Gupta, Yatin Chaudhary, Hinrich Schütze. Under review. TL;DR → Improved Topic modeling with knowledge transfer via local as well as global semantics
  • 80. Unrestricted © Siemens AG 2019 January 2019Page 80 Machine Intelligence / Siemens AI Lab Document Informed Neural Autoregressive Topic Models with Distributional Prior (AAAI-19) Need for Distributional Semantics / Prior Knowledge ➢ “Lack of Context” in short-text documents, e.g., headlines, tweets, etc. ➢ “Lack of Context” in a corpus of few documents Small number of Word co-occurrences
  • 81. Unrestricted © Siemens AG 2019 January 2019Page 81 Machine Intelligence / Siemens AI Lab Document Informed Neural Autoregressive Topic Models with Distributional Prior (AAAI-19) Need for Distributional Semantics / Prior Knowledge ➢ “Lack of Context” in short-text documents, e.g., headlines, tweets, etc. ➢ “Lack of Context” in a corpus of few documents Small number of Word co-occurrences Lack of Context Difficult to learn good representations Generate Incoherent Topics
  • 82. Unrestricted © Siemens AG 2019 January 2019Page 82 Machine Intelligence / Siemens AI Lab Document Informed Neural Autoregressive Topic Models with Distributional Prior (AAAI-19) Need for Distributional Semantics / Prior Knowledge ➢ "Lack of Context" in short-text documents, e.g., headlines, tweets, etc. ➢ "Lack of Context" in a corpus of few documents Small number of Word co-occurrences Lack of Context Difficult to learn good representations Generate Incoherent Topics Example topics for 'trading': Topic1: price, wall, china, fall, shares (incoherent); Topic2: shares, price, profits, rises, earnings (coherent)
  • 83. Unrestricted © Siemens AG 2019 January 2019Page 83 Machine Intelligence / Siemens AI Lab Document Informed Neural Autoregressive Topic Models with Distributional Prior (AAAI-19) Need for Distributional Semantics / Prior Knowledge ➢ "Lack of Context" in short-text documents, e.g., headlines, tweets, etc. ➢ "Lack of Context" in a corpus of few documents Small number of Word co-occurrences Lack of Context Difficult to learn good representations Generate Incoherent Topics Example topics for 'trading': Topic1: price, wall, china, fall, shares (incoherent); Topic2: shares, price, profits, rises, earnings (coherent) TO RESCUE: Use external/additional information, e.g., WORD EMBEDDINGS (encode semantic and syntactic relatedness of words in a vector space)
  • 84. Unrestricted © Siemens AG 2019 January 2019Page 84 Machine Intelligence / Siemens AI Lab Document Informed Neural Autoregressive Topic Models with Distributional Prior (AAAI-19) Need for Distributional Semantics / Prior Knowledge ➢ "Lack of Context" in short-text documents, e.g., headlines, tweets, etc. ➢ "Lack of Context" in a corpus of few documents Small number of Word co-occurrences Lack of Context Difficult to learn good representations Generate Incoherent Topics TO RESCUE: Use external/additional information, e.g., WORD EMBEDDINGS (encode semantic and syntactic relatedness of words in a vector space) No word overlap (e.g., under 1-hot encoding), yet same topic class → trading
  • 85. Unrestricted © Siemens AG 2019 January 2019Page 85 Machine Intelligence / Siemens AI Lab Document Informed Neural Autoregressive Topic Models with Distributional Prior (AAAI-19) ➢ introduce weighted pre-trained word embedding aggregation (mixture weights) at each autoregressive step k ➢ E: pre-trained embeddings (e.g., GloVe) as a fixed prior ➢ generate topics with embeddings ➢ learn a complementary textual representation [Figure: Baseline Model vs Proposed Model]
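As a reading aid for slide 85, a NumPy sketch of the aggregation, assuming the hidden state takes the DocNADE-style form h_k = g(c + Σ_{i&lt;k}(W[:, v_i] + λ E[:, v_i])) with a sigmoid non-linearity g and a fixed prior E; consult the paper for the exact formulation:

```python
import numpy as np

def hidden_states(v, W, E, c, lam=1.0):
    # v: word ids of one document; W: (H, V) learned topic matrix;
    # E: (H, V) fixed pre-trained embedding prior (e.g., GloVe); c: (H,) bias
    acc = c.astype(float).copy()
    states = []
    for w in v:
        states.append(1.0 / (1.0 + np.exp(-acc)))  # h_k sees only words i < k
        acc += W[:, w] + lam * E[:, w]              # weighted prior aggregation
    return np.stack(states)

rng = np.random.default_rng(0)
print(hidden_states([3, 1, 4], rng.normal(size=(8, 10)),
                    rng.normal(size=(8, 10)), np.zeros(8)).shape)  # (3, 8)
```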
  • 86. Unrestricted © Siemens AG 2019 January 2019Page 86 Machine Intelligence / Siemens AI Lab Document Informed Neural Autoregressive Topic Models with Distributional Prior (AAAI-19) IR-precision (on short-text datasets) → Precision at different retrieval fractions; higher is better Evaluation: Applicability (Information Retrieval)
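IR-precision at a retrieval fraction can be computed as below: for each query document, retrieve the top fraction of the corpus by similarity and measure the fraction sharing the query's label. A sketch under the assumption of L2-normalized document vectors; it is not the evaluation script from the paper:

```python
import numpy as np

def ir_precision(doc_vecs, labels, fraction=0.02):
    # doc_vecs: (N, D) L2-normalized document vectors; labels: (N,) class ids
    sims = doc_vecs @ doc_vecs.T
    np.fill_diagonal(sims, -np.inf)          # never retrieve the query itself
    k = max(1, int(fraction * (len(labels) - 1)))
    hits = []
    for q in range(len(labels)):
        top = np.argsort(-sims[q])[:k]       # top-fraction nearest documents
        hits.append(np.mean(labels[top] == labels[q]))
    return float(np.mean(hits))

rng = np.random.default_rng(0)
v = rng.normal(size=(50, 8))
v /= np.linalg.norm(v, axis=1, keepdims=True)
print(ir_precision(v, rng.integers(0, 3, size=50), fraction=0.1))
```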
  • 87. Unrestricted © Siemens AG 2019 January 2019Page 87 Machine Intelligence / Siemens AI Lab Document Informed Neural Autoregressive Topic Models with Distributional Prior (AAAI-19) Takeaways of this work: ➢ Leveraging full contextual information in a neural autoregressive topic model ➢ Introducing distributional priors via pre-trained word embeddings ➢ Gain of 5.2% (404 vs 426) in perplexity, 2.8% (.74 vs .72) in topic coherence, 11.1% (.60 vs .54) in precision at retrieval fraction 0.02, 5.2% (.664 vs .631) in F1 for text categorization on average over 15 datasets ➢ Learning better word/document representations for short/long texts Try it out: The code and data are available at https://github.com/pgcool/iDocNADEe @PankajGupta262
  • 88. Unrestricted © Siemens AG 2019 January 2019Page 88 Machine Intelligence / Siemens AI Lab Local vs Global Semantics Language Models have Local View (semantics): → A vector-space representation for each word, based on the local word collocation patterns → Due to word-word co-occurrence, limited to a window size (e.g., word2vec) or a sentence (e.g., ELMo) → Information beyond the limited context is not exposed → Good at capturing local syntactic and semantic information
  • 89. Unrestricted © Siemens AG 2019 January 2019Page 89 Machine Intelligence / Siemens AI Lab Local vs Global Semantics Language Models have Local View (semantics): → A vector-space representation for each word, based on the local word collocation patterns → Due to word-word co-occurrence, limited to a window size (e.g., word2vec) or a sentence (e.g., ELMo) → Information beyond the limited context is not exposed → Good at capturing local syntactic and semantic information → Difficulties in capturing long-range dependencies
  • 92. Unrestricted © Siemens AG 2019 January 2019Page 92 Machine Intelligence / Siemens AI Lab Local vs Global Semantics Topic Models have Global View (semantics): → Due to document-word occurrences (i.e., words are similar if these words similarly appear in documents) → Access to document context, not limited by local context → Good at capturing thematic structures or long-range dependencies in document collection Topic models have global view in the sense that each topic is learned by leveraging statistical information across documents
  • 94. Unrestricted © Siemens AG 2019 January 2019Page 94 Machine Intelligence / Siemens AI Lab Local vs Global Semantics Topic Models have Global View (semantics): → Due to document-word occurrences (i.e., words are similar if these words similarly appear in documents) → Access to document context, not limited by local context → Good at capturing thematic structures or long-range dependencies in document collection → No language structures (e.g., word ordering, local syntactic and semantic information, etc.) → Difficulties in capturing short-range dependencies Same unigram statistics, but different topics (Source Text → Sense/Topic): "Market falls into bear territory" → "trading"; "Bear falls into market territory" → "trading" Language structure helps in determining the actual meaning!
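The word-order point is easy to verify: both example sentences have identical bags of words, so any order-ignoring model receives exactly the same input.

```python
# Both sentences from slide 94 share identical unigram statistics.
from collections import Counter

s1 = "Market falls into bear territory".lower().split()
s2 = "Bear falls into market territory".lower().split()
print(Counter(s1) == Counter(s2))  # True
```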
  • 95. Unrestricted © Siemens AG 2019 January 2019Page 95 Machine Intelligence / Siemens AI Lab textTOvec: Deep Contextualized Neural Autoregressive Topic Models of Language with Distributed Compositional Prior (ICLR-19) Incorporate language structures in Topic Models → accounting for word ordering, latent syntactic and semantic features → improving word and document representations, including polysemy Improving Topic Modeling for short-text and long-text documents via contextualized features and external knowledge Incorporate external knowledge for each word → using distributional semantics, i.e., word embeddings → improving document representations and topics
  • 96. Unrestricted © Siemens AG 2019 January 2019Page 96 Machine Intelligence / Siemens AI Lab textTOvec: Deep Contextualized Neural Autoregressive Topic Models of Language with Distributed Compositional Prior (ICLR-19) Advantages of Composite Modeling: → introduce language structure into neural autoregressive topic models via an LSTM-LM, such as word ordering, language concepts and long-range dependencies. → probability of each word is a function of global and local contexts, modeled via DocNADE and LSTM-LM, respectively. → offers learning complementary semantics by combining joint word and latent topic learning in a unified neural autoregressive framework. contextualized-Document Neural Autoregressive Distribution Estimator (ctx-DocNADE) with pre-trained word embeddings (ctx-DocNADEe)
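An illustrative PyTorch sketch of the composite idea: mix a DocNADE-style bag-of-words hidden state (global view) with an LSTM-LM hidden state (local view) before predicting each word. The additive mixing with weight lam and all dimensions are assumptions; the released implementation is linked on the next slide:

```python
import torch
import torch.nn as nn

class CtxDocNADESketch(nn.Module):
    def __init__(self, vocab, hidden, lam=0.5):
        super().__init__()
        self.emb = nn.Embedding(vocab, hidden)   # input embeddings for the LM
        self.W = nn.Embedding(vocab, hidden)     # DocNADE topic matrix as lookup
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)
        self.lam = lam

    def forward(self, doc):                      # doc: (1, T) word ids
        w = self.W(doc)                          # (1, T, hidden)
        topic = torch.sigmoid(torch.cumsum(w, dim=1) - w)  # global view: i < k
        lm, _ = self.lstm(self.emb(doc))         # local view: LSTM-LM states
        return self.out(topic + self.lam * lm)   # mixed hidden -> word scores

model = CtxDocNADESketch(vocab=50, hidden=16)
print(model(torch.tensor([[3, 1, 4, 1]])).shape)  # (1, 4, 50)
```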
  • 97. Unrestricted © Siemens AG 2019 January 2019Page 97 Machine Intelligence / Siemens AI Lab textTOvec: Deep Contextualized Neural Autoregressive Topic Models of Language with Distributed Compositional Prior (ICLR-19) Code: https://github.com/pgcool/textTOvec
  • 98. Unrestricted © Siemens AG 2019 January 2019Page 98 Machine Intelligence / Siemens AI Lab Outline: Topic Modeling (Brief Introduction) 2/2 Tracks: Topic Modeling & Representation Learning Document Informed Neural Autoregressive Topic Models with Distributional Prior Pankaj Gupta, Yatin Chaudhary, Florian Buettner, Hinrich Schütze. In AAAI-2019 TL;DR → Improved Topic Modeling with context-awareness and pre-trained word embeddings textTOvec: Deep Contextualized Neural Autoregressive Topic Models of Language with Distributed Compositional Prior Pankaj Gupta, Yatin Chaudhary, Florian Buettner, Hinrich Schütze. To appear in ICLR-2019. TL;DR → Improved Topic modeling with language structures (e.g., word ordering, local syntax and semantic information); Composite Model of a neural topic and neural language model Multi-view and Multi-source Transfers in Neural Topic Modeling Pankaj Gupta, Yatin Chaudhary, Hinrich Schütze. Under review. TL;DR → Improved Topic modeling with knowledge transfer via local and global semantics
  • 99. Unrestricted © Siemens AG 2019 January 2019Page 99 Machine Intelligence / Siemens AI Lab Outline: Topic Modeling 2/2 Tracks: Topic Modeling & Representation Learning Document Informed Neural Autoregressive Topic Models with Distributional Prior Pankaj Gupta, Yatin Chaudhary, Florian Buettner, Hinrich Schütze. In AAAI-2019 textTOvec: Deep Contextualized Neural Autoregressive Topic Models of Language with Distributed Compositional Prior Pankaj Gupta, Yatin Chaudhary, Florian Buettner, Hinrich Schütze. To appear in ICLR-2019. Active Research in Topic Modeling & Representation Learning: → Multi-view and Multi-source Transfers in Neural Topic Modeling Pankaj Gupta, Yatin Chaudhary, Hinrich Schütze. Under review. → Improving Language Models with Global Semantics via Neural Composite Networks → Lifelong Neural Topic Learning (PhD Student: Mr. Yatin Chaudhary)
  • 100. Unrestricted © Siemens AG 2019 January 2019Page 100 Machine Intelligence / Siemens AI Lab Summary & Thanks! Neural Topic Modeling | Neural Relation Extraction Interpretability ➢ Intra- and Inter-sentential RE ➢ Joint Entity & RE ➢ Weakly-supervised Bootstrapping RE ➢ Autoregressive TMs ➢ Word Embeddings Aware TM ➢ Language Structure Aware TM (textTOvec) ➢ Multi-view Transfer Learning in TM Interpretable RE Interpretable topics Transfer Learning Lifelong Learning ➢ Explaining RNN predictions ReachMe / Talks: https://sites.google.com/view/gupta-pankaj/