Seminar at University of Leipzig
8 September, 2016, Leipzig, Germany
Noun Sense Induction and Disambiguation using
Graph-Based Distributional Semantics
Alexander Panchenko, Johannes Simon, Martin Riedl and Chris Biemann
Technische Universität Darmstadt, LT Group, Computer Science Department, Germany
September 7, 2016 | 1
Summary
▶ Panchenko A., Simon J., Riedl M., Biemann C. "Noun Sense Induction and
Disambiguation using Graph-Based Distributional Semantics". In
Proceedings of the KONVENS 2016, Bochum, Germany
▶ An approach to word sense induction and disambiguation.
▶ The method is unsupervised and knowledge-free.
▶ Sense induction by clustering of word similarity networks
▶ Feature aggregation w.r.t. the induced inventory.
▶ Comparable to the state-of-the-art unsupervised WSD (SemEval’13
participants and various sense embeddings).
▶ Open source implementation: github.com/tudarmstadt-lt/JoSimText
September 7, 2016 | 2
Motivation for Unsupervised Knowledge-Free
Word Sense Disambiguation
▶ A word sense disambiguation (WSD) system:
▶ Input: word and its context.
▶ Output: a sense of this word.
▶ Surveys: Agirre and Edmonds (2007) and Navigli (2009).
▶ Knowledge-based approaches that rely on hand-crafted resources, such as
WordNet.
▶ Supervised approaches learn from hand-labeled training data, such as SemCor.
▶ Problem 1: hand-crafted lexical resources and training data are expensive to
create, often inconsistent and domain-dependent.
▶ Problem 2: These methods assume a fixed sense inventory:
▶ senses emerge and disappear over time.
▶ different applications require different granularities of the sense inventory.
▶ An alternative route is the unsupervised knowledge-free approach.
▶ learn an interpretable sense inventory
▶ learn a disambiguation model
September 7, 2016 | 3
Contribution
▶ The contribution is a framework that relies on induced inventories as a pivot
for learning contextual feature representations and disambiguation.
▶ We rely on the JoBimText framework and distributional semantics (Biemann
and Riedl, 2013) adding a word sense disambiguation functionality on top of it.
▶ The advantage of our method, compared to prior art, is that it can integrate
several types of context features in an unsupervised way.
▶ The method achieves state-of-the-art results in unsupervised WSD.
September 7, 2016 | 4
Method: Data-Driven Noun Sense Modelling
1. Computation of a distributional thesaurus
▶ using distributional semantics
2. Word sense induction
▶ using ego-network clustering of related words
3. Building a disambiguation model of the induced senses
▶ by feature aggregation w.r.t. the induced sense inventory
September 7, 2016 | 5
Method: Distributional Thesaurus of Nouns using the JoBimText framework
▶ A distributional thesaurus (DT) is a graph of word similarities, such as
“(Python, Java, 0.781)”.
▶ We used the JoBimText framework (Biemann and Riedl, 2013):
▶ efficient computation of nearest neighbours for all words
▶ providing state-of-the-art performance (Riedl, 2016)
▶ For each noun in the corpus, get the 200 most similar nouns
September 7, 2016 | 6
Method: Distributional Thesaurus of Nouns using the JoBimText framework (cont.)
▶ For each noun in the corpus, get the l = 200 most similar nouns:
1. Extract word, feature and word-feature frequencies.
▶ Dependency-based features, such as amod(•, grilled) or prep_for(•, dinner), extracted with the Malt parser (Nivre et al., 2007).
▶ Dependencies are collapsed in the same way as in the Stanford dependencies.
2. Discard rare words, features and word-features (t < 3).
3. Normalize word-feature scores using the Local Mutual Information (LMI):
$\mathrm{LMI}(i, j) = f_{ij} \cdot \mathrm{PMI}(i, j) = f_{ij} \cdot \log \frac{f_{ij} \cdot \sum_{i,j} f_{ij}}{f_{i*} \cdot f_{*j}}$
4. Ranking word features by LMI.
5. Prune all but the p = 1000 most significant features per word.
6. Word similarities are computed as the number of common features of two words (see the sketch below): $\mathrm{sim}(t_i, t_j) = |\{k : f_{ik} > 0 \wedge f_{jk} > 0\}|$
7. Return l = 200 most related words per word.
September 7, 2016 | 7
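For illustration, a minimal Python sketch of steps 3–6 above (LMI weighting, pruning to the p most significant features, and similarity as feature overlap); the toy counts and helper names are hypothetical, and this is not the actual JoBimText implementation.

```python
import math
from collections import defaultdict

# Hypothetical word-feature counts f(w, c) extracted from a parsed corpus.
counts = {
    ("dinner", "amod(•, grilled)"): 12, ("dinner", "prep_for(•, table)"): 7,
    ("lunch",  "amod(•, grilled)"): 9,  ("lunch",  "prep_for(•, table)"): 3,
}
total = sum(counts.values())                       # sum_{i,j} f_ij
f_word, f_feat = defaultdict(int), defaultdict(int)
for (w, c), f in counts.items():                   # marginal counts f_i* and f_*j
    f_word[w] += f
    f_feat[c] += f

def lmi(w, c):
    """LMI(i, j) = f_ij * log(f_ij * sum f_ij / (f_i* * f_*j))."""
    f = counts[(w, c)]
    return f * math.log(f * total / (f_word[w] * f_feat[c]))

p = 1000  # keep only the p most significant features per word
features = defaultdict(dict)
for (w, c) in counts:
    features[w][c] = lmi(w, c)
features = {w: dict(sorted(fs.items(), key=lambda x: -x[1])[:p])
            for w, fs in features.items()}

def sim(w1, w2):
    """Word similarity = number of features shared by the two words."""
    return len(features[w1].keys() & features[w2].keys())

print(sim("dinner", "lunch"))  # 2 shared features
```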
Method: Noun Sense Induction via Ego-Network
Clustering
▶ The "furniture" and the "data" sense clusters of the word "table".
▶ Graph clustering using the Chinese Whispers algorithm (Biemann, 2006).
September 7, 2016 | 8
Method: Noun Sense Induction via Ego-Network
Clustering (cont.)
▶ Process one word per iteration
▶ Construct an ego-network of the word:
▶ use dependency-based distributional word similarities
▶ the ego-network size (N): the number of related words
▶ the ego-network connectivity (n): how strongly the neighbours are related; this parameter controls the granularity of the sense inventory.
▶ Graph clustering using the Chinese Whispers algorithm.
September 7, 2016 | 9
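A minimal sketch of the ego-network construction and Chinese Whispers clustering, assuming hypothetical `neighbours(word, k)` and `sim(u, v)` helpers backed by the distributional thesaurus; the actual JoSimText implementation may differ in details such as edge weighting and stopping criteria.

```python
import random
from collections import defaultdict

def ego_network(word, neighbours, sim, N=200, n=200):
    """Nodes: the N words most similar to the target (the ego itself is excluded).
    Edges: each node is linked to its n most similar words inside the network."""
    nodes = neighbours(word, N)
    edges = defaultdict(dict)
    for u in nodes:
        for v in neighbours(u, n):
            if v in nodes and v != u:
                edges[u][v] = edges[v][u] = sim(u, v)
    return nodes, edges

def chinese_whispers(nodes, edges, iterations=20, seed=0):
    """Each node starts in its own class; nodes repeatedly adopt the class with
    the highest total edge weight among their neighbours (Biemann, 2006)."""
    random.seed(seed)
    labels = {v: i for i, v in enumerate(nodes)}
    for _ in range(iterations):
        order = list(nodes)
        random.shuffle(order)
        for v in order:
            incident = edges.get(v, {})
            if not incident:
                continue
            weight = defaultdict(float)
            for u, w in incident.items():
                weight[labels[u]] += w
            labels[v] = max(weight, key=weight.get)
    clusters = defaultdict(set)
    for v, c in labels.items():
        clusters[c].add(v)
    return list(clusters.values())   # each cluster is one induced sense
```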
Method: Disambiguation of Induced Noun
Senses
▶ Learning a disambiguation model P(si |C) for each of the induced senses
si ∈ S of the target word w in context C = {c1, ..., cm}.
▶ We use the Naïve Bayes model:
$P(s_i \mid C) = \frac{P(s_i)\,\prod_{j=1}^{|C|} P(c_j \mid s_i)}{P(c_1, \ldots, c_m)}$
▶ The best sense given the context C:
$s_i^{*} = \arg\max_{s_i \in S} P(s_i) \prod_{j=1}^{|C|} P(c_j \mid s_i)$.
September 7, 2016 | 10
Method: Disambiguation of Induced Noun
Senses (cont.)
▶ The prior probability of each sense is computed based on the largest cluster
heuristic:
$P(s_i) = \frac{|s_i|}{\sum_{s_i \in S} |s_i|}$.
▶ Extract sense representations by aggregation of features from all words of the
cluster si .
▶ Probability of the feature cj given the sense si :
$P(c_j \mid s_i) = \frac{1 - \alpha}{\Lambda_i} \sum_{k=1}^{|s_i|} \lambda_k \frac{f(w_k, c_j)}{f(w_k)} + \alpha$
▶ To normalize the score, we divide it by the sum of all the weights $\Lambda_i = \sum_{k=1}^{|s_i|} \lambda_k$.
▶ $\alpha$ is a small number, e.g. $10^{-5}$, added for smoothing.
September 7, 2016 | 11
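A minimal sketch of this disambiguation model, assuming hypothetical dictionaries for the sense clusters, the cluster word weights λ_k, and the word and word-feature counts; the product over features is computed in log-space, which is equivalent to the formulas above.

```python
import math

ALPHA = 1e-5  # smoothing constant

def sense_prior(clusters):
    """P(s_i) via the largest-cluster heuristic: |s_i| / sum over senses of |s_i|."""
    total = sum(len(words) for words in clusters.values())
    return {s: len(words) / total for s, words in clusters.items()}

def feature_prob(feature, cluster, weights, f_wc, f_w):
    """P(c_j | s_i): weighted average of f(w_k, c_j) / f(w_k) over the cluster words."""
    lam_sum = sum(weights[w] for w in cluster)                        # Lambda_i
    score = sum(weights[w] * f_wc.get((w, feature), 0) / f_w[w] for w in cluster)
    return (1 - ALPHA) * score / lam_sum + ALPHA

def disambiguate(context, clusters, weights, f_wc, f_w):
    """Return the sense maximizing P(s_i) * prod_j P(c_j | s_i)."""
    prior = sense_prior(clusters)
    best, best_logp = None, float("-inf")
    for s, cluster in clusters.items():
        logp = math.log(prior[s]) + sum(
            math.log(feature_prob(c, cluster, weights, f_wc, f_w)) for c in context)
        if logp > best_logp:
            best, best_logp = s, logp
    return best
```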
Method: Disambiguation of Induced Noun
Senses (cont.)
▶ To calculate a WSD model, we need to extract from the corpus:
1. the distributional thesaurus;
2. sense clusters;
3. word-feature frequencies.
▶ Sense representations are obtained by “averaging” the feature representations of the words in the sense clusters.
September 7, 2016 | 12
Feature Extraction: Single Models
▶ The method requires sparse word-feature counts f(wk , cj ).
▶ We demonstrate the approach on the four following types of features:
1. Features based on sense clusters: Cluster
▶ Features: words from the induced sense clusters;
▶ Weights: similarity scores.
2. Dependency features: Deptarget, Depall
▶ Features: syntactic dependencies attached to the word, e.g. “subj(•,type)” or
“amod(digital,•)”
▶ Weights: LMI scores of the features.
3. Dependency word features: Depword
▶ Features: words extracted from all syntactic dependencies attached to a target word.
For instance, the feature “subj(•,write)” would result in the feature “write”.
▶ Weights: LMI scores.
4. Trigram features: Trigramtarget, Trigramall
▶ Features: pairs of left and right words around the target word, e.g. “typing_•_or” and
“digital_•_.”.
▶ Weights: LMI scores.
September 7, 2016 | 13
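For illustration, a minimal sketch of extracting a trigram feature for a target token (hypothetical tokenisation; the dependency-based features additionally require a syntactic parser and are not shown).

```python
def trigram_feature(tokens, target_index):
    """Pair of the left and right word around the target, e.g. "typing_•_or"."""
    left = tokens[target_index - 1] if target_index > 0 else "<s>"
    right = tokens[target_index + 1] if target_index + 1 < len(tokens) else "</s>"
    return "{}_•_{}".format(left, right)

tokens = "he was typing Python or Ruby code".split()
print(trigram_feature(tokens, tokens.index("Python")))  # typing_•_or
```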
Feature Combination: Combined Models
▶ Feature-level Combination of Features
▶ Union of context features of different types, such as dependencies and trigrams.
▶ “Stack” feature spaces.
▶ Meta-level Combination of Features
1. Independent sense classifications by single models
2. Aggregation of predictions with:
▶ Majority: selects the sense si chosen by the largest number of single models.
▶ Ranks: the results of each single model are first ranked by their confidence $\hat{P}(s_i \mid C)$, the sense most suitable for the context obtaining rank one, and so on; we then assign the sense with the smallest sum of ranks.
▶ Sum: assigns the sense with the largest sum of classification confidences, i.e. $\sum_i \hat{P}(s_i \mid C_k^i)$, where $i$ is the index of the single model.
September 7, 2016 | 14
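A minimal sketch of the three meta-combination strategies, assuming each single model returns a dictionary of confidences {sense: P̂(s_i | C)} (the example inputs are hypothetical).

```python
from collections import Counter, defaultdict

def combine_majority(predictions):
    """Sense chosen by the largest number of single models."""
    votes = Counter(max(p, key=p.get) for p in predictions)
    return votes.most_common(1)[0][0]

def combine_ranks(predictions):
    """Sense with the smallest sum of per-model ranks (rank 1 = most confident)."""
    rank_sum = defaultdict(int)
    for p in predictions:
        for rank, sense in enumerate(sorted(p, key=p.get, reverse=True), start=1):
            rank_sum[sense] += rank
    return min(rank_sum, key=rank_sum.get)

def combine_sum(predictions):
    """Sense with the largest sum of classification confidences."""
    conf_sum = defaultdict(float)
    for p in predictions:
        for sense, conf in p.items():
            conf_sum[sense] += conf
    return max(conf_sum, key=conf_sum.get)

models = [{"table#0": 0.7, "table#1": 0.3}, {"table#0": 0.4, "table#1": 0.6}]
print(combine_majority(models), combine_ranks(models), combine_sum(models))
```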
Corpora used for experiments
Corpus      # Tokens       Size       Text Type
Wikipedia   1.863 · 10⁹    11.79 GB   encyclopaedic
ukWaC       1.980 · 10⁹    12.05 GB   Web pages
Table: Corpora used for training our models.
September 7, 2016 | 15
Results: Evaluation on the “Python-Ruby-Jaguar” (PRJ) dataset: 3 words, 60 contexts, 2 senses per word
▶ A simple dataset: 60 contexts, 2 homonymous senses per word.
▶ The models based on the meta-combinations are not shown for brevity as they
did not improve performance of the presented models in terms of F-score.
September 7, 2016 | 16
Results: Evaluation on the TWSI dataset: 1012
nouns, 145140 contexts, 2.33 senses per word
September 7, 2016 | 17
Results: the TWSI dataset: effect of the corpus
choice on the WSD performance
▶ 10 best models according to the F-score on the TWSI dataset
▶ Trained on Wikipedia and ukWaC corpora
September 7, 2016 | 18
Results: Evaluation on the SemEval 2013 Task
13 dataset: 20 nouns, 1848 contexts
September 7, 2016 | 19
Thank you!
September 7, 2016 | 20
Word Embeddings for WSD using Graph-Based
Distributional Semantics
▶ Pelevina M., Arefiev N., Biemann C., Panchenko A. "Making Sense of Word
Embeddings". In Proceedings of the 1st Workshop on Representation
Learning for NLP. ACL 2016, Berlin, Germany. Best Paper Award
▶ An approach to learn word sense embeddings.
▶ The same approach as presented above, but using word2vec instead of
JoBimText: dense vs sparse feature representations.
September 7, 2016 | 21
Overview of the contribution
Prior methods:
▶ Induce inventory by clustering of word instances (Li and Jurafsky, 2015)
▶ Use existing inventories (Rothe and Schütze, 2015)
Our method:
▶ Input: word embeddings
▶ Output: word sense embeddings
▶ Word sense induction by clustering of word ego-networks
▶ Word sense disambiguation based on the induced sense representations
September 7, 2016 | 22
Learning Word Sense Embeddings
September 7, 2016 | 23
Word Sense Induction: Ego-Network Clustering
▶ The "furniture" and the "data" sense clusters of the word "table".
▶ Graph clustering using the Chinese Whispers algorithm (Biemann, 2006).
September 7, 2016 | 24
Neighbours of Word and Sense Vectors
Vector    Nearest Neighbours
table     tray, bottom, diagram, bucket, brackets, stack, basket, list, parenthesis, cup, trays, pile, playfield, bracket, pot, drop-down, cue, plate
table#0   leftmost#0, column#1, randomly#0, tableau#1, top-left#0, indent#1, bracket#3, pointer#0, footer#1, cursor#1, diagram#0, grid#0
table#1   pile#1, stool#1, tray#0, basket#0, bowl#1, bucket#0, box#0, cage#0, saucer#3, mirror#1, birdcage#0, hole#0, pan#1, lid#0
▶ Neighbours of the word “table" and its senses produced by our method.
▶ The neighbours of the initial vector belong to both senses.
▶ The neighbours of the sense vectors are sense-specific.
September 7, 2016 | 25
Word Sense Disambiguation
1. Context Extraction
▶ use context words around the target word
2. Context Filtering
▶ based on context word’s relevance for disambiguation
3. Sense Choice
▶ maximize similarity between context vector and sense vector
September 7, 2016 | 26
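A minimal sketch of the sense-choice step, assuming the word and sense vectors are numpy arrays keyed by "word" and "word#senseId"; the context vector here is the plain average of the context word vectors, a simplification of the filtering step described above.

```python
import numpy as np

def choose_sense(context_words, word_vectors, sense_vectors, target):
    """Pick the sense of `target` whose vector is most similar to the context vector."""
    ctx = np.mean([word_vectors[w] for w in context_words if w in word_vectors], axis=0)
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    candidates = {s: v for s, v in sense_vectors.items() if s.startswith(target + "#")}
    return max(candidates, key=lambda s: cos(ctx, candidates[s]))
```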
Word Sense Disambiguation: Example
September 7, 2016 | 27
Evaluation on SemEval 2013 Task 13 dataset:
comparison to the state-of-the-art
Model Jacc. Tau WNDCG F.NMI F.B-Cubed
AI-KU (add1000) 0.176 0.609 0.205 0.033 0.317
AI-KU 0.176 0.619 0.393 0.066 0.382
AI-KU (remove5-add1000) 0.228 0.654 0.330 0.040 0.463
Unimelb (5p) 0.198 0.623 0.374 0.056 0.475
Unimelb (50k) 0.198 0.633 0.384 0.060 0.494
UoS (#WN senses) 0.171 0.600 0.298 0.046 0.186
UoS (top-3) 0.220 0.637 0.370 0.044 0.451
La Sapienza (1) 0.131 0.544 0.332 – –
La Sapienza (2) 0.131 0.535 0.394 – –
AdaGram, α = 0.05, 100 dim 0.274 0.644 0.318 0.058 0.470
w2v 0.197 0.615 0.291 0.011 0.615
w2v (nouns) 0.179 0.626 0.304 0.011 0.623
JBT 0.205 0.624 0.291 0.017 0.598
JBT (nouns) 0.198 0.643 0.310 0.031 0.595
TWSI (nouns) 0.215 0.651 0.318 0.030 0.573
September 7, 2016 | 28
Conclusion
▶ Novel approach for learning word sense embeddings.
▶ Can use existing word embeddings as input.
▶ WSD performance comparable to the state-of-the-art systems.
▶ Source code and pre-trained models:
https://github.com/tudarmstadt-lt/SenseGram
September 7, 2016 | 29
Evaluation based on the TWSI dataset: a large-scale dataset for development
September 7, 2016 | 30
Thank you!
September 7, 2016 | 31