Explaining Character-Aware Neural Networks for Word-Level Prediction:
Do They Discover Linguistic Rules?
Frederic Godin, Kris Demuynck, Joni Dambre, Wesley Deneve and Thomas Demeester
Department of Electronics and Information Systems
Ghent University, Belgium
Introduction
Fréderic Godin - Explaining Character-Aware Neural Networks for Word-Level Prediction
Example: Rule-based tagger for PoS tagging
Brill (1994)’s transformation-based error-driven tagger
Template
Change the most-likely tag X to
Y if the last (1,2,3,4) characters
of the word are x
Rule
Change the tag common noun to
plural common noun if the word has
suffix -s
Easily interpretable
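Such a transformation rule is trivially executable; a minimal sketch (tag names and helper naming are illustrative, not Brill's actual implementation):

```python
def apply_suffix_rule(words, tags, suffix="s", old="NN", new="NNS"):
    """One Brill-style rule: change tag `old` to `new` if the word ends in `suffix`."""
    return [new if t == old and w.endswith(suffix) else t
            for w, t in zip(words, tags)]

words = ["the", "dogs", "bark"]
tags = ["DT", "NN", "VB"]            # most-likely tags before the rule fires
print(apply_suffix_rule(words, tags))  # ['DT', 'NNS', 'VB']
```

Reading the rule off the code is exactly the "follow the trace" interpretability the next slide refers to.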
Interpretability in NLP used to be easy
Rule-based/Tree-based models
Shallow statistical models (E.g., Logistic regression, CRF)
Very transparent: follow the trace
Essentially: weight + feature
Current NLP interpretability...
Our proposed method
We present contextual decomposition (CD) for CNNs
- Extends CD for LSTMs (Murdoch et al. 2018)
- White-box approach to interpretability
We trace back morphological tagging decisions to the
character-level
- Which characters are important?
- Do the patterns match linguistically known rules?
- Do CNNs and BiLSTMs behave differently?
Contextual decomposition
for CNNs
Contextual decomposition
Idea: every output value can be “decomposed” into
- Relevant contributions originating from the input we are interested in
(E.g., some characters)
- Irrelevant contributions originating from all the other inputs (E.g., all
the other characters in a word)
[Figure: the word “economicas” is fed to a CNN that predicts “plural”; the output is split into relevant contributions (the characters of interest) and irrelevant contributions (the remaining characters of the word).]
Contextual decomposition for CNNs
Three main components of a CNN:
- Convolution
- Activation function
- Max-over-time pooling
Classification layer
[Figure: CNN architecture — character input “^economicas$” → CNN filters → max-over-time pooling → FC layer → prediction “Gender = feminine”.]
Contextual decomposition for CNNs: Convolution
Output of single convolutional filter at timestep t:
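The formula itself was an image and is lost in this export; a plausible reconstruction from the slide's legend (filter size n, relevant index set S, filter columns Wᵢ; grouping the bias b with the irrelevant part is one common convention) is:

```latex
z_t = \sum_{i=1}^{n} W_i\, x_{t+i-1} + b
    = \underbrace{\sum_{i:\; t+i-1 \in S} W_i\, x_{t+i-1}}_{\beta_t\ \text{(relevant)}}
    \;+\; \underbrace{\sum_{i:\; t+i-1 \notin S} W_i\, x_{t+i-1} + b}_{\gamma_t\ \text{(irrelevant)}}
```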
n = filter size
S = indexes of relevant inputs
Wi = i-th column of filter W
[Figure: for “^economicas$”, one filter window covers positions 8, 9, 10, 11; with relevant index 9, positions 8, 10 and 11 are irrelevant.]
Contextual decomposition for CNNs: Activation func.
Goal: linearize the activation function so that its output can be split.
Linearization formula:
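The formula was an image; a reconstruction following the split used for CD of LSTMs (Murdoch et al. 2018), which this work extends, averages the two possible orderings of the arguments:

```latex
f(\beta_t + \gamma_t) = L_f(\beta_t) + L_f(\gamma_t), \qquad
L_f(\beta_t) = \tfrac{1}{2}\Big(\big[f(\beta_t) - f(0)\big] + \big[f(\beta_t + \gamma_t) - f(\gamma_t)\big]\Big)
```

For activations with f(0) = 0 (ReLU, tanh), the two parts sum exactly back to f(βₜ + γₜ).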
Contextual decomposition for CNNs: Max pooling
Max-over-time pooling:
Determine t on the full (unsplit) output first, then copy that timestep’s split:
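In symbols (a reconstruction of the lost formula, in the notation used above):

```latex
t^{*} = \arg\max_{t}\,\big(\beta_t + \gamma_t\big), \qquad
\beta = \beta_{t^{*}}, \quad \gamma = \gamma_{t^{*}}
```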
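The three steps above can be put together in a small numpy sketch (a toy re-implementation under my own naming, not the authors’ code); it checks that the relevant and irrelevant parts sum back to the ordinary forward pass:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def linearize(f, beta, gamma):
    # Shapley-style split of f(beta + gamma) into a relevant and an
    # irrelevant part, as in CD for LSTMs (Murdoch et al. 2018)
    rel = 0.5 * ((f(beta) - f(0.0)) + (f(beta + gamma) - f(gamma)))
    irr = 0.5 * ((f(gamma) - f(0.0)) + (f(beta + gamma) - f(beta)))
    return rel, irr

def cd_filter(x, W, b, S):
    # Contextual decomposition of one conv filter + ReLU + max-over-time.
    # x: (T, d) character embeddings, W: (n, d) filter, b: bias,
    # S: set of input positions whose contribution we want to isolate.
    T, n = x.shape[0], W.shape[0]
    rels, irrs = [], []
    for t in range(T - n + 1):
        beta = sum((W[i] @ x[t + i] for i in range(n) if t + i in S), 0.0)
        gamma = sum((W[i] @ x[t + i] for i in range(n) if t + i not in S), 0.0) + b
        r, g = linearize(relu, beta, gamma)
        rels.append(r)
        irrs.append(g)
    rels, irrs = np.array(rels), np.array(irrs)
    t_star = int(np.argmax(rels + irrs))  # pool on the full, unsplit output
    return rels[t_star], irrs[t_star]

rng = np.random.default_rng(0)
T, d, n = 12, 8, 4                   # e.g. "^economicas$" with 8-dim embeddings
x = rng.normal(size=(T, d))
W = rng.normal(size=(n, d))
b = 0.1
rel, irr = cd_filter(x, W, b, S={8, 9, 10})  # isolate the suffix characters

# sanity check: the two parts sum back to the ordinary forward pass
z = np.array([sum(W[i] @ x[t + i] for i in range(n)) + b for t in range(T - n + 1)])
assert np.isclose(rel + irr, relu(z).max())
```

Because ReLU satisfies f(0) = 0, the relevant and irrelevant parts add up exactly to the pooled filter output, so the decomposition loses nothing.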
Contextual decomposition of classification layer
Probability of a certain class:
We simplify:
Relevant contribution to class j
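The two formulas were images; a reconstruction in the notation used above (z the pooled feature vector): the softmax is dropped and the irrelevant part discarded, so the contribution of the selected characters to class j is read off the logits (the label “score” is mine):

```latex
p_j = \operatorname{softmax}(W z + b)_j, \qquad
\text{score}_j \approx (W \beta)_j
```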
Experiments
Task
Morphological tagging: predict morphological labels for a word (gender, tense, singular/plural, ...)
economicas
For a subset of words, we have manual segmentations and
annotations
economicas → lemma=económico, gender=feminine, number=plural
Datasets
Universal Dependencies v1.4:
- Finnish, Spanish and Swedish
- Select all unique words and their morphological labels
Manual annotations and segmentations of 300 test set words
Architectures: CNN vs BiLSTM
[Figure: both models read the character sequence “^economicas$”. CNN: CNN filters → max-over-time pooling → FC layer → “Gender = feminine”. BiLSTM: bidirectional LSTM over the characters → FC layer → “Gender = feminine”.]
Do the NN patterns follow manual segmentations?
All = every possible combination of characters
Cons = all consecutive character n-grams
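The two candidate sets can be made concrete with a small sketch (function names are mine, not from the slides):

```python
from itertools import combinations

def cons(word):
    # "Cons": all consecutive character n-grams of a word
    return [word[i:j] for i in range(len(word))
            for j in range(i + 1, len(word) + 1)]

def all_combinations(word):
    # "All": every possible (not necessarily consecutive) set of character positions
    idx = range(len(word))
    return [c for r in range(1, len(word) + 1) for c in combinations(idx, r)]

print("or" in cons("kronor"))        # True: a Swedish plural suffix candidate
print(len(all_combinations("abc")))  # 7: the non-empty subsets of 3 positions
```

“All” grows exponentially with word length, which is why restricting the comparison to consecutive n-grams is the practical choice.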
Visualizing contributions: 1 character
[Figure: per-character contribution heatmap for the Spanish word “^gratuita$”; label: gender=feminine.]
Visualizing contributions: 2 characters (Swedish)
[Figure: contribution heatmaps over all character pairs of the Swedish word “^kronor$” (label: number=plural), for the CNN (left) and the BiLSTM (right).]
Most important patterns per language: Spanish
Linguistic rules for feminine gender:
- Feminine adjectives often end with “a”
- Nouns ending with “dad” or “ión” are often feminine
Found patterns:
- “a” is a very important pattern
- “dad” and “sió” are important trigrams
Most important patterns per language: Swedish
Linguistic rules for plural form:
- 5 suffixes: -or, -ar, -(e)r, -n, and no ending
  (“na” is the definite article suffix in plural forms)
Found patterns:
- “or” and “ar”
- But also “na” and “rn”
Interactions/compositions of patterns
How do positive and negative patterns interact?
Consider the Spanish verb “gusta”
- Gender=Not Applicable (NA)
- We know that the suffix “a” is an indicator for gender=feminine
Consider the most positive/negative set of characters per class:
The stem provides counterevidence for gender=feminine
Conclusion
Summary
We introduced a white-box approach to understanding CNNs
We showed that:
- BiLSTMs and CNNs sometimes choose different patterns
- The learned patterns coincide with our linguistic knowledge
- Sometimes other plausible patterns are used
Questions?
Fréderic Godin
Ph.D. Researcher Deep Learning and NLP
IDLab
E frederic.godin@ugent.be
@frederic_godin
www.fredericgodin.com
idlab.technology / idlab.ugent.be
