Modal sense classification using a convolutional neural network
Ana Marasović
Institut für Computerlinguistik
Ruprecht-Karls-Universität Heidelberg
01.07.2016
Modal verbs are ambiguous between the following senses:
1. epistemic (possibility): He could be at home.
2. deontic (permission/obligation): You can enter now.
3. dynamic (capability): Only John can solve this problem.
MSC is a special case of WSD.
Mein Gott, sie ______ sich schrecklich gefühlt haben! (“My God, they ______ have felt terrible!”)
Why do we care about it?
Distinguishing facts from hypotheses and speculations, or apprehended, planned, desired states of affairs:
• planned (positive): should, must + deontic
• apprehended (negative): should not + deontic
• disliked or forbidden (negative): may not + deontic
• desired (positive): should + deontic
Tasks of relevance
• factuality recognition
• sentiment analysis
• opinion mining
• argumentation
• opinion summarization
Related work
• Ruppenhofer and Rehbein (2012) → R&R
  • relatively high performance
  • shallow lexical and syntactic features
  • small-scale manually annotated corpora
  • large distributional bias
• Zhou et al. (2015) → Z+
Related work (continued)
Problems of the feature-based approach (R&R) and the solutions proposed so far (Z+):
• sparsity → heuristically tagged data (Z+)
• distributional bias → balancing of the data (Z+)
• shallow lexical and syntactic features → enriching with semantic features (Z+)
Still open:
• difficulties beating the majority sense baseline
• adapting to other languages
• manual crafting of the features
→ this talk: a CNN as a candidate to address these open issues
Outline
• Introduction
• Convolutional neural network (CNN) for sentence modeling
• CNN for MSC
• CNN for general word sense disambiguation (WSD)
• Future work
Convolutional neural networks for sentence modeling
N. Kalchbrenner et al. “A Convolutional Neural Network for Modelling Sentences.” ACL (2014).
Y. Kim. “Convolutional Neural Networks for Sentence Classification.” EMNLP (2014).
One-layer convolutional neural network
Example input sentence: “I like this movie very much !”
• The input matrix stacks the word vectors of the sentence, one row per word; the filter width equals the word vector dimension, and the filter region size (here 4) is the number of rows the filter covers.
• ⊗ denotes the Hadamard (element-wise) product of two matrices; Σ denotes the sum of the elements of a matrix.
• Applying ⊗ and then Σ to the filter and the first four rows of the input matrix yields the value that corresponds to the first 4-gram of the input sentence.
• The stride is one: the filter is shifted down by one row at a time, yielding the values that correspond to the second, third and fourth 4-gram of the input sentence.
• These values, passed through a non-linearity, form the feature map.
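A minimal numpy sketch (illustration only, not code from the talk) of this computation for a single filter; the embeddings and filter weights below are random placeholders:

```python
# One filter sliding over the input matrix with stride one: Hadamard product
# with each 4-row window, sum of the elements, then a non-linearity,
# giving the feature map.
import numpy as np

sentence = ["I", "like", "this", "movie", "very", "much", "!"]
emb_dim, m = 5, 4                                  # toy embedding size, region size 4
rng = np.random.default_rng(0)
X = rng.normal(size=(len(sentence), emb_dim))      # input matrix: one row per word
W = rng.normal(size=(m, emb_dim))                  # filter: region size x filter width

feature_map = np.array([
    np.maximum(np.sum(X[i:i + m] * W), 0.0)        # ⊗ then Σ, then a ReLU non-linearity
    for i in range(len(sentence) - m + 1)          # narrow convolution: s - m + 1 values
])
print(feature_map.shape)                           # (4,) - one value per 4-gram
```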
One-channel convolutional neural network used in Kim (2014). Figure taken from Zhang et al. (2015).
Properties of one-layer CNN
• CNN handles input sequences of varying length
• CNN does not depend on external language-specific features such as dependency or constituent parse trees
• CNN is sensitive to the order of the words in the sentence
• Filters serve as feature detectors
• Convolving the same filter with the n-gram at every position in the sentence allows the features to be extracted independently of their position in the sentence
CNN for MSC
MSC as a sentence classification task with a fixed sense inventory.
Experimental setup: data
Corpora
• MPQAE (R&R)
• EPOSE and EPOSG: extracted from EuroParl & OpenSubtitles (EPOS) and heuristically tagged via cross-lingual sense projection; in case of rare extractions for German, additional data from modal verbs with shared senses was added
• MASCE: manually annotated subset of the multi-genre corpus MASC
• TESTG: manually annotated instances from EPOSG
Experimental setup: continued
Input representation: tuned and static versions of the following word vectors
• randomly initialized
• word2vec (Mikolov et al.)
• dependency-based (Levy et al.)
Hyperparameters (Zhang, 2015):
• non-linearity: ReLU
• filter region sizes: 3, 4, 5
• number of filters per region size: 100
• dropout keep probability: 0.5
• l2 regularization coefficient: 10⁻³
• number of iterations: 1001
• mini-batch size: 50
• optimizer: Adam optimization algorithm with learning rate 10⁻⁴
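As an illustration of how such a model could be wired up with the hyperparameters above, here is a rough PyTorch sketch (the talk does not include code; the vocabulary size, embedding dimension, number of senses per modal verb, and the use of weight decay as a stand-in for the l2 coefficient are assumptions made for this example):

```python
# A sketch of a Kim (2014)-style one-layer CNN for modal sense classification.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalSenseCNN(nn.Module):
    def __init__(self, vocab_size=10166, emb_dim=300, num_senses=3,
                 region_sizes=(3, 4, 5), num_filters=100, dropout_keep=0.5,
                 static=False):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # "static" word vectors are frozen; "tuned" ones are updated during training
        self.embedding.weight.requires_grad = not static
        # one 2-D convolution per filter region size (3, 4, 5), 100 filters each
        self.convs = nn.ModuleList(
            [nn.Conv2d(1, num_filters, (m, emb_dim)) for m in region_sizes])
        self.dropout = nn.Dropout(1.0 - dropout_keep)
        self.fc = nn.Linear(num_filters * len(region_sizes), num_senses)

    def forward(self, token_ids):                      # (batch, sent_len), padded to >= 5
        x = self.embedding(token_ids).unsqueeze(1)     # (batch, 1, sent_len, emb_dim)
        # ReLU feature maps, then max-over-time pooling for each filter
        pooled = [F.relu(conv(x)).squeeze(3).max(dim=2).values for conv in self.convs]
        return self.fc(self.dropout(torch.cat(pooled, dim=1)))

model = ModalSenseCNN()
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4, weight_decay=1e-3)
```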
Impact of word vectors (E)
• train dataset: balanced 80% MPQA (R&R) + EPOSE
• test dataset: (unbalanced) 20% MPQA
• accuracy with 5-fold CV
Comparison of CNN and baselines (E)
Classifiers trained on the balanced dataset. For every modal verb the best word vectors for it are used.

            can (3)   could (3)   may (2)   must (2)   should (2)   micro
BLrandom    33.33     33.33       50.00     50.00      50.00        41.49
MaxEnt      59.64     61.25       92.14     87.60      90.11        74.88
NN          56.01     55.42       90.00     75.42      88.68        69.74
CNN         65.78     67.50       93.57     93.82      90.77        79.29

• train dataset: balanced 80% MPQA (R&R) + EPOSE
• test dataset: (unbalanced) 20% MPQA
• accuracy with 5-fold CV
Comparison of CNN and baselines (E)
Classifiers trained on the unbalanced dataset. For every modal verb the best word vectors for it are used.

            can (3)   could (3)   may (2)   must (2)   should (2)   micro
BLmajority  69.92     65.00       93.57     94.32      90.81        80.18
MaxEnt      64.76     63.33       92.14     92.78      91.48        78.01
NN          67.29     66.08       94.23     86.37      90.96        77.93
CNN         70.87     66.55       93.49     94.97      90.59        80.74

• train dataset: unbalanced 80% MPQA (R&R) + EPOSE
• test dataset: (unbalanced) 20% MPQA
• accuracy with 5-fold CV
Impact of word vectors (G)
• train dataset: balanced EPOSG
• test dataset: TESTG
• accuracy on the test dataset
• 1772 of the 10166 words in the vocabulary don’t have a pre-trained word2vec vector
• 2087 of the 10166 words in the vocabulary don’t have a pre-trained dependency-based vector
Comparison of CNN and baselines (G)

           dürfen   können   müssen   sollen   micro
BLrandom   50.00    33.33    50.00    50.00    39.10
NN         80.30    48.89    74.63    49.75    60.00
CNN        99.49    81.78    88.06    76.62    86.02

• train dataset: balanced EPOSG
• test dataset: TESTG
• accuracy on the test dataset
Visualizing what filters have learned
• convolve a trained filter with the first, second, …, n-th sentence and record the max value of the resulting feature map for each sentence
⇒ take the top 15 sentences w.r.t. the max value
⇒ extract from each sentence the n-gram corresponding to the max value
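A minimal sketch (an assumed helper, not the author's code) of this inspection procedure for one trained filter; `embed` and the filter weights are placeholders, and each sentence is assumed to have at least m tokens:

```python
import numpy as np

def top_ngrams_for_filter(sentences, embed, filt, top_k=15):
    """sentences: list of token lists; embed(token) -> word vector;
    filt: (m, emb_dim) weights of one trained filter."""
    m = filt.shape[0]
    scored = []
    for tokens in sentences:
        X = np.stack([embed(t) for t in tokens])          # (sent_len, emb_dim)
        # narrow convolution: one value per m-gram (Hadamard product + sum)
        fmap = np.array([np.sum(X[i:i + m] * filt)
                         for i in range(len(tokens) - m + 1)])
        best = int(np.argmax(fmap))
        scored.append((fmap[best], tokens[best:best + m])) # max value and its m-gram
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]                                  # top-k sentences' m-grams
```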
Top 15 5-grams with respect to one filter, illustrated in the embedded space.
Feature detectors for must

feature                               sense   example
past reading of the emb. verb         ep      you must have been out last night
non-past reading of the emb. verb     de      we must take further efforts
stative reading of the emb. verb      ep      you must think me a perfect tool
eventive reading of the emb. verb     de      we must develop a policy
passive construction                  de      actual steps must be taken
negation                              de      we must not fear
domain-specific vocabulary            de      European parliament, present regulation, fisheries policy
telic clauses                         de      to address these problems, to prevent both forum, to exert maximum influence
discourse markers                     de      but, and (then)
Feature detectors for müssen and können

feature                                                       sense   example
features that relate to observations on English:
attitude predicates                                           ep      believe, not know, tell me, have an idea, be afraid
adverbials                                                    ep      possibly
conditionals                                                  ep      if
counterfactual and negative polarity context                  ep      not be the case, how, ever
placeholders for propositions                                 ep      it
abstract concepts                                             ep      idea, music, grades, application
indefinite subjects                                           ep      one
3rd person pronouns                                           ep      -
verb-object combinations for an action that can be granted    de      use telephone
achievements (können only)                                    dy      present report
Other observations
Statistics
1) average distance of top n-grams from the modal verb
2) average distance of top n-grams that are to the left of the modal verb
3) as 2), but for n-grams to the right and n-grams starting with the modal
Observations
• there are no greater overall distances for German compared to English
• for German, considerably more n-grams that include the modal verb, especially for epistemic readings of können, müssen, dürfen, but not for sollen
• strikingly larger distances to the left of the modal verb for epistemic readings compared to non-epistemic readings
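A minimal sketch (an assumed helper, not from the talk) of statistic 1) above: the average distance of a filter's top n-grams from the modal verb, counting an n-gram that spans the modal as distance 0.

```python
def avg_distance_to_modal(top_ngram_spans, modal_positions):
    """top_ngram_spans: (start, end) token indices of the top n-gram per sentence;
    modal_positions: token index of the modal verb in the same sentence."""
    distances = []
    for (start, end), modal in zip(top_ngram_spans, modal_positions):
        if start <= modal <= end:          # n-gram includes the modal verb
            distances.append(0)
        elif end < modal:                  # n-gram lies to the left of the modal
            distances.append(modal - end)
        else:                              # n-gram lies to the right of the modal
            distances.append(start - modal)
    return sum(distances) / len(distances)
```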
Recap
• novel approach for multilingual MSC using a one-layer CNN
• CNN approach outperforms feature-based baselines
• CNN is able to learn meaningful structure from data
• CNN learns both known and previously unattested linguistic features for MSC and domain-specific concepts
• CNN learns linguistic and semantic features from flexible window regions without syntactic pre-processing
• CNN is easily adaptable to novel languages
• CNN allows for insightful model inspection, but this requires manual work
Word sense disambiguation
• if the features the CNN picks relate to semantic factors
  → the CNN should be a good candidate for WSD
• the features the CNN picks relate to n-grams independently of their position in the sentence
  → the CNN is flexible
  → can wider context be useful for WSD?
Comparison with the results from Rothe and Schütze (2015)

SensEval-3
surrounding word     65.30
local collocation    64.70      IMS (state of the art)     72.30
Snaive - product     62.20      IMS + Snaive - product     69.40
S - cosine           60.50      IMS + S - cosine           72.40
S - product          64.30      IMS + S - product          73.60
S - raw              63.10      IMS + S - raw              66.80
CNN                  67.90      IMS + CNN                  72.00
AutoExtend
• for the sentence representation all constituent words are available
• rich knowledge about the target word
CNN
• for the sentence representation all constituent words are available
• without any knowledge of the target word
• flexibility (wider context) in picking relevant n-grams
Future work
Future work on WSD
• tune hyperparameters
• use more data
• use a deeper network
Future work in general
• extraction of opinion entities: opinion expressions, their holders and targets
• implicit sentiment, where MSC plays a role:
  • planned (positive): should, must + deontic
  • apprehended (negative): should not + deontic
  • disliked or forbidden (negative): may not + deontic
  • desired (positive): should + deontic
Thank you for your attention!
References
• N. Kalchbrenner et al. “A Convolutional Neural Network for Modelling Sentences.” ACL (2014).
• Y. Kim. “Convolutional Neural Networks for Sentence Classification.” EMNLP (2014).
• O. Levy and Y. Goldberg. “Dependency-Based Word Embeddings.” ACL (2014).
• T. Mikolov et al. “Distributed Representations of Words and Phrases and their Compositionality.” Advances in Neural Information Processing Systems 26 (2013).
• S. Rothe and H. Schütze. “AutoExtend: Extending Word Embeddings to Embeddings for Synsets and Lexemes.” ACL (2015).
• J. Ruppenhofer and I. Rehbein. “Yes We Can!? Annotating the Senses of English Modal Verbs.” In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC), pp. 24-26 (2012).
• Y. Zhang and B. Wallace. “A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification.” CoRR abs/1510.03820 (2015).
• M. Zhou, A. Frank, A. Friedrich and A. Palmer. “Semantically Enriched Models for Modal Sense Classification.” In Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics (LSDSem), p. 44 (2015).
Narrow and wide convolution
(Figure: the sentence “I like this movie very much !” with PAD rows added above and below, over which the filter also slides in the wide case.)
s = sentence length
m = filter region size
narrow convolution ⇒ feature map size equals s - m + 1
wide convolution ⇒ feature map size equals s + m - 1
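A small illustration (an assumed example, 1-D for simplicity) of the two feature-map sizes above: "narrow" corresponds to numpy's 'valid' convolution mode, "wide" to its 'full' mode, which is equivalent to padding with m - 1 zeros on each side, like the PAD rows in the figure.

```python
import numpy as np

s, m = 7, 4                        # "I like this movie very much !", region size 4
x = np.random.randn(s)             # one row of the input matrix (one embedding dimension)
w = np.random.randn(m)             # the corresponding row of a filter

narrow = np.convolve(x, w, mode='valid')
wide = np.convolve(x, w, mode='full')
assert narrow.size == s - m + 1    # 4 values
assert wide.size == s + m - 1      # 10 values
```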
Will a one-layer CNN be sufficient?
Soldiers can drink alcohol until they fall over. (dynamic-capability)
Soldiers can drink alcohol at late hours only. (deontic-permission)
• a filter that was trained to capture the occurrence of phrases like the unigram “soldiers”
• a filter that was trained to capture the occurrence of phrases like the bigram “drink alcohol”
• a filter that was trained to capture the occurrence of phrases like the 4-gram “at late hours only”
How, then, does this miracle come about?
Dynamic Convolutional Neural Network (DCNN)
Kalchbrenner et al. (2014)
• wide type of convolution
• one-dimensional filter applied to each row of the input matrix
• stacked convolutional layers
• k-max pooling
  • k is a function of the length of the sentence and the depth of the network
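A minimal sketch (assumed, per feature-map row) of k-max pooling as used in the DCNN: keep the k largest values of a feature map while preserving their original order in the sentence.

```python
import numpy as np

def k_max_pooling(feature_map, k):
    """feature_map: 1-D array of convolution outputs for one filter row."""
    if feature_map.size <= k:
        return feature_map
    # indices of the k largest values, sorted back into sentence order
    top_k_idx = np.sort(np.argpartition(feature_map, -k)[-k:])
    return feature_map[top_k_idx]

fmap = np.array([0.1, 0.9, 0.3, 0.7, 0.2, 0.8])
print(k_max_pooling(fmap, k=3))   # [0.9 0.7 0.8] - top-3 values, original order kept
```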
Train-test configurations

          train                                       test
English   80% MPQAE (R&R) + EPOSE, +/- balancing      a. 20% MPQAE (R&R) w/ 5-fold CV
                                                      b. MASCE
German    EPOSG                                       TESTG
Impact of word vectors (E)

              can (3)   could (3)   may (2)   must (2)   should (2)
w2v-static    65.02     51.67       93.57     93.82      90.77
w2v-tuned     63.73     54.17       93.57     93.82      90.77
deps-static   65.78     56.67       93.57     93.82      90.77
deps-tuned    59.89     67.50       93.57     93.29      90.42
rand-static   63.99     46.67       93.57     92.79      90.77
rand-tuned    64.50     48.33       93.57     92.79      90.77

• train dataset: balanced 80% MPQA (R&R) + EPOSE
• test dataset: (unbalanced) 20% MPQA
• accuracy with 5-fold CV
MASC
Balanced vs. unbalanced training when evaluated on MPQAE and MASCE for CNN and MaxEnt:

          test dataset   BA (balanced training)   UBA (unbalanced training)
MaxEnt    MPQAE          74.88                    78.01
          MASCE          3/19                     15/19
CNN       MPQAE          79.92                    80.74
          MASCE          13/19                    3/19

Difference between CNN and MaxEnt trained on MPQAE + EPOS and evaluated on MASC:

          training data        CNN       MaxEnt
          balanced (BA)        19/19     0/19
          unbalanced (UBA)     12/19     7/19
Appendix: MASC (figures)
Comparison of CNN and baselines (G): table repeated in the appendix, with figures for CNN (German) and NN (German).
• train dataset: balanced EPOSG
• test dataset: TESTG
• accuracy on the test dataset
Appendix: number of instances
Appendix: number of instances (G)

train dataset: balanced EPOSG           test dataset: TESTG
          ep      de      dy                      ep     de     dy
dürfen    1000    1000    0             dürfen    98     100    0
können    1000    1000    1000          können    100    47     100
müssen    1000    1000    0             müssen    34     100    0
sollen    1000    1000    0             sollen    101    100    0