Modal sense classification using a convolutional neural network
Ana Marasović
Institut für Computerlinguistik
Ruprecht-Karls-Universität Heidelberg
01.07.2016
Modal verbs are ambiguous between the following senses:
1. epistemic (possibility): He could be at home.
2. deontic (permission/obligation): You can enter now.
3. dynamic (capability): Only John can solve this problem.
MSC is a special case of WSD.
Mein Gott, sie ______ sich schrecklich gefühlt haben! (“My God, they ______ have felt terrible!”)
Why do we care about it?
Distinguishing facts from hypotheses and speculations, or apprehended, planned, desired states of affairs:
• planned (positive): should, must + deontic
• apprehended (negative): should not + deontic
• disliked or forbidden (negative): may not + deontic
• desired (positive): should + deontic
Tasks of relevance
• factuality recognition
• sentiment analysis
• opinion mining
• argumentation
• opinion summarization
Related work
• Ruppenhofer and Rehbein (2012) → R&R
  • relatively high performance
  • shallow lexical and syntactic features
  • small-scale manually annotated corpora
  • large distributional bias
• Zhou et al. (2015) → Z+
Related work (continued)
Problems of the feature-based approach (R&R) and the solutions proposed so far (Z+):
• sparsity → heuristically tagged data (Z+)
• distributional bias → balancing of the data (Z+)
• shallow lexical and syntactic features → enriching with semantic features (Z+)
Still open:
• difficulties beating the majority sense baseline
• adapting to other languages
• manual crafting of the features
→ this talk: a CNN as a candidate to address these open issues
Outline
• Introduction
• Convolutional neural network (CNN) for sentence modeling
• CNN for MSC
• CNN for general word sense disambiguation (WSD)
• Future work
Convolutional neural networks for sentence modeling
N. Kalchbrenner et al. “A Convolutional Neural Network for Modelling Sentences.” ACL (2014).
Y. Kim. “Convolutional Neural Networks for Sentence Classification.” EMNLP (2014).
One-layer convolutional neural network
Example input sentence: “I like this movie very much !”
• The input matrix stacks the word vectors of the sentence, one row per word; the filter width equals the word vector dimension, and the filter region size (here 4) is the number of rows the filter covers.
• ⊗ denotes the Hadamard (element-wise) product of two matrices; Σ denotes the sum of the elements of a matrix.
• Applying ⊗ and then Σ to the filter and the first four rows of the input matrix yields the value that corresponds to the first 4-gram of the input sentence.
• The stride is one: the filter is shifted down by one row at a time, yielding the values that correspond to the second, third and fourth 4-gram of the input sentence.
• These values, passed through a non-linearity, form the feature map.
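A minimal numpy sketch (illustration only, not code from the talk) of this computation for a single filter; the embeddings and filter weights below are random placeholders:

```python
# One filter sliding over the input matrix with stride one: Hadamard product
# with each 4-row window, sum of the elements, then a non-linearity,
# giving the feature map.
import numpy as np

sentence = ["I", "like", "this", "movie", "very", "much", "!"]
emb_dim, m = 5, 4                                  # toy embedding size, region size 4
rng = np.random.default_rng(0)
X = rng.normal(size=(len(sentence), emb_dim))      # input matrix: one row per word
W = rng.normal(size=(m, emb_dim))                  # filter: region size x filter width

feature_map = np.array([
    np.maximum(np.sum(X[i:i + m] * W), 0.0)        # ⊗ then Σ, then a ReLU non-linearity
    for i in range(len(sentence) - m + 1)          # narrow convolution: s - m + 1 values
])
print(feature_map.shape)                           # (4,) - one value per 4-gram
```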
One-channel convolutional neural network used in Kim (2014). Figure taken from Zhang et al. (2015).
Properties of one-layer CNN
• CNN handles input sequences of varying length
• CNN does not depend on external language-specific features such as dependency or constituent parse trees
• CNN is sensitive to the order of the words in the sentence
• Filters serve as feature detectors
• Convolving the same filter with the n-gram at every position in the sentence allows the features to be extracted independently of their position in the sentence
CNN for MSC
MSC as a sentence classification task with a fixed sense inventory.
Experimental setup: data
Corpora
• MPQAE (R&R)
• EPOSE and EPOSG: extracted from EuroParl & OpenSubtitles (EPOS) and heuristically tagged via cross-lingual sense projection; in case of rare extractions for German, additional data from modal verbs with shared senses was added
• MASCE: manually annotated subset of the multi-genre corpus MASC
• TESTG: manually annotated instances from EPOSG
Experimental setup: continued
Input representation: tuned and static versions of the following word vectors
• randomly initialized
• word2vec (Mikolov et al.)
• dependency-based (Levy et al.)
Hyperparameters (Zhang, 2015):
• non-linearity: ReLU
• filter region sizes: 3, 4, 5
• number of filters per region size: 100
• dropout keep probability: 0.5
• l2 regularization coefficient: 10⁻³
• number of iterations: 1001
• mini-batch size: 50
• optimizer: Adam optimization algorithm with learning rate 10⁻⁴
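As an illustration of how such a model could be wired up with the hyperparameters above, here is a rough PyTorch sketch (the talk does not include code; the vocabulary size, embedding dimension, number of senses per modal verb, and the use of weight decay as a stand-in for the l2 coefficient are assumptions made for this example):

```python
# A sketch of a Kim (2014)-style one-layer CNN for modal sense classification.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalSenseCNN(nn.Module):
    def __init__(self, vocab_size=10166, emb_dim=300, num_senses=3,
                 region_sizes=(3, 4, 5), num_filters=100, dropout_keep=0.5,
                 static=False):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # "static" word vectors are frozen; "tuned" ones are updated during training
        self.embedding.weight.requires_grad = not static
        # one 2-D convolution per filter region size (3, 4, 5), 100 filters each
        self.convs = nn.ModuleList(
            [nn.Conv2d(1, num_filters, (m, emb_dim)) for m in region_sizes])
        self.dropout = nn.Dropout(1.0 - dropout_keep)
        self.fc = nn.Linear(num_filters * len(region_sizes), num_senses)

    def forward(self, token_ids):                      # (batch, sent_len), padded to >= 5
        x = self.embedding(token_ids).unsqueeze(1)     # (batch, 1, sent_len, emb_dim)
        # ReLU feature maps, then max-over-time pooling for each filter
        pooled = [F.relu(conv(x)).squeeze(3).max(dim=2).values for conv in self.convs]
        return self.fc(self.dropout(torch.cat(pooled, dim=1)))

model = ModalSenseCNN()
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4, weight_decay=1e-3)
```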
Impact of word vectors (E)
• train dataset: balanced 80% MPQA (R&R) + EPOSE
• test dataset: (unbalanced) 20% MPQA
• accuracy with 5-fold CV
Comparison of CNN and baselines (E)
Classifiers trained on the balanced dataset. For every modal verb the best word vectors for it are used.

            can (3)   could (3)   may (2)   must (2)   should (2)   micro
BLrandom    33.33     33.33       50.00     50.00      50.00        41.49
MaxEnt      59.64     61.25       92.14     87.60      90.11        74.88
NN          56.01     55.42       90.00     75.42      88.68        69.74
CNN         65.78     67.50       93.57     93.82      90.77        79.29

• train dataset: balanced 80% MPQA (R&R) + EPOSE
• test dataset: (unbalanced) 20% MPQA
• accuracy with 5-fold CV
Comparison of CNN and baselines (E)
Classifiers trained on the unbalanced dataset. For every modal verb the best word vectors for it are used.

            can (3)   could (3)   may (2)   must (2)   should (2)   micro
BLmajority  69.92     65.00       93.57     94.32      90.81        80.18
MaxEnt      64.76     63.33       92.14     92.78      91.48        78.01
NN          67.29     66.08       94.23     86.37      90.96        77.93
CNN         70.87     66.55       93.49     94.97      90.59        80.74

• train dataset: unbalanced 80% MPQA (R&R) + EPOSE
• test dataset: (unbalanced) 20% MPQA
• accuracy with 5-fold CV
Impact of word vectors (G)
• train dataset: balanced EPOSG
• test dataset: TESTG
• accuracy on the test dataset
• 1772 of the 10166 words in the vocabulary don’t have a pre-trained word2vec vector
• 2087 of the 10166 words in the vocabulary don’t have a pre-trained dependency-based vector
Comparison of CNN and baselines (G)

           dürfen   können   müssen   sollen   micro
BLrandom   50.00    33.33    50.00    50.00    39.10
NN         80.30    48.89    74.63    49.75    60.00
CNN        99.49    81.78    88.06    76.62    86.02

• train dataset: balanced EPOSG
• test dataset: TESTG
• accuracy on the test dataset
Visualizing what filters have learned
• convolve a trained filter with the first, second, …, n-th sentence and record the max value of the resulting feature map for each sentence
⇒ take the top 15 sentences w.r.t. the max value
⇒ extract from each sentence the n-gram corresponding to the max value
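A minimal sketch (an assumed helper, not the author's code) of this inspection procedure for one trained filter; `embed` and the filter weights are placeholders, and each sentence is assumed to have at least m tokens:

```python
import numpy as np

def top_ngrams_for_filter(sentences, embed, filt, top_k=15):
    """sentences: list of token lists; embed(token) -> word vector;
    filt: (m, emb_dim) weights of one trained filter."""
    m = filt.shape[0]
    scored = []
    for tokens in sentences:
        X = np.stack([embed(t) for t in tokens])          # (sent_len, emb_dim)
        # narrow convolution: one value per m-gram (Hadamard product + sum)
        fmap = np.array([np.sum(X[i:i + m] * filt)
                         for i in range(len(tokens) - m + 1)])
        best = int(np.argmax(fmap))
        scored.append((fmap[best], tokens[best:best + m])) # max value and its m-gram
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]                                  # top-k sentences' m-grams
```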
Top 15 5-grams with respect to one filter, illustrated in the embedded space.
Feature detectors for must

feature                               sense   example
past reading of the emb. verb         ep      you must have been out last night
non-past reading of the emb. verb     de      we must take further efforts
stative reading of the emb. verb      ep      you must think me a perfect tool
eventive reading of the emb. verb     de      we must develop a policy
passive construction                  de      actual steps must be taken
negation                              de      we must not fear
domain-specific vocabulary            de      European parliament, present regulation, fisheries policy
telic clauses                         de      to address these problems, to prevent both forum, to exert maximum influence
discourse markers                     de      but, and (then)
Feature detectors for müssen and können

feature                                                       sense   example
features that relate to observations on English:
attitude predicates                                           ep      believe, not know, tell me, have an idea, be afraid
adverbials                                                    ep      possibly
conditionals                                                  ep      if
counterfactual and negative polarity context                  ep      not be the case, how, ever
placeholders for propositions                                 ep      it
abstract concepts                                             ep      idea, music, grades, application
indefinite subjects                                           ep      one
3rd person pronouns                                           ep      -
verb-object combinations for an action that can be granted    de      use telephone
achievements (können only)                                    dy      present report
Other observations
Statistics
1) average distance of top n-grams from the modal verb
2) average distance of top n-grams that are to the left of the modal verb
3) as 2), but for n-grams to the right and n-grams starting with the modal
Observations
• there are no greater overall distances for German compared to English
• for German, considerably more n-grams that include the modal verb, especially for epistemic readings of können, müssen, dürfen, but not for sollen
• strikingly larger distances to the left of the modal verb for epistemic readings compared to non-epistemic readings
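A minimal sketch (an assumed helper, not from the talk) of statistic 1) above: the average distance of a filter's top n-grams from the modal verb, counting an n-gram that spans the modal as distance 0.

```python
def avg_distance_to_modal(top_ngram_spans, modal_positions):
    """top_ngram_spans: (start, end) token indices of the top n-gram per sentence;
    modal_positions: token index of the modal verb in the same sentence."""
    distances = []
    for (start, end), modal in zip(top_ngram_spans, modal_positions):
        if start <= modal <= end:          # n-gram includes the modal verb
            distances.append(0)
        elif end < modal:                  # n-gram lies to the left of the modal
            distances.append(modal - end)
        else:                              # n-gram lies to the right of the modal
            distances.append(start - modal)
    return sum(distances) / len(distances)
```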
Recap
• novel approach for multilingual MSC using a one-layer CNN
• CNN approach outperforms feature-based baselines
• CNN is able to learn meaningful structure from data
• CNN learns both known and previously unattested linguistic features for MSC and domain-specific concepts
• CNN learns linguistic and semantic features from flexible window regions without syntactic pre-processing
• CNN is easily adaptable to novel languages
• CNN allows for insightful model inspection, but this requires manual work
Word sense disambiguation
• if the features the CNN picks relate to semantic factors
  → the CNN should be a good candidate for WSD
• the features the CNN picks relate to n-grams independently of their position in the sentence
  → the CNN is flexible
  → can wider context be useful for WSD?
Comparison with the results from Rothe and Schütze (2015)

SensEval-3
surrounding word     65.30
local collocation    64.70      IMS (state of the art)     72.30
Snaive - product     62.20      IMS + Snaive - product     69.40
S - cosine           60.50      IMS + S - cosine           72.40
S - product          64.30      IMS + S - product          73.60
S - raw              63.10      IMS + S - raw              66.80
CNN                  67.90      IMS + CNN                  72.00
AutoExtend
• for the sentence representation all constituent words are available
• rich knowledge about the target word
CNN
• for the sentence representation all constituent words are available
• without any knowledge of the target word
• flexibility (wider context) in picking relevant n-grams
Future work
Future work on WSD
• tune hyperparameters
• use more data
• use a deeper network
Future work in general
• extraction of opinion entities: opinion expressions, their holders and targets
• implicit sentiment, where MSC plays a role:
  • planned (positive): should, must + deontic
  • apprehended (negative): should not + deontic
  • disliked or forbidden (negative): may not + deontic
  • desired (positive): should + deontic
Thank you for your attention!
References
• N. Kalchbrenner et al. “A Convolutional Neural Network for Modelling Sentences.” ACL (2014).
• Y. Kim. “Convolutional Neural Networks for Sentence Classification.” EMNLP (2014).
• O. Levy and Y. Goldberg. “Dependency-Based Word Embeddings.” ACL (2014).
• T. Mikolov et al. “Distributed Representations of Words and Phrases and their Compositionality.” Advances in Neural Information Processing Systems 26 (2013).
• S. Rothe and H. Schütze. “AutoExtend: Extending Word Embeddings to Embeddings for Synsets and Lexemes.” ACL (2015).
• J. Ruppenhofer and I. Rehbein. “Yes We Can!? Annotating the Senses of English Modal Verbs.” In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC), pp. 24-26 (2012).
• Y. Zhang and B. Wallace. “A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification.” CoRR abs/1510.03820 (2015).
• M. Zhou, A. Frank, A. Friedrich and A. Palmer. “Semantically Enriched Models for Modal Sense Classification.” In Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics (LSDSem), p. 44 (2015).
Narrow and wide convolution
(Figure: the sentence “I like this movie very much !” with PAD rows added above and below, over which the filter also slides in the wide case.)
s = sentence length
m = filter region size
narrow convolution ⇒ feature map size equals s - m + 1
wide convolution ⇒ feature map size equals s + m - 1
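A small illustration (an assumed example, 1-D for simplicity) of the two feature-map sizes above: "narrow" corresponds to numpy's 'valid' convolution mode, "wide" to its 'full' mode, which is equivalent to padding with m - 1 zeros on each side, like the PAD rows in the figure.

```python
import numpy as np

s, m = 7, 4                        # "I like this movie very much !", region size 4
x = np.random.randn(s)             # one row of the input matrix (one embedding dimension)
w = np.random.randn(m)             # the corresponding row of a filter

narrow = np.convolve(x, w, mode='valid')
wide = np.convolve(x, w, mode='full')
assert narrow.size == s - m + 1    # 4 values
assert wide.size == s + m - 1      # 10 values
```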
Will a one-layer CNN be sufficient?
Soldiers can drink alcohol until they fall over. (dynamic-capability)
Soldiers can drink alcohol at late hours only. (deontic-permission)
• a filter that was trained to capture the occurrence of phrases like the unigram “soldiers”
• a filter that was trained to capture the occurrence of phrases like the bigram “drink alcohol”
• a filter that was trained to capture the occurrence of phrases like the 4-gram “at late hours only”
How, then, does this miracle come about?
Dynamic Convolutional Neural Network (DCNN)
Kalchbrenner et al. (2014)
• wide type of convolution
• one-dimensional filter applied to each row of the input matrix
• stacked convolutional layers
• k-max pooling
  • k is a function of the length of the sentence and the depth of the network
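A minimal sketch (assumed, per feature-map row) of k-max pooling as used in the DCNN: keep the k largest values of a feature map while preserving their original order in the sentence.

```python
import numpy as np

def k_max_pooling(feature_map, k):
    """feature_map: 1-D array of convolution outputs for one filter row."""
    if feature_map.size <= k:
        return feature_map
    # indices of the k largest values, sorted back into sentence order
    top_k_idx = np.sort(np.argpartition(feature_map, -k)[-k:])
    return feature_map[top_k_idx]

fmap = np.array([0.1, 0.9, 0.3, 0.7, 0.2, 0.8])
print(k_max_pooling(fmap, k=3))   # [0.9 0.7 0.8] - top-3 values, original order kept
```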
Train-test configurations

          train                                       test
English   80% MPQAE (R&R) + EPOSE, +/- balancing      a. 20% MPQAE (R&R) w/ 5-fold CV
                                                      b. MASCE
German    EPOSG                                       TESTG
Impact of word vectors (E)

              can (3)   could (3)   may (2)   must (2)   should (2)
w2v-static    65.02     51.67       93.57     93.82      90.77
w2v-tuned     63.73     54.17       93.57     93.82      90.77
deps-static   65.78     56.67       93.57     93.82      90.77
deps-tuned    59.89     67.50       93.57     93.29      90.42
rand-static   63.99     46.67       93.57     92.79      90.77
rand-tuned    64.50     48.33       93.57     92.79      90.77

• train dataset: balanced 80% MPQA (R&R) + EPOSE
• test dataset: (unbalanced) 20% MPQA
• accuracy with 5-fold CV
MASC
Balanced vs. unbalanced training when evaluated on MPQAE and MASCE for CNN and MaxEnt:

          test dataset   BA (balanced training)   UBA (unbalanced training)
MaxEnt    MPQAE          74.88                    78.01
          MASCE          3/19                     15/19
CNN       MPQAE          79.92                    80.74
          MASCE          13/19                    3/19

Difference between CNN and MaxEnt trained on MPQAE + EPOS and evaluated on MASC:

          training data        CNN       MaxEnt
          balanced (BA)        19/19     0/19
          unbalanced (UBA)     12/19     7/19
Appendix: MASC (figures)
Comparison of CNN and baselines (G): table repeated in the appendix, with figures for CNN (German) and NN (German).
• train dataset: balanced EPOSG
• test dataset: TESTG
• accuracy on the test dataset
Appendix: number of instances
Appendix: number of instances (G)

train dataset: balanced EPOSG           test dataset: TESTG
          ep      de      dy                      ep     de     dy
dürfen    1000    1000    0             dürfen    98     100    0
können    1000    1000    1000          können    100    47     100
müssen    1000    1000    0             müssen    34     100    0
sollen    1000    1000    0             sollen    101    100    0