Detection of Verbal Multi-Word Expressions via Conditional Random Fields with Syntactic Dependency Features and Semantic Re-Ranking @DLSS

Detection of Verbal Multi-Word Expressions via
Conditional Random Fields with Syntactic Dependency
Features and Semantic Re-Ranking
Alfredo Maldonado, Lifeng Han, Erwan Moreau, Ashjan Alsulaimani, Koel
Dutta Chowdhury, Carl Vogel and Qun Liu
ADAPT Research Centre, Dublin, Ireland
lifeng.han@adaptcentre.ie
alsulaia@tcd.ie
koel.chowdhury@adaptcentre.ie
19 July 2017, DLSS @ Bilbao, Spain
The ADAPT Centre is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.

www.adaptcentre.ie
Lifeng Han (Aaron 亞倫倫)
2016.12-present, PhD student at the ADAPT Centre @ DCU
2016.10-11, Research Assistant at the ADAPT Centre, Dublin
2016.03-2016.07, Guest researcher at the Univ. of Amsterdam
2014.09-2016.02, Researcher/employee at the Univ. of Amsterdam
2014.08, MSc in Software Engineering and BSc in Maths
2011-2014, RA and student in the NLP2CT lab, Univ. of Macau
Code: [github.com/poethan] (poethan = poet han :D)
Talk slides: [github.com/poethan/slides]
Network: [linkedin.com/in/aaronhan/]

Ashjan Alsulaimani
PhD student: TCD, Dublin
Master's: The University of Edinburgh

Koel Dutta Chowdhury
February 2017 - present: PhD student, ADAPT Centre, Dublin City University, Dublin
October 2016 - February 2017: Research Assistant, ADAPT Centre, Dublin City University, Dublin
June 2016 - September 2016: Research Scholar, LTRC lab, IIIT-Hyderabad, India
August 2015 - June 2015: Research Scholar, Indian Statistical Institute, Kolkata, India

Content
About the ADAPT Centre, Ireland
- general
- groups / topics
- activities
- cooperation
My journey in NLP
- Machine Translation (MT) Evaluation Models
- Quality Estimation (QE) Models
- Multiword Expressions (MWE shared task and follow-ups)
- DL4MT (PhD)
- other work (CWS, NER, Treebanks)

ADAPT - general
http://adaptcentre.ie/
It is a joint research centre of four universities: DCU, TCD, UCD and DIT.
Located in Dublin, with ADAPT labs at DCU and TCD.
Former name: CNGL. The research continues; some people have left while more are joining.
Funding is applied for by PIs from the different universities.
Funding comes from Science Foundation Ireland and the EU.

www.adaptcentre.ie/about

ADAPT - groups/topics
Broad research topics:
“ADAPT research is spearheading the development of next-generation digital
technologies that enable seamless tech-mediated interaction and
communication. The breadth of ADAPT's research expertise is unique globally
and the Centre's structure supports collaborative innovation with industry to
unlock the potential of digital content. ADAPT has attracted over €50million
research funding from Science Foundation Ireland and industry
collaborations”.

Social Media / NLP / Knowledge Management / NN / Digital Content / ML /
Multimedia Content Summary / Sentiment Analysis / Ethics and Privacy / AI /
Image and Video / Personalisation / Search and IR / DL / MT / Multimodal
Interaction / Semantic Web and Linked Data / Virtual and Augmented Reality

www.adaptcentre.ie/research
Research Themes:
Understanding Global Content
Transforming Global Content
Personalising the User Experience
Interacting with Global Content
Managing the Global Conversation

ADAPT - activities
ADAPT has many meetings/gatherings:
Monthly 101 seminar: a different topic each time
Science meeting: every two months
ADAPT industrial showcase
Professional social meetups: Dublin ML (host) / NLP meetup
Fun social meetups: ping-pong (German/Spanish winner), etc.
We also join:
- Faculty industrial showcase
- University research open days

ADAPT - cooperation
ADAPT cooperates with a number of partners (research/industrial projects, internships, etc.):
Huawei / iFlytek / DID / Microsoft / Iconic / FBD / Intel / LinkedIn / IBM / Accenture / eBay, etc.
http://www.adaptcentre.ie/industry
Cooperation is always welcome.

My journey in NLP
MT / MTE / QE / DL4MT / NER / CWS / Treebanks
- full slides: [github.com/poethan/slides]

MWE
MWE (Multi-Word Expression) detection task:
Task intro: Verbal MWE (VMWE)
Proposed models
Performance
Thanks to Dr. Alfredo Maldonado for the slides of the MWE section

Detection of Verbal Multi-Word Expressions via Conditional Random Fields with
Syntactic Dependency Features and Semantic Re-Ranking
Alfredo Maldonado, Lifeng Han, Erwan Moreau, Ashjan Alsulaimani, Koel Dutta Chowdhury, Carl
Vogel and Qun Liu
The 13th Workshop on Multiword Expressions (MWE 2017) @ EACL2017.
Download: Link-paper / Video / slides / [bibTex]
Abstract: A description of a system for identifying Verbal Multi-Word Expressions (VMWEs) in running text is
presented. The system mainly exploits universal syntactic dependency features through a Conditional Random Fields
(CRF) sequence model. The system competed in the Closed Track at the PARSEME VMWE Shared Task 2017,
ranking 2nd place in most languages on full VMWE-based evaluation and 1st in three languages on token-based
evaluation. In addition, this paper presents an option to re-rank the 10 best CRF-predicted sequences via semantic
vectors, boosting its scores above other systems in the competition. We also show that all systems in the competition
would struggle to beat a simple lookup baseline system and argue for a more purpose-specific evaluation scheme.
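
To make "syntactic dependency features" more concrete, here is a minimal sketch of how per-token features for a CRF sequence model could be read off a dependency parse. The token fields (a subset of CoNLL-U columns), the function name `token_features` and the particular feature set are assumptions for illustration only; the slides do not list the features the system actually used.

```python
from typing import Dict, List

# Hypothetical token representation: a subset of CoNLL-U columns.
# "head" is the 1-based index of the syntactic head (0 means root).
Token = Dict[str, object]


def token_features(sentence: List[Token], i: int) -> Dict[str, str]:
    """Build an illustrative feature dict for token i.

    The features (own lemma/POS/relation plus the head's lemma/POS) are an
    assumption, not the feature set of the system described in the slides.
    """
    tok = sentence[i]
    feats = {
        "lemma": str(tok["lemma"]),
        "upos": str(tok["upos"]),
        "deprel": str(tok["deprel"]),
    }
    head = int(tok["head"])
    if head == 0:
        # The root token has no head, so use a placeholder value.
        feats["head_lemma"] = "ROOT"
        feats["head_upos"] = "ROOT"
    else:
        head_tok = sentence[head - 1]
        feats["head_lemma"] = str(head_tok["lemma"])
        feats["head_upos"] = str(head_tok["upos"])
    return feats


# Toy parsed sentence: "She gave up"
sentence = [
    {"form": "She", "lemma": "she", "upos": "PRON", "head": 2, "deprel": "nsubj"},
    {"form": "gave", "lemma": "give", "upos": "VERB", "head": 0, "deprel": "root"},
    {"form": "up", "lemma": "up", "upos": "ADP", "head": 2, "deprel": "compound:prt"},
]
features = [token_features(sentence, i) for i in range(len(sentence))]
```

A list of such dicts, one per token, is the usual input shape for CRF toolkits. The point of the head features is that a particle such as "up" can see the verb it attaches to even when other words intervene.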

VMWE - Shared Task

Short summary

MWE note: Semantic Re-ranking - Erwan Moreau
It is important to distinguish the two main components, A and B:
A) The unsupervised semantic similarity part, which uses Europarl to calculate "semantic features" for a sentence with its expressions tagged. The goal is for these features to help predict whether the tagged expressions are correct (note that a sentence may contain 0, 1 or several expressions). More precisely, the idea is to compute features that indicate whether a candidate expression is a real MWE, by comparing the frequency of, and the semantic similarity between, its individual words and the full expression. It works like this:
1) Extract all the sentences with labelled expressions from the CRF output.
2) For every expression, build pseudo-expressions: one for each individual word in the expression and one for each case of "the expression minus one word". Then, for every pseudo-expression and for the full expression, compute the context vector based on Europarl, i.e. the count of every word that co-occurs with the target expression (or word) within a fixed-size window. The features are the frequencies of each of these pseudo-expressions and the semantic similarity score between each pseudo-expression and the full expression. Originally the goal was to measure compositionality (whether the meanings of the words combine in the expression), but these features probably also capture how often the words appear together, which is an indication of a real expression. An additional set of features compares the current expression to the other 9 candidate expressions.
3) Since the supervised learning part needs a fixed number of features for every instance (an instance being a sentence), we must "summarize": if an expression has N words, the N values are summarized by their min, mean and max. The same is done for the M expressions in the sentence. In training mode we also add the probability found by the CRF as a feature.
Thanks to Dr. Erwan Moreau for the details: erwan.moreau@adaptcentre.ie
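
Steps 2 and 3 above can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's implementation: a three-sentence toy corpus stands in for Europarl, cosine is assumed as the similarity measure, a target "occurs" in a sentence when all of its words are present, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Expr = Tuple[str, ...]


def pseudo_expressions(expr: Sequence[str]) -> List[Expr]:
    """Each single word, plus each 'expression minus one word' (deduplicated)."""
    expr = tuple(expr)
    candidates = [(w,) for w in expr]
    candidates += [expr[:i] + expr[i + 1:] for i in range(len(expr))]
    seen, out = set(), []
    for c in candidates:
        if c and c != expr and c not in seen:
            seen.add(c)
            out.append(c)
    return out


def context_vector(target: Expr, corpus: List[List[str]], window: int = 2):
    """Count words within `window` tokens of any target word.

    Assumption: the target occurs in a sentence if all its words are present.
    Returns the co-occurrence counts and the number of such sentences.
    """
    wanted = set(target)
    vec: Counter = Counter()
    freq = 0
    for sent in corpus:
        positions = [i for i, w in enumerate(sent) if w in wanted]
        if {sent[i] for i in positions} != wanted:
            continue
        freq += 1
        for j, w in enumerate(sent):
            if w not in wanted and any(abs(j - i) <= window for i in positions):
                vec[w] += 1
    return vec, freq


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def summarize(values: Sequence[float]) -> Tuple[float, float, float]:
    """Collapse a variable-length list into a fixed (min, mean, max) triple."""
    if not values:
        return (0.0, 0.0, 0.0)
    return (min(values), sum(values) / len(values), max(values))


def expression_features(expr: Sequence[str], corpus: List[List[str]]) -> Dict[str, float]:
    """Fixed-size feature dict for one candidate expression."""
    full_vec, full_freq = context_vector(tuple(expr), corpus)
    sims, freqs = [], []
    for pe in pseudo_expressions(expr):
        vec, freq = context_vector(pe, corpus)
        sims.append(cosine(vec, full_vec))
        freqs.append(float(freq))
    feats = {"full_freq": float(full_freq)}
    for name, vals in (("sim", sims), ("freq", freqs)):
        lo, mean, hi = summarize(vals)
        feats[name + "_min"], feats[name + "_mean"], feats[name + "_max"] = lo, mean, hi
    return feats


# Toy corpus standing in for Europarl.
corpus = [
    ["he", "kicked", "the", "bucket", "yesterday"],
    ["she", "kicked", "the", "ball"],
    ["the", "bucket", "is", "full"],
]
feats = expression_features(("kicked", "bucket"), corpus)
```

The same `summarize` step would be applied once more across the M expressions of a sentence, so that every sentence yields the same number of features regardless of how many expressions it contains or how long they are.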

B) The supervised regression part (we used Weka decision-tree regression, but other models would certainly work as well), which is fed the features calculated above and predicts a single score in [0,1] representing "how correct" the labelling of the expressions is for a sentence. Here an instance is a sentence with its expressions labelled, and since the CRF part gives us the top 10 labellings for every sentence, we use each of these 10 as one instance. In training mode, we assign score 1 to the gold labelling (if found among the CRF candidates) and 0 to the other (wrong) labellings, the goal being to make the system assign low scores to wrong answers and high scores to good answers. In testing mode, we obtain the predicted scores and, for every sentence, take the labelling that obtained the highest score in the group of 10 candidates. Most of the time the first candidate from the CRF also has the highest score, but sometimes the labelling we select was ranked lower by the CRF; that is when the actual re-ranking happens.
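
The training targets and the test-time selection described in B can be sketched as below. The trained regressor is replaced here by a stand-in scoring function with made-up scores, the tag labels are hypothetical, and breaking ties in favour of the CRF's own ranking is an assumption not stated in the slides.

```python
from typing import Callable, List, Sequence

Labelling = Sequence[str]


def training_targets(candidates: List[Labelling], gold: Labelling) -> List[float]:
    """Score 1 for a candidate identical to the gold labelling, 0 otherwise."""
    return [1.0 if list(c) == list(gold) else 0.0 for c in candidates]


def rerank(candidates: List[Labelling], score: Callable[[Labelling], float]) -> int:
    """Return the index of the highest-scoring candidate.

    Candidates are assumed to be in CRF rank order; ties go to the
    better-ranked (lower-index) candidate, which is an assumption.
    """
    scores = [score(c) for c in candidates]
    return max(range(len(candidates)), key=lambda i: (scores[i], -i))


# Three candidates instead of ten, with hypothetical tags, for "She gave up".
candidates = [
    ["O", "O", "O"],      # CRF's first choice: no expression found
    ["O", "B", "I"],      # "gave up" tagged as one expression
    ["O", "B", "O"],
]
gold = ["O", "B", "I"]

targets = training_targets(candidates, gold)

# Stand-in for the regressor's predictions (illustrative numbers only).
toy_scores = {("O", "O", "O"): 0.4, ("O", "B", "I"): 0.7, ("O", "B", "O"): 0.1}
best = rerank(candidates, lambda c: toy_scores[tuple(c)])
```

In this toy case the selected candidate is the CRF's second choice, which is the situation in which re-ranking changes the output. When the gold labelling is not among the candidates, every training target for that sentence is 0.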

References
Maldonado, A., Han, L., Moreau, E., Alsulaimani, A., Chowdhury, K. D., Vogel, C., & Liu, Q. (2017). Detection of Verbal Multi-Word Expressions via Conditional Random Fields with Syntactic Dependency Features and Semantic Re-Ranking. In Proceedings of The 13th Workshop on Multiword Expressions. Valencia.
Hall, M. et al. (2009). The WEKA data mining software: an update. ACM SIGKDD Explorations, 11(1): 10–18.
Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In Proceedings of the 10th Machine Translation Summit, pages 79–86, Phuket.
Lafferty, J., McCallum, A., & Pereira, F. C. N. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 282–289.
Maldonado, A., & Emms, M. (2011). Measuring the compositionality of collocations via word co-occurrence vectors: Shared task system description. In Proceedings of the Distributional Semantics and Compositionality workshop (DISCo 2011). Portland, OR.
Quinlan, J. R. (1992). Learning with continuous classes. In Proceedings of the 5th Australian Joint Conference on Artificial Intelligence, pages 343–348.
Sag, I. A. et al. (2002). Multiword Expressions: A Pain in the Neck for NLP. Third International Conference on Computational Linguistics and Intelligent Text Processing (Lecture Notes in Computer Science), 2276, 1–15.
Savary, A. et al. (2017). The PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions. In Proceedings of The 13th Workshop on Multiword Expressions. Valencia.
Singleton, D. (2000). Language and the Lexicon: An Introduction. London: Arnold.

Q & A
lifeng.han@adaptcentre.ie
ADAPT Centre, DCU
github.com/poethan
