Detection of Verbal Multi-Word Expressions via
Conditional Random Fields with Syntactic Dependency
Features and Semantic Re-Ranking
Alfredo Maldonado, Lifeng Han, Erwan Moreau, Ashjan Alsulaimani, Koel
Dutta Chowdhury, Carl Vogel and Qun Liu
ADAPT Research Centre, Dublin, Ireland
lifeng.han@adaptcentre.ie
alsulaia@tcd.ie
koel.chowdhury@adaptcentre.ie
2017.July.19th. DLSS @ Bilbao, Spain
The ADAPT Centre is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.
www.adaptcentre.ie
Lifeng Han (Aaron 亞倫倫)
2016.12-on, PhD student in ADAPT Centre @ DCU
2016.10-11, RA researcher in ADAPT Centre, Dublin
2016.03-2016.07, Guest researcher in Uni. of Amsterdam
2014.09-2016.02, researcher/employee in Uni. of Amsterdam
2014.08, MSc in Software Engineering and BSc in Maths
2011-2014, RA and student in NLP2CT lab / Uni. of Macau
Code: [github.com/poethan] poethan = poet han :D
Talk slides: [github.com/poethan/slides]
Network: [linkedin.com/in/aaronhan/]
Ashjan Alsulaimani
PhD student: TCD, Dublin
Master: The University of Edinburgh
Koel Dutta Chowdhury
February 2017 - current: PhD student, ADAPT Centre, Dublin City University, Dublin
October 2016 - February 2017: Research Assistant, ADAPT Centre, Dublin City University, Dublin
June 2016 - September 2016: Research Scholar, LTRC lab, IIIT-Hyderabad, India
August 2015 - June 2015: Research Scholar, Indian Statistical Institute, Kolkata, India
Content
About ADAPT Centre, Ireland
- general
- groups / topics
- activities
- cooperation
My journey in NLP
- Machine Translation (MT) Evaluation Models
- Quality Estimation (QE) Models
- Multiword Expressions (MWE shared task and follow-ups)
- DL4MT (PhD)
- other works (CWS, NER, Treebanks)
ADAPT - general
http://adaptcentre.ie/
A joint research centre of 4 universities: DCU/TCD/UCD/DIT
Located in Dublin (ADAPT DCU/TCD labs)
Former name: CNGL; the research continues, with some people leaving and more joining.
Funding is applied for by PIs from the different universities.
Funding comes from Science Foundation Ireland and the EU.
www.adaptcentre.ie/about
ADAPT - groups/topics
Broad research topics:
“ADAPT research is spearheading the development of next-generation digital
technologies that enable seamless tech-mediated interaction and
communication. The breadth of ADAPT's research expertise is unique globally
and the Centre's structure supports collaborative innovation with industry to
unlock the potential of digital content.  ADAPT has attracted over €50million
research funding from Science Foundation Ireland and industry
collaborations”.

Social Media / NLP / Knowledge Management / NN / Digital Content / ML /
Multimedia Content Summary / Sentiment Analysis / Ethics and Privacy / AI /
Image and Video / Personalisation / Search and IR / DL / MT / Multimodal
Interaction / Semantic Web and Linked Data / Virtual and Augmented Reality
www.adaptcentre.ie/research
Research Themes:
Understanding Global Content
Transforming Global Content
Personalising the User Experience
Interacting with Global Content
Managing the Global Conversation
ADAPT - activities
ADAPT has many meetings/gatherings:
Monthly 101 seminar: a different topic each time
Science meeting: every two months
ADAPT industrial showcase
Social meetups (professional): Dublin ML (host) / NLP meetup
Social meetups (fun): ping-pong (German/Spanish winner), etc.
Also joins:
- Faculty industrial showcase
- University research open days
ADAPT - cooperation
ADAPT has cooperation with industry (research/industrial projects, internships, etc.):
Huawei / iFlytek / DID / Microsoft / Iconic / FBD / Intel
LinkedIn,
IBM,
Accenture,
eBay,
etc.
http://www.adaptcentre.ie/industry
Cooperation is always welcome
My journey in NLP
MT / MTE / QE / DL4MT / NER / CWS / Treebanks
- full slides: [github.com/poethan/slides ]
MWE
MWE (Multi-Word Expression) detection task:
Task intro: Verbal MWE (VMWE)
Proposed models
Performance
Thanks to Dr. Alfredo Maldonado for the slides of the MWE section
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with
Syntactic Dependency Features and Semantic Re-Ranking
Alfredo Maldonado, Lifeng Han, Erwan Moreau, Ashjan Alsulaimani, Koel Dutta Chowdhury, Carl
Vogel and Qun Liu
The 13th Workshop on Multiword Expressions (MWE 2017) @ EACL 2017.
Download: paper / video / slides / BibTeX
Abstract: A description of a system for identifying Verbal Multi-Word Expressions (VMWEs) in running text is
presented. The system mainly exploits universal syntactic dependency features through a Conditional Random Fields
(CRF) sequence model. The system competed in the Closed Track at the PARSEME VMWE Shared Task 2017,
ranking 2nd place in most languages on full VMWE-based evaluation and 1st in three languages on token-based
evaluation. In addition, this paper presents an option to re-rank the 10 best CRF-predicted sequences via semantic
vectors, boosting its scores above other systems in the competition. We also show that all systems in the competition
would struggle to beat a simple lookup baseline system and argue for a more purpose-specific evaluation scheme.
Intro
VMWE - Shared Task
Task
Approaches
Experiments
Short summary
a bit fun
MWE note: Semantic Reranking - Erwan Moreau
It's important to distinguish the two main components, A and B below:
A) the unsupervised semantic similarity part, which uses Europarl to calculate "semantic features" for a sentence
with expressions tagged. The goal is that these features help predict whether the tagged expressions are correct
or not (note that a sentence may contain 0, 1 or several expressions). More precisely, the idea is to compute
features which represent whether a candidate expression is a real MWE, by comparing frequency and semantic
similarity between its individual words and the full expression. It works like this:
1) Extract all the sentences with expressions labelled from the CRF output.
2) For every expression, we build pseudo-expressions for each individual word in the expression as well as for
each case of "the expression minus one word". Then for every pseudo-expression and for the full expression we
compute the context vector based on Europarl, i.e. the count of every word which co-occurs with the target
expression (or word) within a fixed-size window. In the features we use the frequencies for each of these
pseudo-expressions, as well as the semantic similarity score between each pseudo-expression and the full expression.
Originally the goal was to measure compositionality (whether the meanings of the words are combined together in
the expression); but these features probably also capture how often the words appear together, which is an
indication of a real expression. There is an additional set of features which consists of comparing the current
expression to the other 9 candidate expressions.
3) Since we need a fixed number of features for every instance = sentence (for the supervised learning part), we
must "summarize": if an expression has N words, the N values are "summarized" with the min, mean and max.
Same thing for the M expressions in the sentence. In training mode we also add the probability found by the CRF
as a feature.
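Steps 2 and 3 can be sketched in Python as below. This is only a minimal illustration, not the system's actual code: the function names, the toy corpus standing in for Europarl, the window size of 2, the use of cosine as the similarity score, and the rule that a multi-word target "occurs" in any sentence containing all of its words are all assumptions made for the example.

```python
from collections import Counter
from math import sqrt
from statistics import mean


def pseudo_expressions(expr):
    """Each single word plus each 'expression minus one word' variant (deduplicated)."""
    words = tuple(expr)
    singles = [(w,) for w in words]
    minus_one = [words[:i] + words[i + 1:] for i in range(len(words))]
    return list(dict.fromkeys(singles + minus_one))


def context_vector(target, corpus, window=2):
    """Count the words co-occurring with `target` within `window` tokens.

    Assumption: the target occurs in a sentence when all of its words are present.
    Returns (context counts, number of sentences containing the target).
    """
    target = set(target)
    counts, freq = Counter(), 0
    for sent in corpus:
        positions = [i for i, tok in enumerate(sent) if tok in target]
        if {sent[i] for i in positions} != target:
            continue  # at least one target word is missing from this sentence
        freq += 1
        near = set()
        for p in positions:
            near.update(range(max(0, p - window), min(len(sent), p + window + 1)))
        counts.update(sent[i] for i in sorted(near) if sent[i] not in target)
    return counts, freq


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def summarize(values):
    """Collapse a variable-length list into a fixed number of values."""
    return {"min": min(values), "mean": mean(values), "max": max(values)}


def expression_features(expr, corpus, window=2):
    """Fixed-size feature dict for one candidate expression."""
    full_vec, full_freq = context_vector(expr, corpus, window)
    sims, freqs = [], []
    for pseudo in pseudo_expressions(expr):
        vec, freq = context_vector(pseudo, corpus, window)
        sims.append(cosine(vec, full_vec))
        freqs.append(freq)
    feats = {"full_freq": full_freq}
    feats.update({f"sim_{k}": v for k, v in summarize(sims).items()})
    feats.update({f"freq_{k}": v for k, v in summarize(freqs).items()})
    return feats


# Toy stand-in for Europarl (hypothetical data).
toy_corpus = [
    ["the", "meeting", "will", "take", "place", "tomorrow"],
    ["we", "take", "the", "bus"],
    ["this", "place", "is", "nice"],
]
feats = expression_features(("take", "place"), toy_corpus)
print(feats)
```

The same `summarize` helper could be applied a second time across the M expressions of a sentence to obtain one fixed-length vector per sentence.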
Thanks to Dr. Erwan Moreau for the details: erwan.moreau@adaptcentre.ie
B) the supervised regression part (we used Weka decision tree regression, but other models would certainly
work as well), which is fed with the features calculated in part A and predicts a single score in [0,1]
which represents "how correct" the labelling of the expressions is for a sentence. Here an instance is a sentence
with its expressions labelled, and since for every sentence the CRF part gives us the top 10 labellings, we use
each of these 10 as one instance. In training mode, we assign score 1 to the gold labelling (if found among the
CRF candidates) and 0 to the other (wrong) labellings (the goal being to make the system assign low scores to wrong
answers and high scores to good answers). In testing mode, we obtain the predicted scores and for every
sentence we take the labelling which obtained the highest score in the group of 10 candidates. Most of the time the
first from the CRF also has the highest score, but sometimes the labelling we select was ranked after the first: that is
when the proper re-ranking happens.
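The training targets and the test-time selection described in part B can be sketched as follows. This is an illustration under assumptions: the tag strings, the function names and the toy score table are hypothetical, and the trained regressor (Weka in our case) is replaced by any callable that maps a candidate labelling to a score.

```python
def training_targets(candidates, gold):
    """Score 1.0 for the candidate identical to the gold labelling, 0.0 for the others."""
    return [1.0 if cand == gold else 0.0 for cand in candidates]


def rerank(candidates, score_fn):
    """Return (index, labelling) of the candidate with the highest predicted score.

    Candidates are assumed to be in CRF rank order; on ties the earlier
    (higher-ranked) candidate is kept, because max() returns the first maximum.
    """
    scores = [score_fn(cand) for cand in candidates]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return best, candidates[best]


# Three of the n-best labellings for one 3-token sentence (hypothetical tags).
candidates = [
    ("_", "_", "_"),        # CRF rank 1
    ("_", "MWE", "MWE"),    # CRF rank 2
    ("MWE", "MWE", "_"),    # CRF rank 3
]
gold = ("_", "MWE", "MWE")

# Training mode: regression targets for the candidates.
print(training_targets(candidates, gold))

# Testing mode: a toy score table stands in for the trained regressor.
toy_scores = {candidates[0]: 0.4, candidates[1]: 0.7, candidates[2]: 0.1}
best_index, best_labelling = rerank(candidates, toy_scores.get)
print(best_index, best_labelling)
```

In this toy run the selected labelling is the one the CRF ranked second, which is the case where re-ranking changes the output.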
Draw steps
References
Maldonado, A., Han, L., Moreau, E., Alsulaimani, A., Chowdhury, K. D., Vogel, C., & Liu, Q. (2017).
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with Syntactic
Dependency Features and Semantic Re-Ranking. In Proceedings of The 13th Workshop on Multiword
Expressions. Valencia.
Hall, M. et al. (2009). The WEKA data mining software: an update. ACM SIGKDD Explorations, 11(1):
10–18.
Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In Proceedings of the
10th Machine Translation Summit, pages 79–86, Phuket.
Lafferty, J., McCallum, A., & Pereira, F. C. N. (2001). Conditional random fields: Probabilistic models for
segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference
on Machine Learning. pp. 282–289.
Maldonado, A., & Emms, M. (2011). Measuring the compositionality of collocations via word
co-occurrence vectors: Shared task system description. In Proceedings of the Distributional Semantics
and Compositionality workshop (DISCo 2011). Portland, OR.
Quinlan, J.R. (1992). Learning with continuous classes. In Proceedings of the 5th Australian Joint
Conference on Artificial Intelligence, pages 343–348.
Sag, I. A. et al. (2002). Multiword Expressions: A Pain in the Neck for NLP. Third International
Conference on Computational Linguistics and Intelligent Text Processing (Lecture Notes in Computer
Science), 2276, 1–15.
Savary, A. et al. (2017). The PARSEME Shared Task on Automatic Identification of Verbal Multiword
Expressions. In Proceedings of The 13th Workshop on Multiword Expressions. Valencia.
Singleton, D. (2000). Language and the Lexicon: An Introduction. London: Arnold.
Q & A
LIFENG.HAN@adaptcentre.ie
ADAPT Centre, DCU
github.com/poethan
