MultiMWE: Building a Multi-lingual
Multi-Word Expression (MWE)
Parallel Corpora
Lifeng Han, Gareth Jones and Alan Smeaton
ADAPT research seminar, 2020 March

Accepted long paper in 12th Edition of Language Resources and
Evaluation Conference (LREC2020)
Outline
• Multiword Expressions (MWEs) in ADAPT 

• - started earlier, EACL-MWE2017 shared task by TCD+DCU

• Background of MWEs

• - intro MWE, tasks in NLP, corpus, with Machine Translation (MT)

• MultiMWE corpus

• - how we made it (workflow) 

• MultiMWE corpus applications

• - in MT, but it can go further in Multilingual/Crosslingual NLP tasks
vMWEs in ADAPT
• “Detection of Verbal Multi-Word Expressions via Conditional Random Fields with Syntactic
Dependency Features and Semantic Re-Ranking”. 2017. Alfredo Maldonado, Lifeng Han, Erwan
Moreau, Ashjan Alsulaimani, Koel Dutta Chowdhury, Carl Vogel, Qun Liu. The 13th Workshop on Multiword
Expressions @ EACL 2017

• - exploits universal syntactic dependency features through a Conditional Random
Fields (CRF) sequence model

• - Closed Track at the PARSEME VMWE Shared Task 2017, ranking 2nd place in most
languages on full VMWE-based evaluation and 1st in three languages on token-based
evaluation (Polish, Swedish, French)

• - an option to re-rank the 10 best CRF-predicted sequences via semantic vectors,
boosting its scores above other systems in the competition.

• Chapter: “Semantic reranking of CRF label sequences for verbal multiword expression
identification”. 2018. Erwan Moreau, Ashjan Alsulaimani, Alfredo Maldonado, Lifeng Han, Carl Vogel, Koel Dutta
Chowdhury. In Stella Markantonatou, Carlos Ramisch, Agata Savary & Veronika Vincze (eds.), Multiword
expressions at length and in depth: Extended papers from the MWE 2017 workshop, 177–207. Berlin: Language
Science Press. DOI:10.5281/zenodo.1469559
Definition
• An MWE is a term made up of several words that express
a specific concept, which can be decomposed, and
the words combined together as an MWE are syntactically,
semantically, pragmatically, or statistically idiosyncratic in
nature. {Sag2002MWE,Tim2010mwe,huning2015MWEs}

• MWE (Sag2002): 

• lexical phrases: fixed or semi-fixed expressions and
syntactically-flexible expressions

• institutionalized phrases
Examples
• Categories of MWE: 

• idioms (…kick the bucket)

• Metaphor (…apple of my eye) 

• compound nouns (tooth + paste -> toothpaste)

• word combinations across various part-of-speech sets: verb-particle, adjective-verb (dry cleaning),
preposition-verb (output), adjective-noun (full moon), preposition-noun (underground), etc.

• …

• Appearance of MWE: 

• Continuous (kick off, football magazine) 

• Discontinuous (pick someone up)
ref: https://www.learnenglish.de/grammar/nouncompound.html
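The difference between continuous and discontinuous MWEs can be made concrete with a small matcher. This is only an illustrative sketch: the function name `find_mwe` and the `max_gap` parameter are hypothetical and do not come from any toolkit mentioned in these slides.

```python
from typing import List, Optional, Tuple


def find_mwe(tokens: List[str], mwe: List[str], max_gap: int = 0) -> Optional[Tuple[int, ...]]:
    """Return the token indices of the first match of `mwe`, or None.

    max_gap is the number of extra tokens allowed between two consecutive
    MWE components; 0 means the expression must be continuous.
    """
    def search(start: int, k: int, picked: Tuple[int, ...]) -> Optional[Tuple[int, ...]]:
        if k == len(mwe):
            return picked
        # The first component may start anywhere; later components must
        # fall inside the gap window after the previous component.
        end = len(tokens) if k == 0 else min(len(tokens), start + max_gap + 1)
        for i in range(start, end):
            if tokens[i] == mwe[k]:
                found = search(i + 1, k + 1, picked + (i,))
                if found is not None:
                    return found
        return None

    return search(0, 0, ())


sentence = "I will pick someone up".split()
print(find_mwe(sentence, ["pick", "up"]))             # continuous only: no match
print(find_mwe(sentence, ["pick", "up"], max_gap=1))  # one intervening token allowed
```

With `max_gap=0` only continuous occurrences such as "kick off" are found; allowing a gap recovers "pick someone up".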
MWE events
• MWE detection, acquisition, construction, treatment, disambiguation

• MWE interaction with other NLP tasks e.g. MT:

• EUROPHRAS 4th Workshop on Multi-word Units in Machine Translation
and Translation Technology (MUMTTT 2019) – 27 September 2019 –
Málaga

• ACL-SIGLEX organized MWE-related events (since 2003, yearly)

• 2003 ACL workshop Multiword Expressions: Analysis, Acquisition and
Treatment – Sapporo, Japan 

• 2020 COLING Joint Workshop on Multiword Expressions and Electronic
Lexicons (MWE-LEX 2020) – September 14, 2020 – Barcelona, Spain.
Joint topics on MWEs and e-lexicons:
• Extracting and enriching MWE lists from traditional human-readable lexicons for NLP use
• Formats for NLP-applicable MWE lexicons
• Interlinking MWE lexicons with other language resources
• Using MWE lexicons in NLP tasks (identification, parsing, translation)
• MWE discovery in the service of lexicography
• Multiword terms in specialised lexicons
• Representing semantic properties of MWEs in lexicons
• Paving the way towards encoding lexical idiosyncrasies in constructions
MWE related corpora
• MWE aware English Dependency corpus 

• from LDC(https://catalog.ldc.upenn.edu/LDC2017T01), 

• - annotated English compound words in the corpus as one kind of MWE

• - to facilitate the constituency and dependency parsing task

• English MWEs in web reviews data (schneider-etal-2014) 

• http://www.cs.cmu.edu/~ark/LexSem/

• hand-annotated online review data with comprehensive MWEs including
English noun, verb, and preposition super-senses (tags include
communication, group, stative, location, possession, etc.).
• These MWE corpus construction efforts are monolingual and cover English only
MWE related corpora
• Multilingual MWE corpus construction from PARSEME

• - The IC1207 COST Action is an interdisciplinary scientific
network devoted to the role of multi-word expressions
(MWEs) in parsing

• - includes 18 European languages

• - corpus focuses on one kind of MWE (verbal MWE)

• - not parallel, and the size of the data varies greatly
from language to language (some languages have only
hundreds of sentences).
MWE interaction with MT
• Báihuà (白話): contemporary Chinese

• 去 (qù) ... 了 (le): a simple discontinuous Chinese MWE used to
express a past-tense action (went to do something, went somewhere)

• The purpose ‘上課 (shangke)’ is not translated.
ZH source:
ZH pinyin: Xiǎo míng qù xué xiào shàng kè le
EN reference: Xiao Ming went to school to attend classes
EN MT output: Xiao Ming went to school
note: the three zh->en examples are from Google MT, 2018/10
MWE interaction with MT
• Poem: Shīgē (詩歌). 花 as an ambiguous word.

• Well-aligned MWEs should help disambiguation. 花 vs 人 (flower - people)
ZH source:
ZH pinyin: Nián nián suì suì huā xiāng sì, suì suì nián nián rén bù tóng.
EN reference: The flowers are similar each year, while people are changing year after year.
EN MT output: One year spent similar, each year is different
MWE interaction with MT
• Wényán (文言): ancient Chinese.

• 安知 as an MWE means ‘…do not know…’, and 安知…哉 means ‘how can … know …?’
ZH source:
ZH pinyin: Yàn què ān zhī hóng hú zhī zhì zāi?
EN reference (literal): How can a finch know the ambition of a big bird (or swan)?
EN MT output: What is the meaning of Yanque Anzhihong?
EN reference: The nonsense folks do not know the ambitions of the very motivated people.
MWE interaction with MT
• Some earlier work on MWE+MT:

• {lambert2005mwe} used bilingual MWE pairs to modify the word alignment procedure of MT on an
English-Spanish corpus, grouping each MWE into one token before training

• {Ren2009mwe} integrated bilingual Chinese-English MWEs into the SMT toolkit Moses

• {Bouamor2012IdentifyingBM} designed models to extract continuous MWEs for French-English
translation

• {Skadina2016MultiwordEI} discussed various MWEs in English-Latvian MT

• {EBRAHIM2017111mwe} focused on phrasal verb MWEs in Arabic-English phrase-based SMT

• MWE+NMT:

• {Li2016neuralname} used character-level sequence-to-sequence modeling to translate named entities
and then integrated this into an overall NMT system for Chinese-to-English translation

• {ugawa2018neural} focused on the difficulty of translating compound words in the source
language, introducing an encoder for the input word at the NE tag level at each time step
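The "group the MWE as one token before training" idea mentioned for {lambert2005mwe} can be sketched as a preprocessing step. This is an assumption-laden illustration, not the cited authors' code: the function name `group_mwes`, the underscore joiner, and the greedy longest-match strategy are all choices made here for the example.

```python
from typing import List, Set, Tuple


def group_mwes(tokens: List[str], mwes: Set[Tuple[str, ...]], joiner: str = "_") -> List[str]:
    """Greedily merge known continuous MWEs into single tokens."""
    max_len = max((len(m) for m in mwes), default=0)
    out: List[str] = []
    i = 0
    while i < len(tokens):
        # Prefer the longest known MWE that starts at position i.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in mwes:
                out.append(joiner.join(tokens[i:i + n]))
                i += n
                break
        else:
            # No MWE starts here: keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


known = {("european", "commission"), ("major", "infrastructure", "projects")}
print(group_mwes("the european commission funds major infrastructure projects".split(), known))
```

After this step, a word aligner or MT system sees each listed MWE as one vocabulary item; discontinuous MWEs are not handled by this sketch.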
MultiMWE corpus
• MWE extraction pipeline from (Rikters and Bojar, 2017)
https://github.com/M4t1ss/MWE-Tools 

• We extend the extraction work to language pairs such as
German-English and Chinese-English

• The MWEs extracted in our experiments are freely available to
MT and NLP researchers

• However, the MWE candidates extracted with this
framework are only of the continuous type. In follow-up work, we
will design patterns or other models to extract
discontinuous MWEs
MultiMWE corpus
• Extract the monolingual MWEs + alignment + filtering
Corpus construction workflow together with corpus flow diagram
MultiMWE corpus
• Some pattern examples used for Chinese monolingual MWE extraction
<pat><w pos="a"/><w pos="a"/><w pos="a"/><w pos="n"/></pat>
<pat><w pos="a"/><w pos="a"/><w pos="a"/><w pos="n"/><w pos="n"/></pat>
<pat><w pos="a"/><w pos="a"/><w pos="a"/><w pos="n"/><w pos="n"/><w pos="n"/></pat>
<pat><w pos="a"/><w pos="a"/><w pos="n"/></pat>
<pat><w pos="a"/><w pos="a"/><w pos="n"/><w pos="a"/><w pos="n"/></pat>
<pat><w pos="a"/><w pos="a"/><w pos="n"/><w pos="n"/></pat>
<pat><w pos="a"/><w pos="a"/><w pos="n"/><w pos="n"/><w pos="n"/></pat>
<pat><w pos="a"/><w pos="a"/><w pos="n"/><w pos="n"/><w pos="n"/><w pos="n"/></pat>
<pat><w pos="a"/><w pos="c"/><w pos="a"/><w pos="n"/></pat>
<pat><w pos="a"/><w pos="c"/><w pos="a"/><w pos="n"/><w pos="n"/></pat>
<pat><w pos="a"/><w pos="c"/><w pos="a"/><w pos="n"/><w pos="n"/><w pos="n"/></pat>
<pat><w pos="a"/><w pos="c"/><w pos="n"/><w pos="n"/></pat>
<pat><w pos="a"/><w pos="c"/><w pos="n"/><w pos="n"/><w pos="n"/></pat>
<pat><w pos="a"/><w pos="n"/></pat>
<pat><w pos="a"/><w pos="n"/><w pos="a"/><w pos="a"/><w pos="n"/></pat>
<pat><w pos="a"/><w pos="n"/><w pos="a"/><w pos="n"/></pat>
<pat><w pos="a"/><w pos="n"/><w pos="a"/><w pos="n"/><w pos="n"/></pat>
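To illustrate how PoS patterns of this shape select candidates, the following sketch parses a pattern line and slides it over a PoS-tagged sentence. It is not how the extraction toolkit itself is implemented; the function names and the tagged example sentence (including the tag `d`) are hypothetical, and the tag letters are used exactly as they appear in the patterns.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def parse_pattern(xml_line: str) -> Tuple[str, ...]:
    """Turn '<pat><w pos="a"/><w pos="n"/></pat>' into ('a', 'n')."""
    return tuple(w.get("pos") for w in ET.fromstring(xml_line).iter("w"))


def extract_candidates(tagged: List[Tuple[str, str]],
                       patterns: List[Tuple[str, ...]]) -> List[str]:
    """Return every continuous word span whose PoS sequence equals a pattern."""
    tags = tuple(pos for _, pos in tagged)
    found = []
    for start in range(len(tagged)):
        for pat in patterns:
            if tags[start:start + len(pat)] == pat:
                words = [word for word, _ in tagged[start:start + len(pat)]]
                found.append(" ".join(words))
    return found


patterns = [
    parse_pattern('<pat><w pos="a"/><w pos="n"/></pat>'),
    parse_pattern('<pat><w pos="a"/><w pos="a"/><w pos="n"/></pat>'),
]
# Hypothetical tagged input; real input would come from a PoS tagger.
tagged = [("very", "d"), ("long", "a"), ("tail", "n")]
print(extract_candidates(tagged, patterns))
```

Because matching is purely on tag sequences, such patterns over-generate, which is why the workflow follows extraction with alignment and filtering.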
MultiMWE corpus
Workflow, taking the de-en language pair as an example:
• Morphological tagging of De and En.
• Convert the tagged De/En text into XML format.
• Design MWE patterns for De/En.
• Extract monolingual MWEs with MWEtoolkit.
• Generate De-En lexicon translation probability files with Giza++ and Moses.
• Align bilingual MWEs with MPAligner.
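The role of the lexicon translation probability files in the alignment step can be illustrated with a toy scoring function. This is explicitly not MPAligner's algorithm: the scoring rule (average of the best word-level probability per source word), the function name `pair_score`, and the probability values are all assumptions made for this sketch.

```python
from typing import Dict, Tuple

# (source word, target word) -> lexical translation probability
LexProbs = Dict[Tuple[str, str], float]


def pair_score(src_mwe: str, tgt_mwe: str, lex: LexProbs) -> float:
    """Average, over source words, of the best probability to any target word."""
    src_words = src_mwe.lower().split()
    tgt_words = tgt_mwe.lower().split()
    if not src_words or not tgt_words:
        return 0.0
    best = [max(lex.get((s, t), 0.0) for t in tgt_words) for s in src_words]
    return sum(best) / len(best)


# Hypothetical probabilities, for illustration only.
lex = {
    ("europäische", "european"): 0.9,
    ("kommission", "commission"): 0.8,
}
print(pair_score("europäische Kommission", "european commission", lex))
print(pair_score("europäische Kommission", "french state", lex))
```

A candidate pair whose words translate each other well receives a high score, while an unrelated pair scores near zero, so a threshold can then filter the aligned list.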
MultiMWE corpus
• Chinese-English MWEs: similar to de-en, except for:

• PoS pattern design.

• Stop-word list preparation.

• Zh-En translation probability files.
MultiMWE corpus
• Added Chinese patterns for MWEs based on the LCMC tags,
in addition to the PoS tags mapped from English.
Lancaster Corpus of Mandarin Chinese https://www.lancaster.ac.uk/fass/projects/corpus/LCMC/
MultiMWE corpus
Chinese English Estimation
cat ears 0.780979
long tail 0.820427
small dustpan 0.856796
artistic works 0.6281
group table 0.708438
computer expert 0.801311
golf club 0.976473
acne products 0.695547
different conditions 0.887839
evergreen plant 0.610852
note: ( )->( ) simplified to traditional
Chinese character, used in paper content
Extracted zh-en pairs (without running; head of the file). Most scores are above 0.6.

簸箕 (Bòji): a non-decomposable MWE

电脑 (Diànnǎo): electricity + brain
MultiMWE corpus
German                           English                        Estimation
europäische Kommission           european commission            0.970964
upcoming events                  upcoming events                1
europäischen Kommission          european commission            0.990844
praktische Informationen         practical information          0.948533
östlichen Teils                  eastern part                   0.793047
private Konzession               private concession             0.921197
französische Staat               french state                   0.853861
europäischen Rat                 european council               0.984224
größeren Infrastrukturprojekten  major infrastructure projects  0.853873
zwischengeschalteten Banken      intermediary banks             0.754617
Pruned de-en pairs with threshold 0.70; examples from the head of the file
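Threshold pruning of scored entries can be sketched in a few lines. The function name `prune`, the tuple layout, and the choice of an inclusive comparison are assumptions for this example; the sample rows reuse English sides and scores shown on the Chinese-English slide (the Chinese sides are omitted here).

```python
from typing import Iterable, List, Tuple

Row = Tuple[str, float]  # (English side of the pair, estimation score)


def prune(rows: Iterable[Row], threshold: float = 0.70) -> List[Row]:
    """Keep only the rows whose score reaches the threshold."""
    return [row for row in rows if row[1] >= threshold]


rows = [
    ("cat ears", 0.780979),
    ("artistic works", 0.6281),
    ("golf club", 0.976473),
    ("acne products", 0.695547),
]
for english, score in prune(rows):
    print(f"{english}\t{score}")
```

Raising the threshold trades corpus size for precision; at 0.70 the two lower-scored rows in this sample are dropped.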
Applications
Credit to https://www.shutterstock.com/image-illustration/
MultiMWE Application
• Theoretically and technically, the MultiMWE corpus should be
useful for most cross-lingual/multilingual NLP tasks

• Here we introduce the experiments we ran on MT

• De->En and Zh->En

• All models are attention-based NMT models built with the THUMT toolkit.

• 5 million training sentences for each language pair

• Development and test sets from the WMT workshop
www.statmt.org
MultiMWE Application
• 口水戰 (Kǒushuǐ zhàn): war of words.
Translated as “water fighting” by the
baseline, but properly as “oral
combat” by the MWE-enhanced
model, although BLEU does not reflect
this.

• The baseline error arises because this
is a metaphorical expression in Chinese,
口水+戰, a combination of
saliva (səˈlʌɪvə) + war

• The modifier MWE 所謂 expresses
“supposed” or “so-called”; the baseline
misses it.
Examples of MWE translations in MT outputs
Src
Ref the leaders of Russia and Turkey met on Tuesday to shake hands and declare a formal end to an eight - month long war of words and economic sanctions .
Base Russian and Turkish leaders met Tuesday , shaking hands and declaring the official end of eight months of water fighting and economic sanctions .
B+MWE Russian and Turkish leaders met on Tuesday and both shook hands and announced a formal end of eight months of oral combat and economic sanctions .
Src
Ref the offence was even greater , coming from a supposed friend .
Base attacks from a friend are even harder to accept .
B+MWE the attack from so-called friends is harder to accept .
Src: source; Ref: reference. B+MWE: Baseline+MWE. Simplified
Chinese ( , ) mapping into Traditional ( , ), used in paper.
Other Selected references
• Chinese character for MT, and Named Entity as MWE:

• Han, L. and Kuang, S. (2018). Incorporating Chinese radicals into neural machine translation: Deeper than character
level. Presented in ESSLLI-2018, https://arxiv.org/abs/1805.01565 

• Han, A. L.-F., Wong, D. F., and Chao, L. S. (2013). Chinese named entity recognition with conditional random fields in
the light of Chinese characteristics. In Language Processing and Intelligent Information Systems, pages 57–68.
Springer. 

• The MultiMWE corpus, + paper references: https://github.com/poethan/MWE4MT 

• MWE and tools:

• Sag, I. A., Baldwin, T., Bond, F., Copestake, A., and Flickinger, D. (2002). Multiword expressions: A pain in the neck for
NLP. In Alexander Gelbukh, editor, Computational Linguistics and Intelligent Text Processing, pages 1–15, Berlin,
Heidelberg. Springer Berlin Heidelberg. 

• Rikters, M. and Bojar, O. (2017). Paying Attention to Multi-Word Expressions in Neural Machine Translation. In
Proceedings of the 16th Machine Translation Summit (MT Summit 2017), Nagoya, Japan. 

• Pinnis, M. (2013). Context independent term mapper for European languages. In Proceedings of the International
Conference Recent Advances in Natural Language Processing RANLP 2013, pages 562–570, Hissar, Bulgaria,
September. INCOMA Ltd. Shoumen, BULGARIA. 

• Ramisch, C. (2015). Multiword Expressions Acquisition: A Generic and Open Framework, volume XIV of Theory and
Applications of Natural Language Processing. Springer.

MultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel Corpora

  • 1. MultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel Corpora Lifeng Han, Gareth Jones and Alan Smeaton ADAPT research seminar, 2020 March Accepted long paper in 12th Edition of Language Resources and Evaluation Conference (LREC2020)
  • 2. Outline • Multiword Expressions (MWEs) in ADAPT • - started earlier, EACL-MWE2017 shared task by TCD+DCU • Background of MWEs • - intro MWE, tasks in NLP, corpus, with Machine Translation (MT) • MultiMWE corpus • - how we made it (workflow) • MultiMWE corpus applications • - in MT, but it can go further in Multilingual/Crosslingual NLP task
  • 3. vMWEs in ADAPT • “Detection of Verbal Multi-Word Expressions via Conditional Random Fields with Syntactic Dependency Features and Semantic Re-Ranking". 2017. Alfredo Maldonado, Lifeng Han, Erwan Moreau, Ashjan Alsulaimani, Koel Dutta Chowdhury, Carl Vogel, Qun Liu. The 13th Workshop on Multiword Expressions @ EACL 2017 • - exploits universal syntactic dependency features through a Conditional Random Fields (CRF) sequence model • - Closed Track at the PARSEME VMWE Shared Task 2017, ranking 2nd place in most languages on full VMWE-based evaluation and 1st in three languages on token-based evaluation (Polish, Swedish, French) • - an option to re-rank the 10 best CRF-predicted sequences via semantic vectors, boosting its scores above other systems in the competition. • Chapter: “Semantic reranking of CRF label sequences for verbal multiword expression identification”. 2018. Erwan Moreau, Ashjan Alsulaimani, Alfredo Maldonado, Lifeng Han, Carl Vogel, Koel Dutta Chowdhury. In Stella Markantonatou, Carlos Ramisch, Agata Savary & Veronika Vincze (eds.), Multiword expressions at length and in depth: Extended papers from the MWE 2017 workshop, 177–207. Berlin: Language Science Press. DOI:10.5281/zenodo.1469559
  • 4. Definition • a MWE shall be a term including several words to express a specific concept, which is able to be decomposed, and the words combined together as an MWE are syntactically, semantically, pragmatically, or statistically idiosyncratic in nature.{Sag2002MWE,Tim2010mwe,huning2015MWEs} • MWE (Sag2002): • lexical phrases: fixed or semi-fixed expressions and syntactically-flexible expressions • institutionalized phrases
  • 5. Examples • Categories of MWE: • idioms (…kick the bucket) • Metaphor (…apple of my eye) • compound nouns (tooth + paste -> toothpaste) • word combinations from various part-of-speech set: verb-particle, adjective-verb (dry cleaning), preposition-verb (output), adjective-noun (full moon), preposition-noun (underground), etc. • … • Appearance of MWE: • Continuous (kick off, football magazine) • Dis-continuous (pick someone up) ref: https://www.learnenglish.de/grammar/nouncompound.html
  • 6. MWE events • MWE detection, acquisition, construction, treatment, disambiguation • MWE interaction with other NLP tasks e.g. MT: • EUROPHRAS 4th Workshop on Multi-word Units in Machine Translation and Translation Technology (MUMTTT 2019) — 27 September 2019 — Málaga, • ACL-SIGLEX orgnized MWE related events (since 2003, yearly) • 2003 ACL workshop Multiword Expressions: Analysis, Acquisition and Treatment – Sapporo, Japan • 2020 COLING Joint Workshop on Multiword Expressions and Electronic Lexicons (MWE-LEX 2020) – September 14, 2020 – Barcelona, Spain.
  • 7. Joint topics on MWEs and e-lexicons: Extracting and enriching MWE lists from traditional human-readable lexicons for NLP use Formats for NLP-applicable MWE lexicons Interlinking MWE lexicons with other language resources Using MWE lexicons in NLP tasks (identification, parsing, translation, MWE discovery in the service of lexicography Multiword terms in specialised lexicons Representing semantic properties of MWEs in lexicons Paving the way towards encoding lexical idiosyncrasies in constructions
  • 8. MWE related corpora • MWE aware English Dependency corpus • from LDC(https://catalog.ldc.upenn.edu/LDC2017T01), • - annotated English compound words in the corpus as one kind of MWE • - to facilitate the constituency and dependency parsing task • English MWEs in web reviews data (schneider-etal-2014) • http://www.cs.cmu.edu/~ark/LexSem/ • hand-annotated online review data with comprehensive MWEs including English noun, verb, and preposition super-senses (tags include communication, group, stative, location, possession, etc.). • these MWE corpus construction works are monolingual tasks and focus on English only
  • 9. MWE related corpora • multilingual MWE corpus construction from the PARSEME • - The IC1207 COST Action, is an interdisciplinary scientific network devoted to the role of multi-word expressions (MWEs) in parsing • - includes 18 European languages • - corpus focuses on one kind of MWE (verbal MWE) • - not parallel, and the size of the data varies very much from language to language (some languages have only hundreds of sentences).
  • 10. MWE interaction with MT • Báihuà (⽩白話): contemporary Chinese • 去 (qù) ... 了了(le): is a simple dis-continuous Chinese MWE used to express a past tense action (went to do something, went to somewhere) • The purpose ‘上課 (shangke)’ is not translated. ZH source: ZH pinyin: Xiǎo míng qù xué xiào shàng kè le EN reference: Xiao Ming went to school to attend classes EN MT output: Xiao Ming went to school note: Three zh->en examples from googleMT 2018/10
  • 11. MWE interaction with MT • Poem: Shīgē(詩歌). 花 as ambiguous word. • Well aligned MWEs shall help disambiguation. 花 vs ⼈人 (flower - people) ZH source: ZH pinyin: Nián nián suì suì huā xiāng sì, suì suì nián nián rén bù tóng. EN reference: The flowers are similar each year, while people are changing year after year. EN MT output: One year spent similar, each year is different
  • 12. MWE interaction with MT • Wényán (⽂文⾔言): ancient Chinese. • 安知 as MWE means ‘…do not know…’. Or 安知…哉 means ‘how can … know …?’ ZH source: ZH pinyin: Yàn què ān zhī hóng hú zhī zhì zāi? EN reference (literal): How can a finch know the ambition of a big bird (or swan)? EN MT output: What is the meaning of Yanque Anzhihong? EN reference: The nonsense forks do not know the ambitions of the very motivated people.
  • 13. MWE interaction with MT • Some earlier work about MWE+MT: • {lambert2005mwe} bilingual MWE pairs to modify the word alignment procedure of MT on an English-Spanish corpus. grouping the MWEs as one token before training • {Ren2009mwe} integrated bilingual Chinese-English MWEs into the SMT toolkit Moses • {Bouamor2012IdentifyingBM} which designed models to extract continuous MWEs for French- English translation • {Skadina2016MultiwordEI} which discussed various MWEs in English-Latvian MT • {EBRAHIM2017111mwe} focused on phrasal verb MWEs in Arabic-English phrase-based SMT. • MWE+NMT: • {Li2016neuralname} a character level sequence to sequence modeling to translate named entities and then integrated this into an overall NMT system on a Chinese-to-English • {ugawa2018neural} focuses on the difficulty of translation of compound words in the source language, by introducing an encoder for the input word at the NE tag level at each time step.
  • 14. MultiMWE corpus • MWE extraction pipeline from (Rikters and Bojar, 2017) https://github.com/M4t1ss/MWE-Tools • extend the extraction work into language pairs such as German-English and Chinese-English • extracted MWEs from our experiments are freely available for MT and NLP researchers • However, the extracted MWE candidates from this framework is only the continuous type. In follow up work, we will design some patterns or other models to extract discontinuous MWEs
  • 15. MultiMWE corpus
    • Extract the monolingual MWEs + alignment + filtering
    • [Figure: corpus construction workflow, together with the corpus flow diagram]
  • 16. MultiMWE corpus
    • Some pattern examples used for Chinese monolingual MWE extraction:
    <pat><w pos="a"/><w pos="a"/><w pos="a"/><w pos="n"/></pat>
    <pat><w pos="a"/><w pos="a"/><w pos="a"/><w pos="n"/><w pos="n"/></pat>
    <pat><w pos="a"/><w pos="a"/><w pos="a"/><w pos="n"/><w pos="n"/><w pos="n"/></pat>
    <pat><w pos="a"/><w pos="a"/><w pos="n"/></pat>
    <pat><w pos="a"/><w pos="a"/><w pos="n"/><w pos="a"/><w pos="n"/></pat>
    <pat><w pos="a"/><w pos="a"/><w pos="n"/><w pos="n"/></pat>
    <pat><w pos="a"/><w pos="a"/><w pos="n"/><w pos="n"/><w pos="n"/></pat>
    <pat><w pos="a"/><w pos="a"/><w pos="n"/><w pos="n"/><w pos="n"/><w pos="n"/></pat>
    <pat><w pos="a"/><w pos="c"/><w pos="a"/><w pos="n"/></pat>
    <pat><w pos="a"/><w pos="c"/><w pos="a"/><w pos="n"/><w pos="n"/></pat>
    <pat><w pos="a"/><w pos="c"/><w pos="a"/><w pos="n"/><w pos="n"/><w pos="n"/></pat>
    <pat><w pos="a"/><w pos="c"/><w pos="n"/><w pos="n"/></pat>
    <pat><w pos="a"/><w pos="c"/><w pos="n"/><w pos="n"/><w pos="n"/></pat>
    <pat><w pos="a"/><w pos="n"/></pat>
    <pat><w pos="a"/><w pos="n"/><w pos="a"/><w pos="a"/><w pos="n"/></pat>
    <pat><w pos="a"/><w pos="n"/><w pos="a"/><w pos="n"/></pat>
    <pat><w pos="a"/><w pos="n"/><w pos="a"/><w pos="n"/><w pos="n"/></pat>
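How such PoS patterns select continuous MWE candidates can be illustrated with a small sketch. It is not the MWEtoolkit implementation: the `<patterns>` wrapper element, the function names, and the English stand-in words are assumptions made for readability, and the tags are read as a = adjective, n = noun, c = conjunction.

```python
import xml.etree.ElementTree as ET

# Three of the patterns above, wrapped in a hypothetical <patterns> root
# so that the snippet is a single well-formed XML document.
PATTERNS_XML = """<patterns>
<pat><w pos="a"/><w pos="n"/></pat>
<pat><w pos="a"/><w pos="a"/><w pos="n"/></pat>
<pat><w pos="a"/><w pos="c"/><w pos="a"/><w pos="n"/></pat>
</patterns>"""


def load_patterns(xml_text):
    """Turn each <pat> element into a tuple of PoS tags."""
    root = ET.fromstring(xml_text)
    return [tuple(w.get("pos") for w in pat.findall("w"))
            for pat in root.findall("pat")]


def find_candidates(tagged, patterns):
    """Return every contiguous word span whose tag sequence equals a pattern."""
    tags = [tag for _, tag in tagged]
    hits = []
    for start in range(len(tagged)):
        for pat in patterns:
            end = start + len(pat)
            if tuple(tags[start:end]) == pat:
                hits.append(" ".join(word for word, _ in tagged[start:end]))
    return hits


# English stand-ins for a PoS-tagged sentence (illustrative only).
tagged = [("small", "a"), ("dustpan", "n"), ("and", "c"),
          ("long", "a"), ("tail", "n")]
candidates = find_candidates(tagged, load_patterns(PATTERNS_XML))
print(candidates)
```

Because the match is on contiguous tag sequences, this kind of matcher can only propose continuous candidates, which is the limitation noted on the earlier slide.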
  • 17. MultiMWE corpus
    • Workflow, taking the de-en language pair as an example:
    • Morphological tagging of De and En.
    • Convert the tagged De/En text into XML format.
    • Design MWE patterns for De/En.
    • Extract monolingual MWEs with MWEtoolkit.
    • Generate De-En lexical translation probability files with Giza++ and Moses.
    • Align bilingual MWEs with MPAligner.
  • 18. MultiMWE corpus
    • Chinese-English MWEs: similar to de-en, except for:
    • PoS pattern design.
    • Stop-word list preparation.
    • Zh-En translation probability files.
  • 19. MultiMWE corpus
    • We added Chinese MWE patterns based on the LCMC tags, in addition to the PoS tags mapped from English.
    • Lancaster Corpus of Mandarin Chinese (LCMC): https://www.lancaster.ac.uk/fass/projects/corpus/LCMC/
  • 20. MultiMWE corpus
    • Extracted zh-en pairs (without running), head of the file. Most scores are above 0.6.
    Chinese    English                 Estimation
               cat ears                0.780979
               long tail               0.820427
               small dustpan           0.856796
               artistic works          0.6281
               group table             0.708438
               computer expert         0.801311
               golf club               0.976473
               acne products           0.695547
               different conditions    0.887839
               evergreen plant         0.610852
    • 簸箕 (Bòji): a non-decomposable MWE.
    • 电脑 (Diànnǎo): electricity + brain.
    • Note: ( )->( ), simplified to traditional Chinese characters, as used in the paper content.
  • 21. MultiMWE corpus
    • Pruned de-en pairs with threshold 0.70, with examples from the head of the file.
    German                                English                          Estimation
    europäische Kommission                european commission              0.970964
    upcoming events                       upcoming events                  1
    europäischen Kommission               european commission              0.990844
    praktische Informationen              practical information            0.948533
    östlichen Teils                       eastern part                     0.793047
    private Konzession                    private concession               0.921197
    französische Staat                    french state                     0.853861
    europäischen Rat                      european council                 0.984224
    größeren Infrastrukturprojekten       major infrastructure projects    0.853873
    zwischengeschalteten Banken           intermediary banks               0.754617
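The pruning step can be sketched as a filter over scored pairs. This is a minimal sketch under assumptions: the tab-separated "source, target, score" line layout, the function name `prune_pairs`, and the inclusive comparison (score >= threshold) are hypothetical, and the last input row is a made-up low-scoring pair added only to show a rejection.

```python
def prune_pairs(lines, threshold=0.70):
    """Keep bilingual MWE pairs whose alignment score reaches the threshold.

    Assumption: each line is 'source<TAB>target<TAB>score'.
    """
    kept = []
    for line in lines:
        source, target, score = line.rstrip("\n").split("\t")
        if float(score) >= threshold:
            kept.append((source, target, float(score)))
    return kept


rows = [
    "europäische Kommission\teuropean commission\t0.970964",
    "zwischengeschalteten Banken\tintermediary banks\t0.754617",
    "Beispiel Paar\texample pair\t0.42",  # hypothetical row, not from the corpus
]
pruned = prune_pairs(rows)
for source, target, score in pruned:
    print(f"{source}\t{target}\t{score}")
```

Raising the threshold trades corpus size for alignment precision, so the value would normally be chosen per language pair.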
  • 23. MultiMWE Application
    • In theory and in practice, the MultiMWE corpus should be useful for most cross-lingual/multilingual NLP tasks.
    • Here we introduce the experiments we ran on MT: De->En and Zh->En.
    • All-attention-based NMT model, using the THUMT toolkit.
    • 5 million training sentences for each language pair.
    • Development and test sets from the WMT workshop (www.statmt.org).
  • 25. MultiMWE Application
    • Examples of MWE translations in MT outputs.
    • 口水戰 (Kǒushuǐ zhàn): war of words. The baseline translates it as "water fighting", whereas the MWE-enhanced model renders it more appropriately as "oral combat", although BLEU does not reflect this.
    • The baseline error arises because this is a metaphorical expression in Chinese: 口水+戰 combines saliva (səˈlʌɪvə) + war.
    • The modifier MWE 所謂, used to express "supposed" or "so-called", is missed by the baseline.
    • Example 1
      Src:
      Ref: the leaders of Russia and Turkey met on Tuesday to shake hands and declare a formal end to an eight - month long war of words and economic sanctions .
      Base: Russian and Turkish leaders met Tuesday , shaking hands and declaring the official end of eight months of water fighting and economic sanctions .
      B+MWE: Russian and Turkish leaders met on Tuesday and both shook hands and announced a formal end of eight months of oral combat and economic sanctions .
    • Example 2
      Src:
      Ref: the offence was even greater , coming from a supposed friend .
      Base: attacks from a friend are even harder to accept .
      B+MWE: the attack from so-called friends is harder to accept .
    • Src: source; Ref: reference; Base: Baseline; B+MWE: Baseline+MWE. Simplified Chinese ( , ) mapped into Traditional ( , ), as used in the paper.
  • 26. Other Selected references
    • Chinese characters for MT, and Named Entities as MWEs:
    • Han, L. and Kuang, S. (2018). Incorporating Chinese radicals into neural machine translation: Deeper than character level. Presented at ESSLLI-2018, https://arxiv.org/abs/1805.01565
    • Han, A. L.-F., Wong, D. F., and Chao, L. S. (2013). Chinese named entity recognition with conditional random fields in the light of Chinese characteristics. In Language Processing and Intelligent Information Systems, pages 57–68. Springer.
    • The MultiMWE corpus and paper references: https://github.com/poethan/MWE4MT
    • MWE and tools:
    • Sag, I. A., Baldwin, T., Bond, F., Copestake, A., and Flickinger, D. (2002). Multiword expressions: A pain in the neck for NLP. In Alexander Gelbukh, editor, Computational Linguistics and Intelligent Text Processing, pages 1–15, Berlin, Heidelberg. Springer Berlin Heidelberg.
    • Rikters, M. and Bojar, O. (2017). Paying Attention to Multi-Word Expressions in Neural Machine Translation. In Proceedings of the 16th Machine Translation Summit (MT Summit 2017), Nagoya, Japan.
    • Pinnis, M. (2013). Context independent term mapper for European languages. In Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013, pages 562–570, Hissar, Bulgaria, September. INCOMA Ltd. Shoumen, BULGARIA.
    • Ramisch, C. (2015). Multiword Expressions Acquisition: A Generic and Open Framework, volume XIV of Theory and Applications of Natural Language Processing. Springer.