Machine Translation of Discontinuous Multiword Units
DISCO, NAACL - San Diego USA
June 17th 2016
ANABELA	BARREIRO	
INESC-ID	
FERNANDO	BATISTA	
INESC-ID,	ISCTE-IUL
•  Introduction	
–  Discontinuous	Multiword	Units	(DMWU)	in	NLP	
–  Main	Current	Shortcomings	
–  Our	Goal	
•  CLUE-Aligner	Alignment	Tool	
•  The	Logos	Model	
–  Alignment	of	DMWU	Inspired	by	Logos	
•  bring	[	]	to	a	conclusion	
•  set	[	]	in	motion	
•  play	[	]	role	
•  take	[	]	interest	in	
•  keep	[	]	informed	about	
•  Preliminary	Results	
–  Analysis	of	Preliminary	Results	
•  Advantages	of	the	Logos	Model	
•  Conclusions	and	Future	Directions	
•  Final	Remark	
–  The	eSPERTo	Project	
Outline	
•  Increasing interest in multiword units (MWU) in the field of NLP
	
“lexical	 items	 that:	 (a)	 can	 be	 decomposed	 into	 multiple	 lexemes;	 and	 (b)	 display	 lexical,	
syntactic,	semantic,	pragmatic	and/or	statistical	idiomaticity”	(Baldwin	and	Kim	2010)	
	
•  Compositionality property – makes automatic processing of MWU
particularly challenging
–  Free	combinations		
•  round table = meeting
–  Opaque	meanings	
•  piece	of	cake	=	easy	to	do	
•  pay	a	visit	=	visit	
–  Cannot	be	translated	word-for-word		
•  raining	cats	and	dogs	
–  Allow	insertions	(=	words	that	are	not	part	of	the	unit)	
•  to	bring	[INSERTION]	to	a	conclusion	
I	would	urge	the	European	Commission	to	bring	the	process	of	adopting	the	directive	on	additional	pensions	to	a	conclusion	
Introduction	
•  Non-adjacent	linguistic	phenomena	–	remote	dependency	
•  Common	across	languages	
•  Difficult to recognize and process
•  Remain	a	problem	for	NLP	applications	
•  Lack	of	formalization	still	triggers	problems	with	the	syntactic	and	
semantic	analysis	of	sentences	containing	MWU	
•  Impairment	of	NLP	systems’	performance	
•  Cause	MT	to	fail	in	assigning	the	correct	translation	
•  For SMT systems, DMWU constitute significant challenges to
correct word and phrase alignment (Shen et al. 2009), and
therefore, to high quality MT
Discontinuous	Multiword	Units	in	NLP	
•  Linguistic	knowledge	is	still	limited	in	most	systems	
–  Some SMT methodologies rely mostly on statistics to train/evaluate
MT systems, use probabilistic alignments with little or no linguistic
knowledge, and disregard syntactic discontinuity.
–  Inability to identify MWU correctly results in translation deficiencies.
•  Lack	 of	 publicly	 available	 manual	 multilingual	 datasets,	 and	 of	
linguistically	motivated	alignment	guidelines	
–  Publicly	 available	 alignments	 are	 mostly	 bilingual,	 with	 some	
exceptions	(Graça	et	al.	2008)	
–  Guidelines cover cross-linguistic phenomena superficially, excluding
important alignment challenges presented by DMWU.
•  Lack	of	more	robust	alignment	tools	
–  Limitations in assisting human annotators in the task of correctly
identifying and aligning DMWU and producing rules from them.
Main	Current	Shortcomings	
•  Present	an	experimental	empirical	analysis	of	DMWU	
•  Stress	the	relevance	of	correct	(and	non-arbitrary)	alignment	of	DMWU	
•  Highlight an alignment methodology inspired by the Logos Model (Scott,
2003; Barreiro et al., 2011) and the Semtab function to deploy
semantico-syntactic knowledge that makes it possible to translate DMWU
with high fidelity
•  Illustrate	DMWU	manual	alignments	produced	with	CLUE-Aligner	–	Cross-
Language	 Unit	 Elicitation	 –	 a	 Web	 alignment	 interactive	 tool	 (Barreiro,	
Raposo,	Luís	2016)	
	
*Even though similar in name to the “clue alignment approach” (Tiedemann, 2003; 2004; 2011),
mainly	 devoted	 to	 word-level	 alignment,	 our	 approach	 is	 theoretically	 and	 methodologically	
different	 with	 a	 focus	 on	 phrase	 alignment,	 contemplating	 multiwords	 and	 linguistically-relevant	
phrasal	units.	
	Our	goal	
•  Allows the block alignment of contiguous and discontinuous MWU
•  Uses	a	matrix	visualization	and	coloring	schemes	that	help	distinguish	
between	sure	(S)	and	possible	(P)	alignments	
•  Allows	storage	of	pairs	of	paraphrastic	units,	with	indication	of	the	place	
of	insertions,	represented	by	"[	]"		
–  I	urge	[	]	to	|	Exorto	[	]	a	
–  This	 feature	 is	 valuable	 in	 the	 construction	 of	 translation	 rules	 or	
grammars	and	syntactic	parsers	that	use	those	paraphrastic	pairs,	for	
which	precision	is	important	
–  It is also important in machine learning (ML), where it helps in learning constituents
CLUE-Aligner
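The stored pairs with an insertion slot could be represented roughly as follows. This is a minimal Python sketch, not CLUE-Aligner's actual storage format: the `UnitPair` class, its `certainty` field and the filler phrases are illustrative assumptions; only the pair "I urge [ ] to | Exorto [ ] a" and the S/P distinction come from the slide.

```python
from dataclasses import dataclass

SLOT = "[ ]"  # insertion marker, as written on the slide


@dataclass(frozen=True)
class UnitPair:
    """A pair of paraphrastic units with one insertion slot on each side."""
    source: str            # e.g. "I urge [ ] to"
    target: str            # e.g. "Exorto [ ] a"
    certainty: str = "S"   # assumed label: "S" (sure) or "P" (possible)

    def fill(self, source_insertion, target_insertion):
        """Instantiate both sides by replacing the slot with concrete insertions."""
        return (
            self.source.replace(SLOT, source_insertion, 1),
            self.target.replace(SLOT, target_insertion, 1),
        )


pair = UnitPair("I urge [ ] to", "Exorto [ ] a")
# The fillers below are made up for illustration.
print(pair.fill("the Commission", "a Comissão"))
# ('I urge the Commission to', 'Exorto a Comissão a')
```

Keeping the slot explicit in the stored string means the same pair can later be reused as a translation rule for any insertion, which is the use the slide describes.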
CLUE-Aligner Interface
[Figure: CLUE-Aligner alignment matrices. Panels: "Single Word Alignments and Block Alignments" and "DMWs and Insertions". Callouts mark the insertions, the pre-processing of contracted forms, and the alignment still | ainda.]
•  Black cells represent full/optimal semantic correspondence
•  Grey cells represent approximate semantic correspondence
•  Light orange cell groups represent unaligned P-insertions
•  Dark orange cell groups represent unaligned S-insertions
•  Light green cell / cell groups represent aligned P-insertions
•  Dark green cell / cell groups represent aligned S-insertions
•  Integrates	 semantic	 and	 contextual	 knowledge	 and	 applies	 it	 to	 the	
translation	process	
•  Precision	 is	 associated	 with	 the	 application	 of	 Semtab	 semantic	 and	
contextual	data-driven	pattern-rules,	which	are	deep	structure	patterns	
that	 match	 on	 (apply	 to)	 a	 great	 variety	 of	 surface	 structures,	 including	
DMWU	
–  deal(VI)	with	N(questions)	=	s’occuper	de	N	
•  Alignments	that	mirror	Semtab	semantic	nuances	can	help	create	new	MT	
systems	and	improve	existing	ones	
The	Logos	Model	
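The idea of a pattern-rule that matches on lemmas and semantic classes rather than on surface strings can be illustrated with a toy Python sketch. It is not the Logos or Semtab rule format: the `Token` type, the tagset, the semantic class name "question" and the function `match_deal_with` are all assumptions made for this example, loosely modeled on the rule deal(VI) with N(questions) = s'occuper de N.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Token(NamedTuple):
    form: str
    lemma: str
    pos: str                   # toy tagset: "PRON", "V", "ADV", "PREP", "DET", "N"
    sem: Optional[str] = None  # toy semantic class, e.g. "question"


def match_deal_with(tokens: Sequence[Token]) -> Optional[Tuple[int, int, int]]:
    """Return (verb, preposition, noun) positions if the toy rule applies.

    The rule fires when a form of "deal" is followed, possibly at a distance,
    by "with" and the first noun after "with" has the class "question".
    """
    for v, tok in enumerate(tokens):
        if tok.lemma != "deal" or tok.pos != "V":
            continue
        for p in range(v + 1, len(tokens)):
            if tokens[p].lemma == "with" and tokens[p].pos == "PREP":
                for n in range(p + 1, len(tokens)):
                    if tokens[n].pos == "N":
                        if tokens[n].sem == "question":
                            return v, p, n
                        break  # first noun has the wrong semantic class
                break  # only the first "with" after the verb is considered
    return None


sentence = [
    Token("We", "we", "PRON"),
    Token("dealt", "deal", "V"),
    Token("quickly", "quickly", "ADV"),   # insertion between verb and preposition
    Token("with", "with", "PREP"),
    Token("the", "the", "DET"),
    Token("issue", "issue", "N", "question"),
]
print(match_deal_with(sentence))  # (1, 3, 5)
```

Because the match is on the lemma "deal" and on the noun's class, the same rule covers "deals with", "dealt quickly with" and similar surface variants; a real system would then apply the target-side transfer (here, s'occuper de N) to the matched positions.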
Alignment	of	DMWU	Inspired	by	Logos	
•  Europarl	corpus	(Koehn	2005)	-	contains	a	large	number	of	occurrences	of	
DMWU	(subset	with	47.4	million	words)	
•  5 cases of support verb constructions (SVC) illustrate “bad” translation errors
–  Search	was	performed	on	all	forms	of	each	verb	
	
–  Learning automatic models to deal with DMWU may not be
straightforward
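The slides do not say how the corpus search was implemented. One possible way to retrieve candidate sentences for all forms of each verb is sketched below in Python. The hand-listed verb forms, the `MAX_GAP` cap and the function names are assumptions; the five templates are the cases named in the outline.

```python
import re

# Hand-listed inflected forms (an assumption; a lemmatizer could replace this).
VERB_FORMS = {
    "bring": ["bring", "brings", "bringing", "brought"],
    "set": ["set", "sets", "setting"],
    "play": ["play", "plays", "playing", "played"],
    "take": ["take", "takes", "taking", "took", "taken"],
    "keep": ["keep", "keeps", "keeping", "kept"],
}

TEMPLATES = [
    "bring [ ] to a conclusion",
    "set [ ] in motion",
    "play [ ] role",
    "take [ ] interest in",
    "keep [ ] informed about",
]

MAX_GAP = 12  # assumed cap on the insertion length, in tokens


def compile_template(template):
    """Turn 'verb [ ] tail' into a regex with a bounded, lazy token gap."""
    head, tail = template.split(" [ ] ")
    verb = "|".join(sorted(VERB_FORMS[head], key=len, reverse=True))
    tail_re = r"\s+".join(re.escape(word) for word in tail.split())
    gap = r"(?P<insertion>(?:\s+\S+){0," + str(MAX_GAP) + r"}?)"
    return re.compile(
        r"\b(?:" + verb + r")\b" + gap + r"\s+" + tail_re + r"\b",
        re.IGNORECASE,
    )


PATTERNS = {template: compile_template(template) for template in TEMPLATES}


def find_units(sentence):
    """Yield (template, insertion) for every template found in the sentence."""
    for template, pattern in PATTERNS.items():
        match = pattern.search(sentence)
        if match:
            yield template, match.group("insertion").strip()


example = ("I would urge the European Commission to bring the process of "
           "adopting the directive on additional pensions to a conclusion")
print(list(find_units(example)))
```

A purely string-based search like this over-generates (it can match across clause boundaries), so its hits are candidates for manual inspection rather than confirmed DMWU.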
bring	[	]	to	a	conclusion	
Contains a 9-word insertion
Alignment	of	the	EN	discontinuous	
SVC	with	the	PT	equivalent	stylistic	
variant,	the	compound	verb	
“apressar-se	a	apresentar”
set	[	]	in	motion	
Alignment	of	the	EN	discontinuous	
SVC	with	the	equivalent	PT	single	
verb	“empreender”	(“undertake”)	
Insertion of the direct object
play	[	]	role	
Alignment	of	the	EN	SVC	with	the	
equivalent	PT	non-elementary	SVC	
“desempenham	um	papel”	
Insertion of an adjective modifier
take	[	]	interest	in	
Insertion of an adjective
Alignment of an EN discontinuous
prepositional verb with its equivalent FR
reflexive prepositional verb
	
FR	translation	also	contains	insertions
keep	[	]	informed	about	
Insertion of an adverb
Alignment of an EN discontinuous SVC
with	a	prepositional	adjective	with	its	
ES	equivalent	prepositional	SVC	
	
EN	translation	also	contains	an	adverbial	
insertion
•  Selected the first 20 sentences from the corpus subset for each of the 5 DMWU cases
•  Translated	each	sentence	with	Google	Translate	to	verify	translation	quality	
•  Performed	an	empirical	evaluation	of	the	achieved	translations	
Preliminary	Results	
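[Figure: bar chart showing, for each of the five DMWU (bring [ ] to a conclusion, set [ ] in motion, play [ ] role, take [ ] interest in, keep [ ] informed about), the number of translations judged correct versus incorrect, inadequate or non-optimal (literal, unnatural); vertical axis from 0 to 25.]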
Analysis	of	Preliminary	Results	
DMWU (support verb construction) | Google Translate | Correct translation
to bring [this dossier] to a conclusion | trazer a uma conclusão | concluir / terminar [este dossier]
set […] in motion | estabeleceu […] em movimento | iniciou / pôs em marcha […]
play [the] role | jogar [o] papel | desempenhar [o] papel
take [a lukewarm] interest in | *ter um interesse [*morna] em | manifeste / demonstre um interesse [morno/fraco/ténue]
keep [us] informed about | *tem [nos] *manteve informados sobre | nos tem mantido informados sobre / nos tem informado sobre
	
	
EN	–	It	is	unacceptable	for	the	Commission	only	to	take	a	lukewarm	interest	in	a	country.		
PT-GT	–	É	inaceitável	que	a	Comissão	só	a	*ter	um	interesse	morna	em	um	país.	
	
Lexical	errors	related	to	DMWU	+	Structural	errors	
•  Lack	of	agreement	(para	nos	manter	regular	e	estreitamente	
*informado	sobre;	que	o	Parlamento	*ser	bem	*informados	sobre)	
•  Incorrect word order (se conseguirmos *a adoptar e defini-lo em
movimento)
•  Etc.
Advantages	of	the	Logos	Model	
•  Consistent and efficient solution to process DMWU, not consistently
processed in former word or phrase alignment techniques
•  Ability	 to	 relate	 constituents	 that	 are	 apart	 (even	 very	 far	 apart)	 in	 the	
sentence	
•  Consistent	way	to	analyze	and	translate	words	in	context	
•  Ability	to	generalize	between	alternative	forms	of	the	same	MWU,	phrase	
or	expression	(take	a	walk	=	walk)	
•  Semtab offers a robust solution to the problem of open-class items and
less frequent MWU and phrases, which an SMT system cannot learn quickly
or translate correctly, and whose mistranslations show up in MT output
(and also in non-native speakers' usage)
–  make	a	visit	or	pay	a	visit?	
•  MWU are not processed on a word-for-word basis; they represent atomic
semantico-syntactic and translation units
•  Standard MT systems can benefit from a correct processing of DMWU
•  currently not being explored efficiently
•  processing, recognition and translation of DMWU is challenging
•  Some methodologies are inefficient
•  they	violate	the	intrinsic	property	of	the	unit	as	an	atomic	group	of	
elements	
•  elements	of	the	unit	cannot	be	separated	or	aligned	individually	
•  unit	boundaries	need	to	be	respected	
•  Post-editing	efforts	can	be	minimized	by	improving	alignment	quality	
•  Even though we analyzed just a few cases of SVC, our findings point to a
general lack of quality in the translation of DMWU (and discontinuous
phrasal expressions)
Conclusions	and	Future	Directions	
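The requirement that a unit be aligned as an atomic block can be expressed as a simple check on a set of alignment links. The Python sketch below is one possible formulation, not the authors' procedure: the function `aligned_as_block`, the link representation and the filler "the reform / a reforma" are assumptions; "empreender" and "estabeleceu […] em movimento" are the renderings of set [ ] in motion shown on the slides.

```python
def aligned_as_block(links, unit_positions):
    """Check that a source unit is aligned as one atomic block.

    links: set of (source_position, target_position) pairs.
    unit_positions: non-empty set of source positions forming the unit.
    True if every unit token links to the same non-empty set of target
    positions and no token outside the unit links into that set.
    """
    target_sets = [
        frozenset(t for s, t in links if s == pos) for pos in unit_positions
    ]
    block = target_sets[0]
    if not block or any(ts != block for ts in target_sets):
        return False
    return all(s in unit_positions for s, t in links if t in block)


# EN: set the reform in motion (unit = positions 0, 3, 4; insertion = 1, 2)
unit = {0, 3, 4}

# Block alignment to PT "empreender a reforma": the whole unit maps to "empreender".
block_links = {(0, 0), (3, 0), (4, 0), (1, 1), (2, 2)}

# Word-for-word alignment to "estabeleceu a reforma em movimento".
literal_links = {(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)}

print(aligned_as_block(block_links, unit))    # True
print(aligned_as_block(literal_links, unit))  # False
```

The word-for-word alignment fails the check because each element of the unit is linked individually, which is the boundary violation described above.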
•  Validation	
•  Broader quantification of phenomena needed to validate exploratory results
•  Evaluation	
•  Evaluation	of	the	performance	in	hierarchical	phrase	and	syntax-based	MT	and	
neural	network	translation	models	(with	theoretical	capacity	to	learn	DMWU)	
•  Annotation	
•  Manual	multilingual	alignments	(gold	sets)	
•  Alignment	Guidelines	
•  Improved	and	enlarged	sets	of	linguistically-based/motivated	alignment	
guidelines	(gold	standards)	
•  Cross-Linguistic	Analysis	
•  Deep	analysis	of	challenging	cross-linguistic	phenomena,	including	DMWU	
•  Rule	/	Grammar	Construction	
•  Translation	rules	extracted	from	quality	manually-annotated	corpora	
•  Tool	Enhancement	and	Automation	
•  Feed	CLUE-Aligner	with	manual	training	data	and	enhance	the	tool	for	automatic	
alignment	and	extraction	of	large	amounts	of	translation	pairs	for	MT	case	studies	
•  Translation	Applications	
•  Increase	precision	and	recall	in	MT	systems	
•  Paraphrases	
•  Methodology	and	resources	-	a	valuable	asset	for	applications	requiring	paraphrases	
Conclusions	and	Future	Directions	
•  Extreme	importance	of	paraphrases	for	translation	(human	and	MT)	
•  Paraphrastic	knowledge	allows	choosing	the	best/optimal	translations	
from	a	set	of	possible	translations	
	
	EN	–	It	is	time	to	bring	this	issue	to	a	conclusion.		
	EN	–	We	must	bring	this	episode	to	a	conclusion.		
	
	PT	–	Está	na	hora	de	resolver	esta	questão.	
	PT	–	Chegou	a	hora	de	concluir	este	assunto.	
	PT – Punhamos um ponto final neste tema.
	PT	–	Temos	de	concluir	este	episódio.	
	
	Suggest	your	own	paraphrase!	
Final	Remark	
The	eSPERTo	Project	
the man who is American
the man from America
the man with American nationality
…
The American man
https://esperto.l2f.inesc-id.pt/esperto/esperto/demo.pl
Paraphrases 4 Translation (Human + MT)
Thank	you!	
Acknowledgements
This research work was supported by Fundação para a Ciência e a Tecnologia (FCT), under project
eSPERTo EXPL/MHC-LIN/2260/2013, UID/CEC/50021/2013, and post-doctoral grant SFRH/BPD/91446/2012.
