Dealing with Lexicon Acquired from
Comparable Corpora
Post-edition and Exchange
Estelle Delpech, Lingua et Machina
Béatrice Daille, U. de Nantes - LINA

Working with lexicon acquired from comparable corpora
I. Terminology acquisition from comparable corpora: quick overview
II. A tool for terminology post-edition
III. Data exchange: a TBX variant for automatically acquired lexicons
IV. Future work

Part I
Terminology Acquisition from
Comparable Corpora

Terminology acquisition from comparable corpora
Comparable corpora:
“Two corpora, in two languages l1 and l2 respectively, are said to be
‘comparable’ if there exists a substantial part of the vocabulary of the
corpus in language l1 whose translation can be found in the corpus in
language l2.”
(my translation of [Déjean and Gaussier, 2002])
Advantages:
– Availability
– Real usage

Terminology acquisition from comparable corpora
Terminology extraction: a contextual analysis
– Compare the contexts of source and target terms
– If the contexts are similar, there is a good chance that the source and
  target terms are translations of each other, e.g.:
    mastectomy: reconstruction, prophylactic, treat, undergo, removal
    mastectomie: reconstruction, prophylactique, traiter, subir, ablation

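The context comparison can be illustrated with a minimal Python sketch. The slides do not say how contexts are weighted or compared, so everything here is an assumption made for illustration: the seed dictionary used to translate context words, the uniform weights, the cosine measure, and the distractor term "radiographie" with its context words.

```python
import math

# Hypothetical seed dictionary (English -> French); not given in the slides.
SEED_DICT = {
    "reconstruction": "reconstruction",
    "prophylactic": "prophylactique",
    "treat": "traiter",
    "undergo": "subir",
    "removal": "ablation",
}

def translate_context(context, seed_dict):
    """Map source context words to the target language, dropping unknown words."""
    translated = {}
    for word, weight in context.items():
        if word in seed_dict:
            target = seed_dict[word]
            translated[target] = translated.get(target, 0.0) + weight
    return translated

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_candidates(source_context, target_contexts, seed_dict):
    """Return (score, target term) pairs, best candidate first."""
    translated = translate_context(source_context, seed_dict)
    scored = [(cosine(translated, ctx), term) for term, ctx in target_contexts.items()]
    return sorted(scored, reverse=True)

# Context words from the slide, with uniform weights (an assumption).
source = {w: 1.0 for w in ["reconstruction", "prophylactic", "treat", "undergo", "removal"]}
targets = {
    "mastectomie": {w: 1.0 for w in ["reconstruction", "prophylactique", "traiter", "subir", "ablation"]},
    # Hypothetical distractor term, added only to make the ranking visible.
    "radiographie": {w: 1.0 for w in ["image", "subir", "thorax"]},
}
ranking = rank_candidates(source, targets, SEED_DICT)
```

In this toy setting "mastectomie" ranks first because its context matches the translated context of "mastectomy" word for word, while "radiographie" shares only one context word.
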
Terminology acquisition from comparable corpora
Outputs one-to-many alignments
– Evaluation: precision on the Top N best alignments
– Example for “mastectomy”:
    0.92 ablation
    0.89 mastectomie
    0.48 opération
Results
– Not as good as acquisition from parallel corpora!
– Fung (1997): 30% accuracy on the Top 20 candidates
– Morin et al. (2004): for complex terms, the translation is usually ranked 34th

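One common reading of "precision on the Top N" is the share of source terms whose reference translation appears among the N best candidates. The sketch below assumes that reading; the function name and the reference translation table are hypothetical, and the candidate list reuses the scores shown on the slide.

```python
def precision_at_n(ranked_candidates, reference, n):
    """Share of source terms with a reference translation among their top-n candidates."""
    if not ranked_candidates:
        return 0.0
    hits = 0
    for source_term, candidates in ranked_candidates.items():
        top_n = {term for _, term in candidates[:n]}
        if reference.get(source_term, set()) & top_n:
            hits += 1
    return hits / len(ranked_candidates)

# One-to-many alignment taken from the slide: (score, candidate), best first.
alignments = {
    "mastectomy": [(0.92, "ablation"), (0.89, "mastectomie"), (0.48, "opération")],
}
# Hypothetical reference lexicon used for evaluation.
reference = {"mastectomy": {"mastectomie"}}

top1 = precision_at_n(alignments, reference, 1)  # 0.0: the best candidate is "ablation"
top2 = precision_at_n(alignments, reference, 2)  # 1.0: "mastectomie" is ranked second
```

The example shows why post-edition matters: the expected translation is present in the output but not in first position.
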
Part II
A Tool for Post-edition

A tool for post-edition
Existing tools:
– ArayaTermExtractor (Waldhör 2006)
– iView (Merkel and Foo, 2007)
– Xerox Terminology Suite ®
Our needs:
– Deal with one-to-many alignments
– Non-aligned contexts
– Allow non-binary annotation
– Display useful information to help find the right candidate in the corpus

“Useful” information
→ Knowledge that helps capture the in vivo behavior of terms
→ Text-driven, term-oriented approach
Useful information:
– Variants
– Collocations
– Distributional neighbors
– Contexts
→ To be harvested during the term extraction / alignment process

Useful information: example

English                  French
Mastectomy               Mastectomie
risk-reducing ~          ~ préventive
simple ~                 ~ simple
Tumorectomy              Tumorectomie
Lumpectomy               Ablation
Oophorectomy             Opération

English context: “...patient may choose to have risk-reducing bilateral mastectomy if they have a strong family history of breast cancer...”
French context: “...la mastectomie préventive pourrait supprimer la grande majorité du risque de développer un cancer...”

Post-edition interface
http://80.82.238.151/Metricc/InterfaceValidation, user “test”, no password

Part III
Data Exchange: a TBX variant for automatically acquired lexicon

Quick introduction to TBX (1)
– TBX: TermBase eXchange
– Open, XML-based standard for exchanging structured terminological data
– Approved as an international standard by LISA and ISO (standard 30042)
– Maps to the TMF data model
– Subset of MARTIF
– Designed for various use cases
– Customizable

Quick introduction to TBX (2)
2 components:
– Structure: core structure based on the TMF metamodel
– Content: formalism to express data categories and their constraints
[Figure: form (core DTD/Schema) combined with content (an XCS file): the default XCS gives default TBX, XCS1 gives TBX variant 1, ..., XCSn gives TBX variant n. Adapted from ISO standard 30042:2008, Fig. 4, p. 30]

Quick introduction to TBX (3)
– Form defined in DTD
– Content defined in XCS
[Figure: data categories shown: respPerson, responsibility, reliabilityCode, partOfSpeech, corpusTrace, termType, usageNote. Taken from ISO standard 30042:2008, Fig. 1, p. 9]

TBX variant for lexicon acquired from comparable corpora
Default TBX data categories:
– termType: entryTerm, variant
– externalCrossReference, usageNote
– partOfSpeech, frequency, reliabilityCode...
– transactionType, responsibility
+ Customized data categories:
– occurrences, occurrenceCount
– relatedTerm
– termDefinition, definitionRelevance
– ntigReference

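To make the data categories above more concrete, here is a rough Python sketch that builds a TBX-like term entry with one source term and several candidate translations. It is not the variant presented in the talk: the element layout (termEntry, langSet, tig, term, termNote, descrip) follows my understanding of TBX, and storing the alignment score in a reliabilityCode descrip is purely an assumption, as are the function names and the entry identifier.

```python
import xml.etree.ElementTree as ET

# Qualified name that ElementTree serializes as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def add_term(lang_set, term, term_type, score=None):
    """Append a <tig> holding one term and its notes to a <langSet>."""
    tig = ET.SubElement(lang_set, "tig")
    ET.SubElement(tig, "term").text = term
    ET.SubElement(tig, "termNote", {"type": "termType"}).text = term_type
    if score is not None:
        # Assumption: the alignment score is stored as a reliabilityCode descrip.
        ET.SubElement(tig, "descrip", {"type": "reliabilityCode"}).text = f"{score:.2f}"
    return tig

def build_entry(entry_id, source_lang, source_term, target_lang, candidates):
    """Build a <termEntry> with one source term and 1-to-n candidate translations."""
    entry = ET.Element("termEntry", {"id": entry_id})
    source_set = ET.SubElement(entry, "langSet", {XML_LANG: source_lang})
    add_term(source_set, source_term, "entryTerm")
    target_set = ET.SubElement(entry, "langSet", {XML_LANG: target_lang})
    for score, term in candidates:
        add_term(target_set, term, "entryTerm", score)
    return entry

# Candidates and scores taken from the "mastectomy" example earlier in the talk.
entry = build_entry(
    "e1", "en", "mastectomy", "fr",
    [(0.92, "ablation"), (0.89, "mastectomie"), (0.48, "opération")],
)
xml_text = ET.tostring(entry, encoding="unicode")
```

Serializing the entry this way puts all candidate translations in the same entry as the source term, which is the difficulty raised in the feedback on TBX: the term and its properties are hard to separate from its alignments.
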
TBX variant: a term entry

TBX variant: 1-to-n alignments

TBX variant: approved alignment

Feedback on TBX
TBX is made for stable terminologies with little uncertainty about the status of translations, not for machine-generated lexicons of “candidate translations”:
– difficult to separate a term and its properties from its alignments
– no data category specific to automatically estimated reliability
Difficult to make text-driven, term-oriented knowledge fit in a concept-oriented format:
– no definition category that would apply to a single term and not to the whole concept

Conclusion
Future work

Future work
– Integration of prototype in Libellex
– TBX import / export
– Editing of linguistic properties
– User testing (ergonomics)
– Evaluation of added value for translation
– Explore new ways of:
    aligning terms
    selecting contexts

References
– Post-edition prototype online: http://80.82.238.151/Metricc/InterfaceValidation/ (user “test”, no password)
– Metricc project: http://www.metricc.com/
– Lingua et Machina: http://www.lingua-et-machina.com/
– Comparable corpora: Déjean, H., Gaussier, É. (2002): “Une nouvelle approche à l'extraction de lexiques bilingues à partir de corpus comparables”, in Lexicometrica, Alignement lexical dans les corpus multilingues, pp. 1-22.
– ArayaTermExtractor: http://www.heartsome.de
– Xerox Terminology Suite: http://www.temis.com/
– iView: Nyström, M., Merkel, M., Ahrenberg, L., Zweigenbaum, P., Petersson, H. and Åhlfeldt, H. (2006): “Creating a medical English-Swedish dictionary using interactive word alignment”, in BMC Medical Informatics and Decision Making, 2006, pp. 6-35.
– TMF: ISO 16642 - Terminological markup framework
– TBX: ISO 30042 - Systems to manage terminology, knowledge and content - TermBase eXchange (TBX)
– Data categories: ISO 12620 - Terminology and other language and content resources - Specification of data categories and management of a Data Category Registry for language resources
