Active Curation of Bi-Text Resources in Commercial Localization Workflows
Dave Lewis (TCD), Andrzej Zydroń (XTM International)
•  Open Data on the Web: W3C Semantic Web standards allow data to be published on the Web
– Fine-grained URI-based inter-linking
– Extensible meta-data
– Standard Query APIs
•  Enables a Localization Web
– Terms and translations become linkable resources
– Meta-data from L10n workflows adds value
– Leverage in training Machine Translation and Automatic Term Extraction
The Localization Web
The Localization Web = Decentralised Annotated Global Translation Memory and Term Base
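To make "terms and translations become linkable resources" concrete, here is a minimal sketch in plain Python. The `http://example.org/l10n/` namespace, the `ex:` predicate names and both helper functions are hypothetical, chosen only for illustration; a real deployment would presumably use published vocabularies and an RDF library.

```python
from urllib.parse import quote

BASE = "http://example.org/l10n/"  # hypothetical namespace, not from the slides


def term_uri(lang: str, text: str) -> str:
    """Mint a URI for a term so other datasets can link to it."""
    return f"{BASE}term/{lang}/{quote(text.lower().replace(' ', '-'))}"


def translation_triples(src, tgt, metadata):
    """Describe one term translation as (subject, predicate, object) triples.

    src and tgt are (lang, text) pairs; metadata holds workflow facts.
    The ex: predicates are placeholders, not a real vocabulary.
    """
    s, t = term_uri(*src), term_uri(*tgt)
    triples = [
        (s, "ex:label", src[1]),
        (t, "ex:label", tgt[1]),
        (s, "ex:translation", t),
    ]
    # In this sketch, workflow meta-data is attached to the source term.
    triples += [(s, f"ex:{key}", value) for key, value in sorted(metadata.items())]
    return triples


triples = translation_triples(
    ("en", "production capacity"),
    ("fr", "capacité de production"),
    {"status": "validated"},
)
for triple in triples:
    print(triple)
```

Because each term has its own URI, a third party could link to it or attach further meta-data without needing access to the originating translation memory.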
Web of Multilingual Content
Domain Terminology
•  Rich word and phrase resources to assist translators
Babelfy: Public Lexical Resources
•  Translation suggestions can be fed into MT for more reliable translation
Links to BabelNet offer suggestions for Definitions and Translations
Babelfy & BabelNet offer more term suggestions
•  Public resources may not always yield the right definitions or translations for the context
•  Need to track human validation/rejection to train automatic term extraction
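One way to track human validation and rejection is to log each decision as a small record and derive labelled examples from the log. The sketch below is an assumption about how such a log might look; the `Decision` fields, the function names and the sample rows are illustrative and do not describe any XTM or BabelNet data format.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    source_term: str
    suggestion: str
    resource: str   # where the suggestion came from
    accepted: bool  # True if a human validated it, False if rejected


def training_examples(decisions):
    """Turn decisions into (source, suggestion, label) rows; 1 = accepted."""
    return [(d.source_term, d.suggestion, int(d.accepted)) for d in decisions]


def acceptance_rate(decisions):
    """Share of accepted suggestions per resource."""
    total, accepted = Counter(), Counter()
    for d in decisions:
        total[d.resource] += 1
        accepted[d.resource] += d.accepted
    return {resource: accepted[resource] / total[resource] for resource in total}


# Illustrative rows only, loosely following the chest freezer example.
log = [
    Decision("production capacity", "capacité de production", "termbase", True),
    Decision("chest freezer", "réfrigérateur", "BabelNet", False),
    Decision("chest freezer", "congélateur coffre", "posteditor", True),
]
print(training_examples(log))
print(acceptance_rate(log))
```

Keeping rejections as well as acceptances matters here: the rejected rows are the negative examples a term extractor would otherwise never see.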
Active Curation of Linked Language Resources
[Workflow diagram; in the original, ✔, ✗ and ? marks and PE (postedit) labels annotate the individual steps]
•  Source segment
– The company has also reduced its production capacity by ceasing manufacture of chest freezers and freestanding microwave ovens
•  Extraction & Segmentation
•  Annotation with Existing Terms
– production capacity / capacité de production ✔
•  Auto suggestion from Babelfy/BabelNet
– chest freezer / réfrigérateur ?
– microwave oven / four à micro-onde ?
•  Machine Translate with Term Translations (MT Vendor?)
– D'autre part, la société a réduit sa capacité de production en arrêtant la production de réfrigérateur et de fours micro-onde pose-libre
•  Postedit and capture terms in context
– D'autre part, la société a réduit sa capacité de production en arrêtant la production de congélateurs coffres et de fours micro-ondes pose-libre
– congélateurs coffres ✔
– fours micro-ondes ✔
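The "Annotation with Existing Terms" step in the workflow above can be sketched as a longest-match lookup of term-base entries in a source segment. This is a simplified illustration under stated assumptions: the term base is a plain dictionary, matching is case-insensitive substring search, and `annotate_terms` is a hypothetical helper rather than part of any tool named in the slides.

```python
def annotate_terms(segment, term_base):
    """Find term-base entries in a segment.

    Returns (start, end, source_term, target_term) tuples in text order.
    Longer terms are matched first and overlapping matches are skipped.
    """
    lowered = segment.lower()
    taken = [False] * len(segment)
    hits = []
    for term in sorted(term_base, key=len, reverse=True):
        needle = term.lower()
        start = lowered.find(needle)
        while start != -1:
            end = start + len(needle)
            if not any(taken[start:end]):
                hits.append((start, end, term, term_base[term]))
                taken[start:end] = [True] * (end - start)
            start = lowered.find(needle, end)
    return sorted(hits)


segment = (
    "The company has also reduced its production capacity by ceasing "
    "manufacture of chest freezers and freestanding microwave ovens"
)
term_base = {"production capacity": "capacité de production"}
print(annotate_terms(segment, term_base))

# After postediting, a captured term can be added and found on the next pass.
term_base["chest freezer"] = "congélateur coffre"
print(annotate_terms(segment, term_base))
```

Terms with no term-base hit are the ones that would go to automatic suggestion and, later, to the posteditor for validation.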
•  CSV on the Web: tables and JSON meta-data
•  JSON-Linked Data
•  Provenance Vocabulary
•  Data Catalogue
•  Open Annotation
•  ITS2.0 Vocabulary
•  Also:
– Provenance Plan
– Open Digital Rights Language
Linked Data Based on W3C Standards
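As a rough illustration of how these vocabularies might combine, the sketch below builds a JSON-LD document that uses Open Annotation terms to attach a term to a segment and PROV terms to record who created the annotation and when. The `oa:` and `prov:` namespaces and property names come from the W3C vocabularies; all `example.org` URIs and the `term_annotation` helper are hypothetical, and the slides do not specify the actual data model.

```python
import json


def term_annotation(segment_uri, term_uri, posteditor_uri, timestamp):
    """Build a JSON-LD annotation linking a term to a segment, with provenance."""
    return {
        "@context": {
            "oa": "http://www.w3.org/ns/oa#",
            "prov": "http://www.w3.org/ns/prov#",
        },
        "@type": "oa:Annotation",
        "oa:hasTarget": {"@id": segment_uri},    # the annotated segment
        "oa:hasBody": {"@id": term_uri},         # the term identified in it
        "prov:wasAttributedTo": {"@id": posteditor_uri},
        "prov:generatedAtTime": timestamp,
    }


# All URIs below are placeholders for illustration.
doc = term_annotation(
    "http://example.org/l10n/segment/42",
    "http://example.org/l10n/term/fr/congelateur-coffre",
    "http://example.org/l10n/agent/posteditor-7",
    "2015-06-01T10:00:00Z",
)
print(json.dumps(doc, indent=2, ensure_ascii=False))
```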
Language Resources
Language Workers
Language Technology
Language Lifecycle Dependencies
Parallel Text & Term base
Posteditors
Machine Translation
Active Curation: Dynamic MT Retraining
•  Tighten curation cycle: from projects to segments
– Prioritise postedits for retraining
•  Prioritise Term Identification by posteditors
•  Assemble MT-ready, lexically-rich term bases
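The slides do not say how postedits are prioritised for retraining. One plausible criterion, assumed here purely for illustration, is to rank segments by how much the posteditor changed the MT output, on the theory that heavily edited segments carry the most new information. The sketch uses token-level similarity from Python's standard `difflib`; `edit_effort` and `prioritise` are hypothetical names.

```python
from difflib import SequenceMatcher


def edit_effort(mt_output, postedit):
    """Rough edit effort: 0.0 if unchanged, approaching 1.0 if fully rewritten."""
    matcher = SequenceMatcher(None, mt_output.split(), postedit.split())
    return 1.0 - matcher.ratio()


def prioritise(pairs, limit=None):
    """Order (mt_output, postedit) pairs so the most heavily edited come first."""
    ranked = sorted(pairs, key=lambda pair: edit_effort(*pair), reverse=True)
    return ranked[:limit]


pairs = [
    ("la production de réfrigérateur", "la production de congélateurs coffres"),
    ("sa capacité de production", "sa capacité de production"),
]
for mt_output, postedit in prioritise(pairs):
    print(round(edit_effort(mt_output, postedit), 2), "|", postedit)
```

Working at segment level rather than project level is what allows this kind of ranking to feed retraining continuously instead of once per project.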
•  TermWeb/XTM/DCU
•  Introducing Next Gen Machine Translation
•  Massive scale bilingual dictionaries
•  BabelNet
•  Automatic Term Extraction: forced decoding
•  Dynamic retraining
•  Optimal segment translation route
•  L3Data curation, sharing
Next Generation Machine Translation
Data Management Lifecycles
[Diagram of three linked lifecycles; stage labels as they appear in the original]
•  Lex-concept lifecycle, stages include: Publish, Correct & refine, Discover & use
•  Bitext lifecycle, stages include: Discover data, (Re)train MT, Revise and annotate, Publish
•  Content lifecycle, stages include: Publish, I18n & source QA, Trans QA, Post-edit, Automated translation, Consume, Create
•  Better in-context postediting:
– XTM-Easyling
•  Feeding term suggestions from posteditor to Terminology Management
– XTM-Interverbum
•  Dynamic Retraining
– XTM-DCU
•  Bilingual Dictionary SMT improvements
– XTM-DCU
•  NER, terminology enforcement, forced decoding
– XTM-Interverbum-DCU
•  Postediting prioritisation and term flagging
– TCD-DCU-XTM
•  Publishing interlinks of parallel text, lexically rich term bases
– TCD: DGT-TM, EuroVoc, SNOMED CT, LEMON, BabelNet
•  Closing the loop – operational instrumentation of postediting
– XTM
Systems Integration
