EVALUATION IN USE:
NAVIGATING THE MT ENGINE LANDSCAPE
WITH THE INTENTO EVALUATION PLATFORM
Konstantin Savenkov

CEO Intento, Inc.
© Intento, Inc.
SATT2018
School of Advanced Technologies for Translators
September 2018 - Milano (Italy)
Intento
AGENDA
1 MT USAGE AND EVALUATION
2 AVAILABLE MT SOLUTIONS
3 EVALUATION IN USE
4 PRACTICAL TIPS
2© Intento, Inc. / September 2018
Intento
1 MT USAGE AND EVALUATION
USE CASES
Language Service Providers

- to improve turnaround and reduce costs
- to provide MT-first services
—

Translation Buyers

- to optimize the MT part of translation projects
- to check whether MT was used by the vendor
—

Individual Translators

- to automate mundane work
3© Intento, Inc. / September 2018
Intento
WHY EVALUATE (I) ?
Building your own MT solution is hard

NMT requires a critical mass of talent, data and time
—

Many good off-the-shelf solutions

Stock MT engines with domain adaptation are much cheaper and often
better than custom-built
—

Huge case-by-case difference in quality, and also in price

Quality differs by up to 4 times, price by up to 200 (!) times
—

Evaluation is expensive and time-consuming

Different APIs complicate MT evaluation
4© Intento, Inc. / September 2018
Intento
WHY EVALUATE (II) ?
A proper MT engine improves speed and quality, and reduces costs

—

Translation buyers are starting to demand transparency in the MT choice
5© Intento, Inc. / September 2018
Intento
2 AVAILABLE MT SOLUTIONS
6
Alibaba Cloud
stock
Amazon
stock
Baidu
stock, custom
DeepL
stock
Google Cloud
stock, custom
Globalese
stock, custom
GTCom
stock
IBM Watson
stock, custom
KantanMT
custom
Lilt
stock, custom
Microsoft
stock, custom
ModernMT
stock, custom
Omniscien
custom
PangeaMT
custom
Prompsit
custom
PROMT
stock, custom
SAP
stock
SDL
stock, custom
Slate
custom
Systran
stock, custom
Tencent Cloud
stock
Tilde
custom
Yandex
stock
Youdao
stock
© Intento, Inc. / September 2018
Intento
WHAT TO LOOK AT
Technology: RBMT, SMT, NMT, Hybrid

—

Customization level: stock, custom, domain-adaptive

—

Data Protection: commercial vs. free

—

Deployment: cloud vs. on-premise

—

Price: total cost of ownership

—

Performance: absolute vs. relative
7© Intento, Inc. / September 2018
Intento
PERFORMANCE EVALUATION
Linguistic analysis

+ no reference translation required

+ identifies all types of errors, shows the absolute quality

- labor-intensive

- too expensive and slow to run regular statistically significant tests

—

Reference-based scores
+ quick and cheap

+ statistically significant sample size

- requires a reference translation

- shows only distance from the reference

- indicates only a relative measure of quality

—

They are complementary

1. Identify a group of candidate engines using reference-based metrics

2. Use the sentence-level scores to find segments with the most difference

3. Run linguistic analysis on the important segments
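To make step 2 concrete, here is a minimal sketch (illustrative only, not Intento's implementation) that ranks segments by how much the candidate engines disagree, given per-segment reference-based scores that have already been computed:

```python
# Illustrative sketch only: given per-segment reference-based scores for
# several candidate engines (aligned lists), rank segments by the spread
# between the best- and worst-scoring engine, so that linguistic analysis
# can focus on the segments where the engines differ the most.
from typing import Dict, List, Tuple

def most_divergent_segments(
    scores: Dict[str, List[float]],   # engine name -> per-segment scores
    top_n: int = 50,
) -> List[Tuple[int, float]]:
    n_segments = len(next(iter(scores.values())))
    spreads = [
        (i, max(s[i] for s in scores.values()) - min(s[i] for s in scores.values()))
        for i in range(n_segments)
    ]
    return sorted(spreads, key=lambda item: item[1], reverse=True)[:top_n]

# Example with made-up scores for three engines over four segments:
example = {
    "engine_a": [0.80, 0.31, 0.63, 0.57],
    "engine_b": [0.78, 0.70, 0.10, 0.55],
    "engine_c": [0.79, 0.35, 0.60, 0.02],
}
print(most_divergent_segments(example, top_n=2))  # segments 3 and 2 disagree most
```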
8© Intento, Inc. / September 2018
Intento
EVALUATION IN USE
SMALL PEMT PROJECTS
1. Get a list of stock MT engines for your language pair
—

2. Select 4-5 candidate MT engines
—

3a. Enable them in your CAT tool
—

3b. Translate everything with 4-5 candidate engines and
upload as a TM in your CAT tool
—

4. Choose per segment as you translate
9
© Intento, Inc. / September 2018
Intento
EVALUATION IN USE
MEDIUM / ONGOING / MT-FIRST
1. Prepare a reference translation (1,500-2,000 segments)
—

2. Get a list of stock MT engines for your language pair
—

3. Translate the sample with the appropriate engines
—

4. Calculate a reference-based score for the MT results
—

5. Evaluate top-performing engines manually
—

6. Translate everything with the winning engine
10
© Intento, Inc. / September 2018
Intento
EVALUATION IN USE
LARGE PROJECTS / MT ONLY
1. Evaluate stock engines and get a baseline quality score
—

2. Prepare a term base and a domain adaptation corpus (starting from 10K segments)
—

3. Train appropriate custom NMT engines
—

4. Evaluate custom NMT to see if it works better than stock MT
—

5. Update and re-evaluate the winning model as you collect
more post-edited content
11
© Intento, Inc. / September 2018
Intento
4 PRACTICAL TIPS
4.1 LANGUAGE SUPPORT
4.2 CHOOSING CANDIDATE MT ENGINES
4.3 TRANSLATING PROJECT WITH MT
4.4 REFERENCE-BASED SCORING
4.5 MANUAL EVALUATION
4.6 TRAINING & USING CUSTOM NMT
12© Intento, Inc. / September 2018
Intento
4.1 Language Support
13
All stock engines combined support 13,098 language pairs out of 29,070 possible (45%); 29,070 is presumably all ordered pairs of 171 languages (171 × 170).
* https://w3techs.com/technologies/overview/content_language/all
© Intento, Inc. / September 2018
Intento
Language Support by MT engines
14
[Bar chart: total and unique language pairs supported by each MT engine (log scale). Engines shown: Google, Yandex, Microsoft NMT, Microsoft SMT, Baidu, Tencent, Systran, Systran PNMT, PROMT, SDL Language Cloud, Youdao, SAP, ModernMT, DeepL, IBM NMT, Amazon, IBM SMT, Alibaba, GTCom. Data labels range from 2 to 10,712 pairs. Legend: Total / Unique.]
https://bit.ly/mt_jul2018
© Intento, Inc. / September 2018
Intento
Up-to-date language support is often accessible only via the API
—

The Intento Command-Line Interface provides it in a more human-friendly form:
> node index.js --key=<INTENTO_KEY> --intent=translate.providers --from=en --to=pt
API response:
ai.text.translate.promt.cloud_api.1-0
ai.text.translate.amazon.translate
ai.text.translate.baidu.translate_api
ai.text.translate.ibm-language-translator-v3
ai.text.translate.yandex.translate_api.1-5
ai.text.translate.google.translate_api.2-0
…
Language Support by MT engines
15
https://bit.ly/intento_cli
You can also filter by HTML format support, language detection, etc.
© Intento, Inc. / September 2018
Intento
4.2 Choosing Candidate MT Engines
16
Evaluate all appropriate engines on a domain-specific corpus (more on that later)
OR

Use Intento public evaluation reports
https://bit.ly/mt_jul2018
© Intento, Inc. / September 2018
Intento
Intento MT Evaluation Report
Stock Engines Evaluated
* We have evaluated general-purpose Cloud Machine Translation services with prebuilt translation models, provided via API. Some vendors also provide web-based, on-premise or custom MT engines, which may differ in all aspects from what we’ve evaluated.
17© Intento, Inc. / September 2018
Intento
Intento MT Evaluation Report
Language Pairs
18
Focus on popular language pairs (en-P1 and P1-en)
Partially P1-P1, en-P2, P2-en
© Intento, Inc. / September 2018
Intento
WMT-2013 (translation task, news domain)
en-es, es-en
WMT-2015 (translation task, news domain)
fr-en, en-fr
WMT-2016 (translation task, news domain)
cs-en, en-cs, de-en, en-de, ro-en, en-ro, fi-en, en-fi, ru-en, en-ru, tr-en, en-tr
WMT-2017 (translation task, news domain)
zh-en, en-zh
NewsCommentary-2011
en-ja, ja-en, en-pt, pt-en, en-it, it-en, ru-de, de-ru, ru-es, ru-fr, ru-pt, ja-fr, de-ja, es-zh, fr-ru, fr-es, it-pt, zh-it, en-ar, ar-en, en-nl, nl-en, fr-de, de-fr, de-it, it-de, ja-zh, zh-ja
Tatoeba
en-ko, ko-en
19
Intento MT Evaluation Report
Datasets
© Intento, Inc. / September 2018
Intento
https://www.slideshare.net/AaronHanLiFeng/lepor-an-augmented-machine-translation-evaluation-metric-thesis-ppt
https://github.com/aaronlifenghan/aaron-project-lepor
20
Intento MT Evaluation Report
hLEPOR Score
LEPOR: an automatic machine translation evaluation metric considering the enhanced Length Penalty, n-gram Position difference Penalty, and Recall
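For orientation only, a rough sketch of the metric's structure as described in the linked references; the exact parameterization, n-gram settings and default weights are defined there:

```latex
% Rough structure of (h)LEPOR (see the links above for the authoritative definition).
% c, r: candidate and reference lengths; NPD: normalized n-gram position difference;
% HPR: harmonic mean of precision and recall; w_*: tunable weights.
LP =
\begin{cases}
  e^{\,1 - r/c} & c < r\\
  1             & c = r\\
  e^{\,1 - c/r} & c > r
\end{cases}
\qquad
NPosPenal = e^{-NPD}
\qquad
\mathrm{hLEPOR} =
\frac{w_{LP} + w_{NPos} + w_{HPR}}
     {\frac{w_{LP}}{LP} + \frac{w_{NPos}}{NPosPenal} + \frac{w_{HPR}}{HPR}}
```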
© Intento, Inc. / September 2018
Intento
21
Intento Evaluation Report: MT Prices
USD / 1M characters
© Intento, Inc. / September 2018
Intento
22
Intento Evaluation Report
Available Quality
[Image]
Maximal Available Quality: >80 %, 70 %, 60 %, 50 %, 40 %, <40 %
Price: $$$ = ≥$20, $$ = $10-15, $ = <$10
© Intento, Inc. / September 2018
Intento
23
Intento Evaluation Report
Best MT Engines
Best quality for each pair
[Image]
google
deepl
amazon
yandex
ibm-nmt
promt
msft-nmt
ibm-smt
tencent
© Intento, Inc. / September 2018
Intento
24
Intento Evaluation Report
Optimal MT Engines
Best price among top-5%
[Image]
msft-nmt
yandex
msft-smt
baidu
google
amazon
ibm-nmt
promt
ibm-smt
© Intento, Inc. / September 2018
Intento
25
[Image]
optimal
best
top-5%
Intento Evaluation Report
Optimal MT Engines
© Intento, Inc. / September 2018
Intento
26
Intento Evaluation Report
Candidate Best Engines
Intento
© Intento, Inc. / September 2018
Intento
4.3 Translating a Project with MT
27
Web demos and CAT integrations do not work well when you need
to translate a bunch of files
—

The Intento Command-Line Interface provides some help:
> node index.js --key=$INTENTO_API_KEY --to=fr --async --bulk --provider=ai.text.translate.google.translate_api.2-0 --input=large_sample.txt --output=large_sample_results.txt
…
https://bit.ly/intento_cli
© Intento, Inc. / September 2018
Intento
Behind the curtains
The file is segmented with the NLTK framework, plus some tweaks
—

The segments are packed into chunks according to the limits of the specific MT engine
—

The chunks are sent for translation concurrently, respecting the requests-per-second quota, with retries in case of sporadic errors, etc.
—

The results are merged together and saved to a file
—

Works for text and CSV formats, and for HTML (if supported by the MT engine)
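A minimal sketch of the first two steps (sentence segmentation and chunk packing); this is illustrative only, not the CLI's actual code, and assumes NLTK with its punkt model:

```python
# Illustrative sketch (not the Intento CLI's actual code): split a file into
# sentences with NLTK, then pack the sentences into chunks that stay under a
# per-request character limit of a hypothetical MT engine.
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)  # sentence tokenizer model

def pack_into_chunks(text: str, max_chars: int = 5000, lang: str = "english"):
    segments = sent_tokenize(text, language=lang)
    chunks, current, current_len = [], [], 0
    for seg in segments:
        if current and current_len + len(seg) > max_chars:
            chunks.append(current)
            current, current_len = [], 0
        current.append(seg)
        current_len += len(seg)
    if current:
        chunks.append(current)
    return chunks  # each chunk becomes one request (sent concurrently, with retries)

with open("large_sample.txt", encoding="utf-8") as f:
    chunks = pack_into_chunks(f.read())
print(f"{sum(len(c) for c in chunks)} segments packed into {len(chunks)} requests")
```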
28© Intento, Inc. / September 2018
Intento
4.4 Reference-Based Scoring
29
hLEPOR, BLEU, TER, RIBES scores
—

Download tools from GitHub or use the Intento API:
https://github.com/intento/intento-api/blob/master/score.md
https://bit.ly/intento_cli
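If you go the do-it-yourself route, a minimal sketch is shown below; it assumes the sacrebleu Python package as a stand-in for the tools and API listed above (sacrebleu covers BLEU, chrF and TER, but not hLEPOR or RIBES):

```python
# Minimal sketch, assuming a recent version of the `sacrebleu` package
# (not the Intento API): score one engine's output against a reference,
# at corpus level.
import sacrebleu

def score_engine(hyp_path: str, ref_path: str) -> dict:
    with open(hyp_path, encoding="utf-8") as f:
        hyps = [line.strip() for line in f]
    with open(ref_path, encoding="utf-8") as f:
        refs = [line.strip() for line in f]
    assert len(hyps) == len(refs), "hypothesis and reference files must be aligned"
    # sacrebleu expects a list of reference streams; here there is a single reference
    return {
        "BLEU": sacrebleu.corpus_bleu(hyps, [refs]).score,
        "chrF": sacrebleu.corpus_chrf(hyps, [refs]).score,
        "TER": sacrebleu.corpus_ter(hyps, [refs]).score,
    }

# Hypothetical file names: one output file per candidate engine
print(score_engine("engine_a_output.txt", "reference.txt"))
```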
© Intento, Inc. / September 2018
Intento
4.5 Manual Evaluation
30
MQM-DQF: expensive, ~$50 per 1,000-1,500 words
—

Or just make a quick visual assessment:
—

Is MT good enough for this project?

- segments with similarly high reference-based scores (how good it may get)
- segments with similarly low reference-based scores (which segments does no engine handle?)
—

If yes, which engine to pick?

- segments with low but different scores (shows NMT “quirks”)
- segments with high but different scores (shows the actionable difference)
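The selection itself can be scripted; below is an illustrative sketch (not an Intento feature) that buckets segments into the four categories above from per-segment scores, with arbitrary thresholds you would tune per metric and project:

```python
# Illustrative only: bucket segments into the four categories above, given
# per-segment reference-based scores for each candidate engine (aligned lists).
# The thresholds are arbitrary placeholders.
from statistics import mean

def bucket_segments(scores, high=0.6, low=0.3, max_spread=0.2):
    engines = list(scores)
    n_segments = len(scores[engines[0]])
    buckets = {"similar_high": [], "similar_low": [],
               "divergent_low": [], "divergent_high": []}
    for i in range(n_segments):
        vals = [scores[e][i] for e in engines]
        spread, avg = max(vals) - min(vals), mean(vals)
        if spread <= max_spread and avg >= high:
            buckets["similar_high"].append(i)    # how good it may get
        elif spread <= max_spread and avg <= low:
            buckets["similar_low"].append(i)     # segments no engine handles
        elif spread > max_spread and avg <= low:
            buckets["divergent_low"].append(i)   # NMT "quirks"
        elif spread > max_spread and avg >= high:
            buckets["divergent_high"].append(i)  # actionable difference
    return buckets
```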
© Intento, Inc. / September 2018
Intento
Where to look
Extremely low scores for some engines
31
source | I noodle battono gli spaghetti. | hLEPOR
reference | Noodles are beating spaghetti. | 1
5 MT engines | Noodles beat spaghetti. | 0.57
2 MT engines | The noodle beat the spaghetti. / The noodles beat the spaghetti. | 0.31
1 MT engine | The noodles beat the noodles | 0.31
1 MT engine | The noodle they beat the spaghettis. | 0
1 MT engine | The noodles are noodles. | 0.63
3 MT engines | Flying spaghetti noodles / The noodle flying spaghetti. | 0-0.31
1 MT engine | The noodles are spaghetti | 0.80
output from top candidate engines
© Intento, Inc. / September 2018
Intento
MT quirks
Extremely low scores for some engines
32
source | I noodle battono gli spaghetti. | hLEPOR
reference | Noodles are beating spaghetti. | 1
5 MT engines | Noodles beat spaghetti. | 0.57
2 MT engines | The noodle beat the spaghetti. / The noodles beat the spaghetti. | 0.31
1 MT engine | The noodles beat the noodles | 0.31
1 MT engine | The noodle they beat the spaghettis. | 0
1 MT engine | The noodles are noodles. | 0.63
3 MT engines | Flying spaghetti noodles / The noodle flying spaghetti. | 0-0.31
1 MT engine | The noodles are spaghetti | 0.80
1. Both NMT and reference-based metrics are not very good at short sentences.
2. This manual check shows that one of the overall good engines is not good at synonyms.
© Intento, Inc. / September 2018
Intento
How “good” can it get?
Equally high scores among the engines
33
source | Il governo britannico non tollererà un'altra grande festa.Così le decisioni prese con la legge del 2 agosto andranno a ridurre il deficit dalla parte “discrezionale non collegata alla difesa” del budget federale, che equivale solo al 10% del bilancio totale. | hLEPOR
reference | So the structure established by the August 2 law concentrates deficit reduction on the “discretionary non-defense” part of the federal budget, which is only about 10% of it. | 1
1. Amazon | Thus, the decisions taken under the Act of 2 August will reduce the deficit from the “discretionary non-defence-related” part of the federal budget, which amounts to only 10% of the total budget. | 0.63
2. DeepL | Thus the decisions taken with the law of August 2 will reduce the deficit from the "discretionary, non-defensive" part of the federal budget, which amounts to only 10% of the total budget. | 0.63
3. Google | So the decisions made with the law of August 2 will reduce the deficit from the "discretionary not related to the defense" part of the federal budget, which is equivalent to only 10% of the total budget. | 0.66
4. IBM NMT | Thus the decisions taken by the Act of 2 August will reduce the deficit from the "discretionary unconnected to defence" part of the federal budget, which is equivalent to only 10% of the total budget. | 0.65
© Intento, Inc. / September 2018
Intento
How “bad” can it get?
Equally low scores among the engines
34
source | Al momento, diverse aziende sembrano seguire procedimenti simili. | hLEPOR
reference | There are hints of firms responding similarly now. | 1
1. Amazon | At the moment, several companies seem to be following similar procedures. | 0
2. DeepL | At the moment, several companies seem to be following similar procedures. | 0
3. Google | At the moment, several companies seem to follow similar procedures. | 0
4. IBM NMT | At present, several companies appear to follow similar procedures. | 0
© Intento, Inc. / September 2018
Intento
How “bad” can it get?
Equally low scores among the engines
35
source | Al momento, diverse aziende sembrano seguire procedimenti simili. | hLEPOR
reference | There are hints of firms responding similarly now. | 1
1. Amazon | At the moment, several companies seem to be following similar procedures. | 0
2. DeepL | At the moment, several companies seem to be following similar procedures. | 0
3. Google | At the moment, several companies seem to follow similar procedures. | 0
4. IBM NMT | At present, several companies appear to follow similar procedures. | 0
1. If all engines agree on something different from the reference, it’s likely to be fine.
© Intento, Inc. / September 2018
Intento
Actionable difference
Different high scores
36
source | Si sono commessi grandi errori, alimentando ulteriori violenze. | hLEPOR
reference | Big mistakes were made, fueling further violence. | 1
1. Amazon | Big mistakes have been made, fuelling further violence. | 0.75
2. DeepL | Major mistakes have been made, fuelling further violence. | 0.64
3. Google | Great mistakes have been made, fueling further violence. | 0.75
4. IBM NMT | There have been major errors, fuelling further violence. | 0.35
© Intento, Inc. / September 2018
Intento
(N)MT quirks
37
“It’s not a story the Jedi
would tell you.”
“”
“Star Wars franchise
is overrated”
A good NMT engine, in
top-5% for a couple of pairs
© Intento, Inc. / September 2018
Intento
Otherwise a good NMT from
a famous brand
(N)MT quirks
38
“Unisex Nylon Laptop
Backpack School Travel
Rucksack Satchel
Shoulder Bag”
“рюкзак”
“Author is an idiot.
I will fix it!”
(a backpack)
© Intento, Inc. / September 2018
Intento
(N)MT quirks
39
“hello, world!” “”
“Are you kidding
me?!”
Good new NMT engine, best
at some language pairs
© Intento, Inc. / September 2018
Intento
4.6 Training and Using Custom NMT
40
Domain-adaptive engines with public pricing: Globalese, Microsoft Custom Translator, IBM Custom NMT, Google AutoML
—

In-domain corpora (starting from 10K segments) and/or a glossary
—

Makes sense as long as it performs better than stock MT
—

Different pricing models: training, translation, maintenance
© Intento, Inc. / September 2018
THANK YOU!
Konstantin Savenkov

ks@inten.to
(415) 429-0021
2150 Shattuck Ave
Berkeley CA 94705
41
