EVALUATION IN USE:
NAVIGATING THE MT ENGINE LANDSCAPE
WITH THE INTENTO EVALUATION PLATFORM
Konstantin Savenkov

CEO Intento, Inc.
© Intento, Inc.
SATT2018
School of Advanced Technologies for Translators
September 2018 - Milano (Italy)
Intento
AGENDA
1 MT USAGE AND EVALUATION
2 AVAILABLE MT SOLUTIONS
3 EVALUATION IN USE
4 PRACTICAL TIPS
2© Intento, Inc. / September 2018
Intento
1 MT USAGE AND EVALUATION
USE CASES
Language Service Providers

- to improve turnaround and reduce costs
- to provide MT-first services
—

Translation Buyers

- to optimize the MT part of translation projects
- to check whether MT was used by the vendor
—

Individual Translators

- to automate mundane work
3© Intento, Inc. / September 2018
Intento
WHY EVALUATE (I) ?
Building your own MT solution is hard

NMT requires a critical mass of talent, data and time
—

Many good off-the-shelf solutions

Stock MT engines with domain adaptation are much cheaper and often
better than custom-built
—

Huge case-by-case difference in quality, and also in price

Quality differs by up to 4 times, price by up to 200 (!) times
—

Evaluation is expensive and time-consuming

Different APIs complicate MT evaluation
4© Intento, Inc. / September 2018
Intento
WHY EVALUATE (II) ?
A proper MT engine improves speed and quality, and reduces costs

—

Translation buyers are starting to demand transparency in the MT choice
5© Intento, Inc. / September 2018
Intento
2 AVAILABLE MT SOLUTIONS
6
Alibaba Cloud
stock
Amazon
stock
Baidu
stock, custom
DeepL
stock
Google Cloud
stock, custom
Globalese
stock, custom
GTCom
stock
IBM Watson
stock, custom
KantanMT
custom
Lilt
stock, custom
Microsoft
stock, custom
ModernMT
stock, custom
Omniscien
custom
PangeaMT
custom
Prompsit
custom
PROMT
stock, custom
SAP
stock
SDL
stock, custom
Slate
custom
Systran
stock, custom
Tencent Cloud
stock
Tilde
custom
Yandex
stock
Youdao
stock
© Intento, Inc. / September 2018
Intento
WHAT TO LOOK AT
Technology: RBMT, SMT, NMT, Hybrid

—

Customization level: stock, custom, domain-adaptive

—

Data Protection: commercial vs. free

—

Deployment: cloud vs. on-premise

—

Price: total cost of ownership

—

Performance: absolute vs. relative
7© Intento, Inc. / September 2018
Intento
PERFORMANCE EVALUATION
Linguistic analysis

+ no reference translation required

+ identifies all types of errors, shows the absolute quality

- labor-intensive

- too expensive and slow to run regular statistically significant tests

—

Reference-based scores
+ quick and cheap

+ statistically significant sample size

- requires a reference translation

- shows only distance from the reference

- indicates only a relative measure of quality

—

They are complementary

1. Identify a group of candidate engines using reference-based metrics

2. Use the sentence-level scores to find segments with the most difference

3. Run linguistic analysis on the important segments
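To make step 2 concrete, here is a minimal sketch (illustrative only, not Intento's implementation) that ranks segments by how much the candidate engines disagree, given per-segment reference-based scores that have already been computed:

```python
# Illustrative sketch only: given per-segment reference-based scores for
# several candidate engines (aligned lists), rank segments by the spread
# between the best- and worst-scoring engine, so that linguistic analysis
# can focus on the segments where the engines differ the most.
from typing import Dict, List, Tuple

def most_divergent_segments(
    scores: Dict[str, List[float]],   # engine name -> per-segment scores
    top_n: int = 50,
) -> List[Tuple[int, float]]:
    n_segments = len(next(iter(scores.values())))
    spreads = [
        (i, max(s[i] for s in scores.values()) - min(s[i] for s in scores.values()))
        for i in range(n_segments)
    ]
    return sorted(spreads, key=lambda item: item[1], reverse=True)[:top_n]

# Example with made-up scores for three engines over four segments:
example = {
    "engine_a": [0.80, 0.31, 0.63, 0.57],
    "engine_b": [0.78, 0.70, 0.10, 0.55],
    "engine_c": [0.79, 0.35, 0.60, 0.02],
}
print(most_divergent_segments(example, top_n=2))  # segments 3 and 2 disagree most
```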
8© Intento, Inc. / September 2018
Intento
EVALUATION IN USE
SMALL PEMT PROJECTS
1. Get a list of stock MT engines for your language pair
—

2. Select 4-5 candidate MT engines
—

3a. Enable them in your CAT tool
—

3b. Translate everything with 4-5 candidate engines and
upload as a TM in your CAT tool
—

4. Choose per segment as you translate
9
© Intento, Inc. / September 2018
Intento
EVALUATION IN USE
MEDIUM / ONGOING / MT-FIRST
1. Prepare a reference translation (1,500-2,000 segments)
—

2. Get a list of stock MT engines for your language pair
—

3. Translate the sample with the appropriate engines
—

4. Calculate a reference-based score for the MT results
—

5. Evaluate top-performing engines manually
—

6. Translate everything with the winning engine
10
© Intento, Inc. / September 2018
Intento
EVALUATION IN USE
LARGE PROJECTS / MT ONLY
1. Evaluate stock engines and get a baseline quality score
—

2. Prepare a term base and a domain adaptation corpus (starting from 10K segments)
—

3. Train appropriate custom NMT engines
—

4. Evaluate custom NMT to see if it works better than stock MT
—

5. Update and re-evaluate the winning model as you collect
more post-edited content
11
© Intento, Inc. / September 2018
Intento
4 PRACTICAL TIPS
4.1 LANGUAGE SUPPORT
4.2 CHOOSING CANDIDATE MT ENGINES
4.3 TRANSLATING PROJECT WITH MT
4.4 REFERENCE-BASED SCORING
4.5 MANUAL EVALUATION
4.6 TRAINING & USING CUSTOM NMT
12© Intento, Inc. / September 2018
Intento
4.1 Language Support
13
All stock engines combined support 13,098 language pairs out of 29,070 possible (45%); 29,070 is presumably all ordered pairs of 171 languages (171 × 170).
* https://w3techs.com/technologies/overview/content_language/all
© Intento, Inc. / September 2018
Intento
Language Support by MT engines
14
[Bar chart: total and unique language pairs supported by each MT engine (log scale). Engines shown: Google, Yandex, Microsoft NMT, Microsoft SMT, Baidu, Tencent, Systran, Systran PNMT, PROMT, SDL Language Cloud, Youdao, SAP, ModernMT, DeepL, IBM NMT, Amazon, IBM SMT, Alibaba, GTCom. Data labels range from 2 to 10,712 pairs. Legend: Total / Unique.]
https://bit.ly/mt_jul2018
© Intento, Inc. / September 2018
Intento
Up-to-date language support is often accessible only via the API
—

The Intento Command-Line Interface provides it in a more human-friendly form:
> node index.js --key=<INTENTO_KEY> --intent=translate.providers --from=en --to=pt
API response:
ai.text.translate.promt.cloud_api.1-0
ai.text.translate.amazon.translate
ai.text.translate.baidu.translate_api
ai.text.translate.ibm-language-translator-v3
ai.text.translate.yandex.translate_api.1-5
ai.text.translate.google.translate_api.2-0
…
Language Support by MT engines
15
https://bit.ly/intento_cli
You can also filter by HTML format support, language detection, etc.
© Intento, Inc. / September 2018
Intento
4.2 Choosing Candidate MT Engines
16
Evaluate all appropriate engines on a domain-specific corpus (more on that later)
OR

Use Intento public evaluation reports
https://bit.ly/mt_jul2018
© Intento, Inc. / September 2018
Intento
Intento MT Evaluation Report
Stock Engines Evaluated
* We have evaluated general-purpose Cloud Machine Translation services with prebuilt translation models, provided via API. Some vendors also provide web-based, on-premise or custom MT engines, which may differ in all aspects from what we’ve evaluated.
17© Intento, Inc. / September 2018
Intento
Intento MT Evaluation Report
Language Pairs
18
Focus on popular language pairs (en-P1 and P1-en)
Partially P1-P1, en-P2, P2-en
© Intento, Inc. / September 2018
Intento
WMT-2013 (translation task, news domain)
en-es, es-en
WMT-2015 (translation task, news domain)
fr-en, en-fr
WMT-2016 (translation task, news domain)
cs-en, en-cs, de-en, en-de, ro-en, en-ro, fi-en, en-fi, ru-en, en-ru, tr-en, en-tr
WMT-2017 (translation task, news domain)
zh-en, en-zh
NewsCommentary-2011
en-ja, ja-en, en-pt, pt-en, en-it, it-en, ru-de, de-ru, ru-es, ru-fr, ru-pt, ja-fr, de-ja, es-zh, fr-ru, fr-es, it-pt, zh-it, en-ar, ar-en, en-nl, nl-en, fr-de, de-fr, de-it, it-de, ja-zh, zh-ja
Tatoeba
en-ko, ko-en
19
Intento MT Evaluation Report
Datasets
© Intento, Inc. / September 2018
Intento
https://www.slideshare.net/AaronHanLiFeng/lepor-an-augmented-machine-translation-evaluation-metric-thesis-ppt
https://github.com/aaronlifenghan/aaron-project-lepor
20
Intento MT Evaluation Report
hLEPOR Score
LEPOR: an automatic machine translation evaluation metric considering the enhanced Length Penalty, n-gram Position difference Penalty, and Recall
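For orientation only, a rough sketch of the metric's structure as described in the linked references; the exact parameterization, n-gram settings and default weights are defined there:

```latex
% Rough structure of (h)LEPOR (see the links above for the authoritative definition).
% c, r: candidate and reference lengths; NPD: normalized n-gram position difference;
% HPR: harmonic mean of precision and recall; w_*: tunable weights.
LP =
\begin{cases}
  e^{\,1 - r/c} & c < r\\
  1             & c = r\\
  e^{\,1 - c/r} & c > r
\end{cases}
\qquad
NPosPenal = e^{-NPD}
\qquad
\mathrm{hLEPOR} =
\frac{w_{LP} + w_{NPos} + w_{HPR}}
     {\frac{w_{LP}}{LP} + \frac{w_{NPos}}{NPosPenal} + \frac{w_{HPR}}{HPR}}
```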
© Intento, Inc. / September 2018
Intento
21
Intento Evaluation Report: MT Prices
USD / 1M characters
© Intento, Inc. / September 2018
Intento
22
Intento Evaluation Report
Available Quality
[Image]
Maximal Available Quality: >80 %, 70 %, 60 %, 50 %, 40 %, <40 %
Price: $$$ = ≥$20, $$ = $10-15, $ = <$10
© Intento, Inc. / September 2018
Intento
23
Intento Evaluation Report
Best MT Engines
Best quality for each pair
[Image]
google
deepl
amazon
yandex
ibm-nmt
promt
msft-nmt
ibm-smt
tencent
© Intento, Inc. / September 2018
Intento
24
Intento Evaluation Report
Optimal MT Engines
Best price among top-5%
[Image]
msft-nmt
yandex
msft-smt
baidu
google
amazon
ibm-nmt
promt
ibm-smt
© Intento, Inc. / September 2018
Intento
25
[Image]
optimal
best
top-5%
Intento Evaluation Report
Optimal MT Engines
© Intento, Inc. / September 2018
Intento
26
Intento Evaluation Report
Candidate Best Engines
Intento
© Intento, Inc. / September 2018
Intento
4.3 Translating a Project with MT
27
Web demos and CAT integrations do not work well when you need
to translate a bunch of files
—

The Intento Command-Line Interface provides some help:
> node index.js --key=$INTENTO_API_KEY --to=fr --async --bulk --provider=ai.text.translate.google.translate_api.2-0 --input=large_sample.txt --output=large_sample_results.txt
…
https://bit.ly/intento_cli
© Intento, Inc. / September 2018
Intento
Behind the curtains
The file is segmented with the NLTK framework, plus some tweaks
—

The segments are packed into chunks according to the limits of the specific MT engine
—

The chunks are sent for translation concurrently, respecting the requests-per-second quota, with retries in case of sporadic errors, etc.
—

The results are merged together and saved to a file
—

Works for text and CSV formats, and for HTML (if supported by the MT engine)
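A minimal sketch of the first two steps (sentence segmentation and chunk packing); this is illustrative only, not the CLI's actual code, and assumes NLTK with its punkt model:

```python
# Illustrative sketch (not the Intento CLI's actual code): split a file into
# sentences with NLTK, then pack the sentences into chunks that stay under a
# per-request character limit of a hypothetical MT engine.
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)  # sentence tokenizer model

def pack_into_chunks(text: str, max_chars: int = 5000, lang: str = "english"):
    segments = sent_tokenize(text, language=lang)
    chunks, current, current_len = [], [], 0
    for seg in segments:
        if current and current_len + len(seg) > max_chars:
            chunks.append(current)
            current, current_len = [], 0
        current.append(seg)
        current_len += len(seg)
    if current:
        chunks.append(current)
    return chunks  # each chunk becomes one request (sent concurrently, with retries)

with open("large_sample.txt", encoding="utf-8") as f:
    chunks = pack_into_chunks(f.read())
print(f"{sum(len(c) for c in chunks)} segments packed into {len(chunks)} requests")
```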
28© Intento, Inc. / September 2018
Intento
4.4 Reference-Based Scoring
29
hLEPOR, BLEU, TER, RIBES scores
—

Download tools from GitHub or use the Intento API:
https://github.com/intento/intento-api/blob/master/score.md
https://bit.ly/intento_cli
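If you go the do-it-yourself route, a minimal sketch is shown below; it assumes the sacrebleu Python package as a stand-in for the tools and API listed above (sacrebleu covers BLEU, chrF and TER, but not hLEPOR or RIBES):

```python
# Minimal sketch, assuming a recent version of the `sacrebleu` package
# (not the Intento API): score one engine's output against a reference,
# at corpus level.
import sacrebleu

def score_engine(hyp_path: str, ref_path: str) -> dict:
    with open(hyp_path, encoding="utf-8") as f:
        hyps = [line.strip() for line in f]
    with open(ref_path, encoding="utf-8") as f:
        refs = [line.strip() for line in f]
    assert len(hyps) == len(refs), "hypothesis and reference files must be aligned"
    # sacrebleu expects a list of reference streams; here there is a single reference
    return {
        "BLEU": sacrebleu.corpus_bleu(hyps, [refs]).score,
        "chrF": sacrebleu.corpus_chrf(hyps, [refs]).score,
        "TER": sacrebleu.corpus_ter(hyps, [refs]).score,
    }

# Hypothetical file names: one output file per candidate engine
print(score_engine("engine_a_output.txt", "reference.txt"))
```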
© Intento, Inc. / September 2018
Intento
4.5 Manual Evaluation
30
MQM-DQF: expensive, ~$50 per 1,000-1,500 words
—

Or just make a quick visual assessment:
—

Is MT good enough for this project?

- segments with similarly high reference-based scores (how good it may get)
- segments with similarly low reference-based scores (which segments does no engine handle?)
—

If yes, which engine to pick?

- segments with low but different scores (shows NMT “quirks”)
- segments with high but different scores (shows the actionable difference)
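The selection itself can be scripted; below is an illustrative sketch (not an Intento feature) that buckets segments into the four categories above from per-segment scores, with arbitrary thresholds you would tune per metric and project:

```python
# Illustrative only: bucket segments into the four categories above, given
# per-segment reference-based scores for each candidate engine (aligned lists).
# The thresholds are arbitrary placeholders.
from statistics import mean

def bucket_segments(scores, high=0.6, low=0.3, max_spread=0.2):
    engines = list(scores)
    n_segments = len(scores[engines[0]])
    buckets = {"similar_high": [], "similar_low": [],
               "divergent_low": [], "divergent_high": []}
    for i in range(n_segments):
        vals = [scores[e][i] for e in engines]
        spread, avg = max(vals) - min(vals), mean(vals)
        if spread <= max_spread and avg >= high:
            buckets["similar_high"].append(i)    # how good it may get
        elif spread <= max_spread and avg <= low:
            buckets["similar_low"].append(i)     # segments no engine handles
        elif spread > max_spread and avg <= low:
            buckets["divergent_low"].append(i)   # NMT "quirks"
        elif spread > max_spread and avg >= high:
            buckets["divergent_high"].append(i)  # actionable difference
    return buckets
```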
© Intento, Inc. / September 2018
Intento
Where to look
Extremely low scores for some engines
31
source | I noodle battono gli spaghetti. | hLEPOR
reference | Noodles are beating spaghetti. | 1
5 MT engines | Noodles beat spaghetti. | 0.57
2 MT engines | The noodle beat the spaghetti. / The noodles beat the spaghetti. | 0.31
1 MT engine | The noodles beat the noodles | 0.31
1 MT engine | The noodle they beat the spaghettis. | 0
1 MT engine | The noodles are noodles. | 0.63
3 MT engines | Flying spaghetti noodles / The noodle flying spaghetti. | 0-0.31
1 MT engine | The noodles are spaghetti | 0.80
output from top candidate engines
© Intento, Inc. / September 2018
Intento
MT quirks
Extremely low scores for some engines
32
source | I noodle battono gli spaghetti. | hLEPOR
reference | Noodles are beating spaghetti. | 1
5 MT engines | Noodles beat spaghetti. | 0.57
2 MT engines | The noodle beat the spaghetti. / The noodles beat the spaghetti. | 0.31
1 MT engine | The noodles beat the noodles | 0.31
1 MT engine | The noodle they beat the spaghettis. | 0
1 MT engine | The noodles are noodles. | 0.63
3 MT engines | Flying spaghetti noodles / The noodle flying spaghetti. | 0-0.31
1 MT engine | The noodles are spaghetti | 0.80
1. Both NMT and reference-based metrics are not very good at short sentences.
2. This manual check shows that one of the overall good engines is not good at synonyms.
© Intento, Inc. / September 2018
Intento
How “good” can it get?
Equally high scores among the engines
33
source | Il governo britannico non tollererà un'altra grande festa.Così le decisioni prese con la legge del 2 agosto andranno a ridurre il deficit dalla parte “discrezionale non collegata alla difesa” del budget federale, che equivale solo al 10% del bilancio totale. | hLEPOR
reference | So the structure established by the August 2 law concentrates deficit reduction on the “discretionary non-defense” part of the federal budget, which is only about 10% of it. | 1
1. Amazon | Thus, the decisions taken under the Act of 2 August will reduce the deficit from the “discretionary non-defence-related” part of the federal budget, which amounts to only 10% of the total budget. | 0.63
2. DeepL | Thus the decisions taken with the law of August 2 will reduce the deficit from the "discretionary, non-defensive" part of the federal budget, which amounts to only 10% of the total budget. | 0.63
3. Google | So the decisions made with the law of August 2 will reduce the deficit from the "discretionary not related to the defense" part of the federal budget, which is equivalent to only 10% of the total budget. | 0.66
4. IBM NMT | Thus the decisions taken by the Act of 2 August will reduce the deficit from the "discretionary unconnected to defence" part of the federal budget, which is equivalent to only 10% of the total budget. | 0.65
© Intento, Inc. / September 2018
Intento
How “bad” can it get?
Equally low scores among the engines
34
source | Al momento, diverse aziende sembrano seguire procedimenti simili. | hLEPOR
reference | There are hints of firms responding similarly now. | 1
1. Amazon | At the moment, several companies seem to be following similar procedures. | 0
2. DeepL | At the moment, several companies seem to be following similar procedures. | 0
3. Google | At the moment, several companies seem to follow similar procedures. | 0
4. IBM NMT | At present, several companies appear to follow similar procedures. | 0
© Intento, Inc. / September 2018
Intento
How “bad” can it get?
Equally low scores among the engines
35
source | Al momento, diverse aziende sembrano seguire procedimenti simili. | hLEPOR
reference | There are hints of firms responding similarly now. | 1
1. Amazon | At the moment, several companies seem to be following similar procedures. | 0
2. DeepL | At the moment, several companies seem to be following similar procedures. | 0
3. Google | At the moment, several companies seem to follow similar procedures. | 0
4. IBM NMT | At present, several companies appear to follow similar procedures. | 0
1. If all engines agree on something different from the reference, it’s likely to be fine.
© Intento, Inc. / September 2018
Intento
Actionable difference
Different high scores
36
source | Si sono commessi grandi errori, alimentando ulteriori violenze. | hLEPOR
reference | Big mistakes were made, fueling further violence. | 1
1. Amazon | Big mistakes have been made, fuelling further violence. | 0.75
2. DeepL | Major mistakes have been made, fuelling further violence. | 0.64
3. Google | Great mistakes have been made, fueling further violence. | 0.75
4. IBM NMT | There have been major errors, fuelling further violence. | 0.35
© Intento, Inc. / September 2018
Intento
(N)MT quirks
37
“It’s not a story the Jedi
would tell you.”
“”
“Star Wars franchise
is overrated”
A good NMT engine, in
top-5% for a couple of pairs
© Intento, Inc. / September 2018
Intento
Otherwise a good NMT from
a famous brand
(N)MT quirks
38
“Unisex Nylon Laptop
Backpack School Travel
Rucksack Satchel
Shoulder Bag”
“рюкзак”
“Author is an idiot.
I will fix it!”
(a backpack)
© Intento, Inc. / September 2018
Intento
(N)MT quirks
39
“hello, world!” “”
“Are you kidding
me?!”
Good new NMT engine, best
at some language pairs
© Intento, Inc. / September 2018
Intento
4.6 Training and Using Custom NMT
40
Domain-adaptive engines with public pricing: Globalese, Microsoft Custom Translator, IBM Custom NMT, Google AutoML
—

In-domain corpora (starting from 10K segments) and/or a glossary
—

Makes sense as long as it performs better than stock MT
—

Different pricing models: training, translation, maintenance
© Intento, Inc. / September 2018
THANK YOU!
Konstantin Savenkov

ks@inten.to
(415) 429-0021
2150 Shattuck Ave
Berkeley CA 94705
41
