Deep Learning
for Machine Translation
A dramatic turn of paradigm
Alberto Massidda
Who we are
● Founded in 2001;
● Branches in Milan, Rome and London;
● Market leader in enterprise-ready solutions based on Open Source tech;
● Expertise:
○ Open Source
○ DevOps
○ Public and private cloud
○ Search
○ BigData and many more...
This presentation is Open Source (yay!)
https://creativecommons.org/licenses/by-nc-sa/3.0/
Outline
1. Statistical Machine Translation
2. Neural Machine Translation
3. Domain Adaptation
4. Zero shot translation
5. Unsupervised Neural MT
Statistical Machine Translation
Translating as recovering a ciphered message through the laws of probability:
1. Foreign language as a noisy channel
2. Language model and Translation model
3. Training (building the translation model)
4. Decoding (translating with the translation model)
Noisy channel model
Goal
Translate a sentence in foreign language f to our language e:
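The decision rule behind this setup (reconstructed here; the slide shows it as an image) is the standard noisy-channel argmax:

$$\hat{e} = \arg\max_{e} \; p(e \mid f) = \arg\max_{e} \; p(e)\, p(f \mid e)$$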
The abstract model
1. Transmit e over a noisy channel.
2. Channel garbles sentence and f is received.
3. Try to recover e by thinking about:
a. how likely it is that e was the message, p(e) (source model)
b. how e is turned into f, p(f|e) (channel model)
Word choice and word reordering
P(f|e) cares about words, in any order.
● “It’s too late” → “Tardi troppo è” ✓
● “It’s too late” → “È troppo tardi” ✓
● “It’s too late” → “È troppa birra” ✗
P(e) cares about word order.
● “È troppo tardi” ✓
● “Tardi troppo è” ✗
P(e) and P(f|e)
Where do these numbers come from?
Language model
P(e) comes from a Language model, a machine that assigns scores to sentences, estimating their likelihood.
1. Record every sentence ever said in English (1 billion?)
2. If the sentence “how’s it going?” appears 76413 times in that database, then we say:
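The estimate implied here is just the relative frequency (a reconstruction of the formula shown on the slide):

$$p(\text{how's it going?}) \approx \frac{76413}{1\,000\,000\,000}$$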
Translation model
Next we need to worry about P(f|e), the probability of a French string f given an
English string e.
This is called a translation model.
It boils down to computing alignments between source and target languages.
Computing alignments intuition
Pairs of English and Chinese words which come together in a parallel example
may be translations of each other.
Training Data
A parallel corpus is a collection of texts, each paired with its translation into one or more other languages.
EN | IT
Look at that! | Guarda lì!
I've never seen anything like that! | Non ho mai visto nulla di simile!
That's incredible! | È incredibile!
That's terrific. | È eccezionale.
Computing alignments: Expectation Maximization
This algorithm iterates over the data, progressively amplifying latent regularities of the system.
It converges to a local optimum without any user supervision.
Example with a 2-sentence toy corpus (alignment diagram on slide).
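A minimal IBM Model 1-style EM sketch in Python (a hypothetical toy corpus, not the example on the slide): it estimates word-translation probabilities t(f|e) from sentence pairs alone, with each iteration reinforcing consistently co-occurring word pairs.

```python
from collections import defaultdict

# toy parallel corpus (made up for illustration)
corpus = [("the house", "la casa"), ("the book", "il libro"), ("a house", "una casa")]
pairs = [(e.split(), f.split()) for e, f in corpus]

e_vocab = {w for e, _ in pairs for w in e}
f_vocab = {w for _, f in pairs for w in f}

# uniform initialization of t(f|e)
t = {(f, e): 1.0 / len(f_vocab) for e in e_vocab for f in f_vocab}

for _ in range(10):                      # a few EM iterations
    count = defaultdict(float)           # expected counts c(f, e)
    total = defaultdict(float)           # expected counts c(e)
    for e_sent, f_sent in pairs:
        for f in f_sent:                 # E-step: distribute each f over all e in the pair
            norm = sum(t[(f, e)] for e in e_sent)
            for e in e_sent:
                delta = t[(f, e)] / norm
                count[(f, e)] += delta
                total[e] += delta
    for (f, e), c in count.items():      # M-step: renormalize the expected counts
        t[(f, e)] = c / total[e]

# the highest-probability entries end up being plausible word translations
print(sorted(t.items(), key=lambda kv: -kv[1])[:5])
```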
Decoding
Now it’s time to decode our string encoded by the noisy channel.
Word alignments are leveraged to build a “space” for a search algorithm.
Translating is searching in a space of options.
Translation options as a coverage set
Decoding in action
1. The algorithm builds the search space as a tree of options, sorted by p(e|f).
a. Search space is limited to a fixed size named “beam”.
2. Options are picked highest-probability first.
a. Reordering adds a penalty.
b. Language model penalizes each stage output.
3. Translation stops when all source words are translated, or covered.
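A highly simplified, monotone beam-search sketch in Python (toy translation options and a stand-in language model, not a real phrase-based decoder such as Moses), just to make the loop above concrete:

```python
import math

# toy per-word translation options with log-probabilities log p(e|f);
# all numbers are made up for illustration
options = {
    "è":      [("it's", math.log(0.6)), ("is", math.log(0.4))],
    "troppo": [("too", math.log(0.7)), ("too much", math.log(0.3))],
    "tardi":  [("late", math.log(0.9)), ("lately", math.log(0.1))],
}
source = ["è", "troppo", "tardi"]
BEAM = 3

def lm_bonus(prev, word):
    # stand-in for a real language model: reward plausible bigrams
    good_bigrams = {(None, "it's"), ("it's", "too"), ("too", "late")}
    return 0.0 if (prev, word) in good_bigrams else math.log(0.5)

# each hypothesis: (score, produced words); the source is covered left to
# right, so reordering (and its penalty) is omitted in this sketch
beam = [(0.0, [])]
for f_word in source:
    expanded = []
    for score, words in beam:
        prev = words[-1] if words else None
        for e_word, logp in options[f_word]:
            expanded.append((score + logp + lm_bonus(prev, e_word), words + [e_word]))
    beam = sorted(expanded, reverse=True)[:BEAM]   # keep only the best hypotheses

print(" ".join(beam[0][1]))   # -> "it's too late"
```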
Neural machine translation
NMT is based on probability too, but has some differences:
● End-to-end training: no more separate Translation + Language Models.
● Markovian assumption, instead of Naive Bayesian: words move together.
If a sentence f of length n is a sequence of words $w_1, \dots, w_n$, then p(f) factorizes by the chain rule:

$$p(f) = \prod_{i=1}^{n} p(w_i \mid w_1, \dots, w_{i-1})$$
Neural network review: feed-forward
Weighted links determine the strength with which a neuron can influence its neighbours.
The deviation between outputs and expected values drives the rebalancing of the weights.
But a feed-forward network is not suitable for modelling the temporal dependencies
between words. We need an architecture that can explicitly model sequences.
Recurrent network
Neural language model
Encoder - Decoder architecture
With a source sentence f and a target sentence e, we could model them as one single sequence.
Languages are independent (vocabulary and domain), so we can split the model into 2 separate RNNs:
1. An encoder RNN reads f and compresses it into c, a summary vector of the source.
2. A decoder RNN emits each new word conditioned on the history and on c: p(e_i | e_1, ..., e_{i-1}, c).
Sequence-to-sequence (seq2seq) architecture
[Figure: encoder states h read "THE WAITER TOOK THE PLATES"; decoder states g emit "IL CAMERIERE PRESE I PIATTI".]
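A minimal, untrained NumPy sketch of such an encoder-decoder (random weights and tiny made-up vocabularies, purely illustrative): the encoder compresses the source into a summary vector c, and the decoder generates word by word, conditioned on its own history and on c.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 8                                             # hidden size
src_vocab = {"the": 0, "waiter": 1, "took": 2, "plates": 3}
tgt_vocab = ["<s>", "il", "cameriere", "prese", "i", "piatti", "</s>"]

E_src = rng.normal(size=(len(src_vocab), H))      # source embeddings
E_tgt = rng.normal(size=(len(tgt_vocab), H))      # target embeddings
W_enc = rng.normal(size=(2 * H, H)) * 0.1         # encoder RNN weights
W_dec = rng.normal(size=(3 * H, H)) * 0.1         # decoder RNN weights (also sees c)
W_out = rng.normal(size=(H, len(tgt_vocab))) * 0.1

def rnn_step(x, h, W):
    return np.tanh(np.concatenate([x, h]) @ W)

# 1. Encoder: read the source, keep the last hidden state as summary vector c
h = np.zeros(H)
for w in ["the", "waiter", "took", "the", "plates"]:
    h = rnn_step(E_src[src_vocab[w]], h, W_enc)
c = h

# 2. Decoder: greedy generation, each step conditioned on history and on c
s, word, out = np.zeros(H), "<s>", []
for _ in range(10):
    x = np.concatenate([E_tgt[tgt_vocab.index(word)], c])
    s = rnn_step(x, s, W_dec)
    word = tgt_vocab[int(np.argmax(s @ W_out))]
    if word == "</s>":
        break
    out.append(word)
print(out)   # gibberish until the weights are trained
```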
Summary vector as information bottleneck
A fixed-size representation degrades as sentence length increases.
This is because alignment learning operates on a many-to-many logic: the gradient from any alignment mistake flows back to every source position.
Let's gate the gradient flow through a context vector, a weighted average of the source hidden states (also known as "soft search" or "attention"); a code sketch follows the figures below.
The weights are computed by a feed-forward network with a softmax activation.
Attention model
[Figure sequence: the same encoder/decoder as above, now showing attention weights over the source at each decoder step. The bulk of the weight (≈0.7) shifts across the source words: on "THE" while generating "IL", on "WAITER" for "CAMERIERE", on "TOOK" for "PRESE", on "THE" for "I", and on "PLATES" for "PIATTI"; the other words receive ≈0.05-0.1.]
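A NumPy sketch of the soft-search step (random, untrained parameters; the additive scorer is one common choice and an assumption here, not necessarily the exact network on the slides): a small feed-forward network scores each source state against the current decoder state, a softmax turns the scores into weights, and the context vector is their weighted average.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 8
h_states = rng.normal(size=(5, H))        # one encoder state h per source word
s = rng.normal(size=H)                    # current decoder state
W_a = rng.normal(size=(2 * H, H)) * 0.1   # scorer parameters (hypothetical)
v_a = rng.normal(size=H)

# score each source position against the decoder state, softmax, then average
scores = np.array([np.tanh(np.concatenate([s, h]) @ W_a) @ v_a for h in h_states])
weights = np.exp(scores) / np.exp(scores).sum()     # softmax over source words
context = weights @ h_states                        # weighted average of the h_j
print(weights.round(2), context.shape)
```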
Neural domain adaptation
Sometimes we want our network to assume a particular style, but we don't have enough data.
Solution: adapt an already trained network.
1. First, train the full network on general data to obtain a general model.
2. Then, train the last layers on the new data so that they stylistically influence the output.
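A sketch of this recipe with Keras (a stand-in toy model and random placeholder data, since the actual general model is not part of the slides): train a general model first, then freeze the lower layers and continue training only the top layer on the small in-domain corpus.

```python
import numpy as np
import tensorflow as tf

# stand-in for the general model of step 1 (a real NMT model would be an
# encoder-decoder; a tiny classifier keeps the sketch self-contained)
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(1000, 32),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1000, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

general_x = np.random.randint(0, 1000, size=(256, 10))   # placeholder general data
general_y = np.random.randint(0, 1000, size=(256,))
model.fit(general_x, general_y, epochs=1, verbose=0)      # step 1: general model

# step 2: freeze the lower layers and adapt only the last one on in-domain data
for layer in model.layers[:-1]:
    layer.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy")

in_domain_x = np.random.randint(0, 1000, size=(64, 10))   # placeholder styled data
in_domain_y = np.random.randint(0, 1000, size=(64,))
model.fit(in_domain_x, in_domain_y, epochs=1, verbose=0)
```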
Zero shot translation: Google Neural MT
We can use a single system for multilingual MT: just feed all the different parallel data into the same system.
Tag the input data with the desired target language: NMT will translate into that language!
As a side effect, we build an internal "shared knowledge representation".
This enables translation between unseen language pairs (see the sketch after the figure).
[Figure (GNMT): one shared model connecting French, English, German and Italian. Tagged inputs "<2IT> I am here" and "<2DE> je suis ici" produce "Sono qui" and "Ich bin hier"; having seen FR → DE and EN → IT in training, the model can attempt EN → DE zero-shot.]
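A sketch of the tagging trick in Python (the token format mimics the figure and is an assumption, not Google's exact preprocessing):

```python
def tag(sentence: str, target_lang: str) -> str:
    # prepend the desired target-language token to the source sentence
    return f"<2{target_lang.upper()}> {sentence}"

# training data from different language pairs is simply mixed together
mixed_training_data = [
    (tag("I am here", "it"), "Sono qui"),        # EN → IT pair
    (tag("je suis ici", "de"), "Ich bin hier"),  # FR → DE pair
]

# at inference time, an unseen pair is requested just by changing the tag
print(tag("I am here", "de"))   # EN → DE, never observed during training
```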
Unsupervised NMT
We can translate even without parallel data, using just two monolingual corpora.
Each corpus builds a latent semantic space. Similar languages build similar spaces.
Translation as geometrical mapping between affine latent semantic spaces.
[Figure: a source sentence x is encoded into a point z in the latent space; one decoder maps z to the target sentence y (translation), another maps it back to x̂ (auto-encoder reconstruction).]
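A NumPy sketch of the geometric intuition: if the two embedding spaces are (nearly) rotations of each other, the best orthogonal mapping between them has a closed form via SVD (orthogonal Procrustes). The row-wise alignment of X and Y below is an assumption, as if given by a seed dictionary; fully unsupervised methods bootstrap such a mapping without it.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 200
X = rng.normal(size=(n, d))                    # toy source-language embeddings
true_R = np.linalg.qr(rng.normal(size=(d, d)))[0]
Y = X @ true_R                                 # target space = rotated source space

# solve min_W ||X W - Y||_F with W orthogonal: W = U V^T where X^T Y = U S V^T
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt
print(np.allclose(X @ W, Y, atol=1e-6))        # True: the mapping is recovered
```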
Links
https://www.tensorflow.org/tutorials/seq2seq
NMT (seq2seq) Tutorial
https://github.com/google/seq2seq
A general-purpose encoder-decoder framework for TensorFlow
https://github.com/awslabs/sockeye
seq2seq framework with a focus on NMT based on Apache MXNet
http://www.statmt.org/
Old school statistical MT reference site
Q&A