How to build your own
translator in 15 minutes
Neural Machine Translation in practice
Bartek Rozkrut
2040.io
Why is it so
important?
40 billion USD /
year industry
Huge barrier for
many people
Provide unlimited
access to
knowledge
Scale NLP
problems
Ai meetup Neural machine translation updated
RNN vs CNN
IN MACHINE
TRANSLATION
Why your own translator?
• Private / sensitive data
• Huge amount of data – e.g. e-mail translation (cost)
• Off-line / off-cloud / on-premise
• Custom domain-specific translation / vocabulary
Neural Machine Translation – example workflow
1. Download Parallel Corpus files
2. Concatenate all corpus files (source + target) in the same order
3. Split TRAIN / VAL set
4. Tokenization
5. Preprocess (build vocabulary, remove overly long sentences, …)
6. Train
7. Release model (CPU compatible)
8. Translate!
9. REPEAT! ☺
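Step 2 only works if the source and target files stay line-aligned. A minimal sketch of that step, assuming each downloaded corpus is stored as a hypothetical <prefix>.pl / <prefix>.en pair (the function names are illustrative and not part of OpenNMT):

```shell
# Concatenate several parallel corpora in the same order on both sides.
# $1 = output source file, $2 = output target file,
# remaining arguments = corpus prefixes (hypothetical naming: <prefix>.pl, <prefix>.en)
merge_corpora() {
    out_src=$1
    out_tgt=$2
    shift 2
    : > "$out_src"
    : > "$out_tgt"
    for prefix in "$@"; do
        cat "$prefix.pl" >> "$out_src"
        cat "$prefix.en" >> "$out_tgt"
    done
}

# Succeeds only if both files have the same number of lines.
same_line_count() {
    [ "$(wc -l < "$1")" -eq "$(wc -l < "$2")" ]
}
```

For example, merge_corpora src.txt tgt.txt corpus-a corpus-b followed by same_line_count src.txt tgt.txt would catch a corpus whose two halves differ in length before any training time is spent.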
Parallel Corpus – public data
http://opus.lingfil.uu.se
Parallel Corpus (source file – PL, EUROPARL)
1. Tytuł: Admirał NATO potrzebuje przyjaciół.
2. Dziękuję.
3. Naprawdę potrzebuję...
4. Ten program stał się katalizatorem. Następnego dnia setki
osób chciały mnie dodać do znajomych. Indonezyjczycy i
Finowie Pisali: "Admirale, słyszeliśmy, że potrzebuje pan
znajomych, a tak przy okazji, co to jest NATO?"
Parallel Corpus (target file - EN , EUROPARL)
1. The headline was: NATO Admiral Needs Friends.
2. Thank you.
3. Which I do.
4. And the story was a catalyst, and the next morning I had
hundreds of Facebook friend requests from Indonesians and
Finns, mostly saying, "Admiral, we heard you need a friend, and
oh, by the way, what is NATO?"
Vocabulary
1. Word level
2. Sub-word level (e.g. Byte Pair Encoding)
3. Character level
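To make the word-level option concrete, here is a rough sketch of building a fixed-size vocabulary from the most frequent words and replacing everything else with <unk>. It assumes whitespace-tokenized input; build_vocab and mark_unknown are hypothetical helper names, and this is not how OpenNMT's preprocess.lua is implemented:

```shell
# Print the N most frequent words, one per line.
# $1 = vocabulary size; reads whitespace-tokenized text on stdin.
build_vocab() {
    tr -s '[:space:]' '\n' \
        | grep -v '^$' \
        | LC_ALL=C sort \
        | uniq -c \
        | LC_ALL=C sort -k1,1nr -k2,2 \
        | head -n "$1" \
        | awk '{ print $2 }'
}

# Replace every word that is not in the vocabulary with <unk>.
# $1 = vocabulary file (one word per line, assumed non-empty);
# reads tokenized text on stdin.
mark_unknown() {
    awk 'NR == FNR { vocab[$1] = 1; next }
         { for (i = 1; i <= NF; i++) if (!($i in vocab)) $i = "<unk>"; print }' "$1" -
}
```

With a small vocabulary most rare words collapse into <unk>, which is the trade-off that sub-word and character-level vocabularies try to avoid.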
BLEU
http://opennmt.net/
OPENNMT (RNN) – DECEMBER 2016
https://google.github.io/seq2seq/
GOOGLE’S SEQ2SEQ (RNN) – MARCH 2017
https://github.com/facebookresearch/fairseq/
FACEBOOK FAIRSEQ (CNN) – MAY 2017
CONVOLUTIONAL NEURAL NETWORK
VS
RECURRENT NEURAL NETWORK
MACHINE TRANSLATION
9X
SPEEDUP
Our experience from PL=>EN training
• 100k vocabulary (word-level)
• Bidirectional LSTM, 2 layers, RNN size 500
• 5M sentences from public data sources
• 2 weeks of training on 1 GPU NVIDIA Tesla K80
• ~ 20 BLEU
Our experience from PL=>EN translation (word level)
• [PL] Kora mózgowa jest odpowiedzialna za
wszystkie nasze racjonalne i analityczne myśli
oraz język.
• [EN] The neocortex is responsible for all of our
rational and analytical thought and language.
• [HYPOTHESIS] <unk> cortex is responsible for all
our rational and analytical thoughts and language.
Our experience from PL=>EN translation (word level)
• [PL] Jesteśmy firmą zajmującą się automatyzacją, która ma na celu
budowanie lekkich struktur bo są bardziej wydajne energetycznie.
Chcemy się nauczyć więcej o pneumatyce i przepływie powietrza.
• [EN] We are a company in the field of automation, and we'd like to
do very lightweight structures because that's energy efficient, and
we'd like to learn more about pneumatics and air flow phenomena.
• [HYPOTHESIS] We're a <unk> company, which is designed to build
light structures because they're more energy efficient, and we want
to learn more about <unk> and air flow.
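The <unk> tokens in the hypotheses above typically mark words that fall outside the 100k word-level vocabulary. A rough way to track how often this happens, assuming the translated output is whitespace-tokenized (unk_rate is a hypothetical helper name):

```shell
# Count <unk> tokens in translated output.
# Reads tokenized hypotheses on stdin; prints "<unk count> <total token count>".
unk_rate() {
    awk '{ for (i = 1; i <= NF; i++) { total++; if ($i == "<unk>") unk++ } }
         END { printf "%d %d\n", unk, total }'
}
```

Running unk_rate < file-tgt.tok after each training round gives a simple number to compare when trying a larger vocabulary or a sub-word vocabulary.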
OpenNMT – run Docker container
Run CPU-based interactive session with command:
sudo docker run -it 2040/opennmt bash
Run GPU-based interactive session with command:
sudo nvidia-docker run -it 2040/opennmt bash
OpenNMT – split parallel corpus
# Keep the first 90% of lines for training and the rest for validation
split -l $(( $(wc -l < src.txt) * 9 / 10 )) src.txt
mv xaa train-src.txt
mv xab val-src.txt
split -l $(( $(wc -l < tgt.txt) * 9 / 10 )) tgt.txt
mv xaa train-tgt.txt
mv xab val-tgt.txt
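Taking the last 10% of a concatenated corpus as the validation set can leave it dominated by whichever corpus was appended last. One possible alternative, sketched under the assumptions that GNU shuf is available and that no sentence contains a tab character, is to shuffle source/target pairs together before splitting (split_parallel is a hypothetical helper name):

```shell
# Shuffle sentence pairs together, then split 90% / 10%.
# $1 = source file, $2 = target file, $3 = existing output directory
split_parallel() {
    src=$1
    tgt=$2
    out=$3
    total=$(wc -l < "$src")
    train=$(( total * 9 / 10 ))
    # Put each pair on one tab-separated line so shuffling keeps pairs aligned.
    paste "$src" "$tgt" | shuf > "$out/pairs.tsv"
    head -n "$train" "$out/pairs.tsv" | cut -f1 > "$out/train-src.txt"
    head -n "$train" "$out/pairs.tsv" | cut -f2 > "$out/train-tgt.txt"
    tail -n +"$(( train + 1 ))" "$out/pairs.tsv" | cut -f1 > "$out/val-src.txt"
    tail -n +"$(( train + 1 ))" "$out/pairs.tsv" | cut -f2 > "$out/val-tgt.txt"
}
```

The output file names match the ones used in the following steps, so the rest of the pipeline stays unchanged.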
OpenNMT – preprocess parallel corpus
# Tokenize the training and validation files
th tools/tokenize.lua -joiner_annotate -mode aggressive < train-src.txt > train-src.txt.tok
th tools/tokenize.lua -joiner_annotate -mode aggressive < train-tgt.txt > train-tgt.txt.tok
th tools/tokenize.lua -joiner_annotate -mode aggressive < val-src.txt > val-src.txt.tok
th tools/tokenize.lua -joiner_annotate -mode aggressive < val-tgt.txt > val-tgt.txt.tok
# Build the vocabularies and the serialized training data
th preprocess.lua -train_src train-src.txt.tok -train_tgt train-tgt.txt.tok -valid_src val-src.txt.tok -valid_tgt val-tgt.txt.tok -save_data _data
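The workflow lists removing overly long sentences as part of preprocessing. As an illustration of that idea outside OpenNMT, here is a minimal sketch that drops any sentence pair in which either side exceeds a token limit. It assumes whitespace-tokenized, tab-free input; filter_long_pairs, its arguments and the output naming are hypothetical:

```shell
# Keep only sentence pairs where both sides have between 1 and MAX tokens.
# $1 = tokenized source, $2 = tokenized target, $3 = max tokens, $4 = output prefix
# Writes <prefix>.src and <prefix>.tgt (only created if at least one pair is kept).
filter_long_pairs() {
    paste "$1" "$2" | awk -F'\t' -v max="$3" -v prefix="$4" '
        {
            ns = split($1, s, " ")
            nt = split($2, t, " ")
            if (ns >= 1 && nt >= 1 && ns <= max && nt <= max) {
                print $1 > (prefix ".src")
                print $2 > (prefix ".tgt")
            }
        }'
}
```

Filtering both sides in one pass keeps the two output files line-aligned, which filtering each file separately would not guarantee.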
OpenNMT – train && release && translate
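# Train a 2-layer bidirectional RNN (size 500) on GPU 1
th train.lua -data _data-train.t7 -layers 2 -rnn_size 500 -brnn -save_model model -gpuid 1
# Release the trained model so that it is CPU compatible
th tools/release_model.lua -model model.t7 -gpuid 1
# Translate the tokenized validation source file
th translate.lua -model model.t7 -src val-src.txt.tok -output file-tgt.tok -gpuid 1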
Best hyperparams from 250k GPU hours (thx Google)
https://arxiv.org/abs/1703.03906
Other applications
1. Image to text
2. OCR (e.g. Tesseract OCR v4.0 – LSTM)
3. Lip reading
4. Simple Q&A
5. Chatbots
http://web.stanford.edu/class/cs224n/
Thanks!
Bartek Rozkrut
bartek@2040.io
