Ai meetup Neural machine translation updated

How to build your own
translator in 15 minutes
Neural Machine Translation in practice
Bartek Rozkrut
2040.io

Why is it so important?
• 40 billion USD / year industry
• Language is a huge barrier for many people
• Provide unlimited access to knowledge
• Scale NLP problems

RNN VS CNN IN MACHINE TRANSLATION

Why build your own translator?
• Private / sensitive data
• Huge amount of data – e.g. e-mail translation (cost)
• Off-line / off-cloud / on-premise
• Custom domain-specific translation / vocabulary

Neural Machine Translation – example workflow
1. Download Parallel Corpus files
2. Concatenate all corpus files (source and target) in the same order
3. Split TRAIN / VAL set
4. Tokenization
5. Preprocess (build vocabulary, remove overly long sentences, …)
6. Train
7. Release model (CPU compatible)
8. Translate!
9. REPEAT! ☺
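
Step 2 of the workflow is where alignment is easiest to break, so here is a minimal Python sketch of it. The function name `concat_parallel` and the file names in the usage note are hypothetical, not part of OpenNMT. It assumes one sentence per line and refuses to continue if a source/target pair differs in line count.

```python
from typing import Iterable, List, Tuple


def read_lines(path: str) -> List[str]:
    """Read a UTF-8 text file as a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def concat_parallel(pairs: Iterable[Tuple[str, str]], out_src: str, out_tgt: str) -> int:
    """Append every (source, target) file pair to the outputs in the same order.

    Returns the number of sentence pairs written. Raises ValueError when a
    pair has mismatched line counts, because a single missing line would
    shift every later sentence pair out of alignment.
    """
    total = 0
    with open(out_src, "w", encoding="utf-8") as fs, \
         open(out_tgt, "w", encoding="utf-8") as ft:
        for src_path, tgt_path in pairs:
            src_lines = read_lines(src_path)
            tgt_lines = read_lines(tgt_path)
            if len(src_lines) != len(tgt_lines):
                raise ValueError(f"{src_path} and {tgt_path} differ in line count")
            fs.writelines(line + "\n" for line in src_lines)
            ft.writelines(line + "\n" for line in tgt_lines)
            total += len(src_lines)
    return total
```

Example call (hypothetical file names): `concat_parallel([("a.pl", "a.en"), ("b.pl", "b.en")], "src.txt", "tgt.txt")`.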

Parallel Corpus – public data
http://opus.lingfil.uu.se

Parallel Corpus (source file – PL, EUROPARL)
1. Tytuł: Admirał NATO potrzebuje przyjaciół.
2. Dziękuję.
3. Naprawdę potrzebuję...
4. Ten program stał się katalizatorem. Następnego dnia setki osób chciały mnie dodać do znajomych. Indonezyjczycy i Finowie pisali: "Admirale, słyszeliśmy, że potrzebuje pan znajomych, a tak przy okazji, co to jest NATO?"

Parallel Corpus (target file - EN , EUROPARL)
1. The headline was: NATO Admiral Needs Friends.
2. Thank you.
3. Which I do.
4. And the story was a catalyst, and the next morning I had hundreds of Facebook friend requests from Indonesians and Finns, mostly saying, "Admiral, we heard you need a friend, and oh, by the way, what is NATO?"

Vocabulary
1. Word level
2. Sub-word level (e.g. Byte Pair Encoding)
3. Character level
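
To make the sub-word option concrete, here is a toy Python sketch of Byte Pair Encoding: it repeatedly merges the most frequent adjacent symbol pair. This is a simplified illustration with hypothetical function names, not the tokenizer shipped with any toolkit. The `</w>` end-of-word marker and the tie-breaking rule are assumptions of this sketch.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from an iterable of words."""
    # Every word starts as its characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by the pair itself so runs are repeatable.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a word into sub-word units by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

For comparison, word level is roughly `sentence.split()` and character level is `list(sentence)`. BPE sits in between: frequent words end up as single units, while rare words fall apart into smaller pieces instead of becoming `<unk>`.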

http://opennmt.net/
OPENNMT (RNN) – DECEMBER 2016

https://google.github.io/seq2seq/
GOOGLE’S SEQ2SEQ (RNN) – MARCH 2017

https://github.com/facebookresearch/fairseq/
FACEBOOK FAIRSEQ (CNN) – MAY 2017

CONVOLUTIONAL NEURAL NETWORK VS RECURRENT NEURAL NETWORK IN MACHINE TRANSLATION: 9X SPEEDUP FOR CNN, AS REPORTED BY FACEBOOK FOR FAIRSEQ

Our experience from PL=>EN training
• 100k vocabulary (word-level)
• Bidirectional LSTM, 2 layers, RNN size 500
• 5M sentences from public data sources
• 2 weeks of training on a single NVIDIA Tesla K80 GPU
• ~20 BLEU
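
For readers unfamiliar with BLEU, here is a simplified Python sketch of the metric: clipped n-gram precision for n = 1..4, combined by geometric mean and multiplied by a brevity penalty. It is a sentence-level, single-reference, unsmoothed version on a 0..1 scale. Published scores such as the ~20 above are normally computed over a whole test corpus and reported on a 0..100 scale, so this sketch will not reproduce them exactly.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU against one reference, whitespace-tokenized, no smoothing."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        total = sum(hyp_counts.values())
        # Clip each n-gram count by how often it occurs in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any empty precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty punishes hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)
```

Usage: `bleu("the cat sat on the mat", "the cat sat on the mat")` scores a perfect match, while a hypothesis that shares no 4-gram with its reference scores 0 in this unsmoothed form.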

Our experience from PL=>EN translation (word level)
• [PL] Kora mózgowa jest odpowiedzialna za wszystkie nasze racjonalne i analityczne myśli oraz język.
• [EN] The neocortex is responsible for all of our rational and analytical thought and language.
• [HYPOTHESIS] <unk> cortex is responsible for all our rational and analytical thoughts and language.

Our experience from PL=>EN translation (word level)
• [PL] Jesteśmy firmą zajmującą się automatyzacją, która ma na celu budowanie lekkich struktur bo są bardziej wydajne energetycznie. Chcemy się nauczyć więcej o pneumatyce i przepływie powietrza.
• [EN] We are a company in the field of automation, and we'd like to do very lightweight structures because that's energy efficient, and we'd like to learn more about pneumatics and air flow phenomena.
• [HYPOTHESIS] We're a <unk> company, which is designed to build light structures because they're more energy efficient, and we want to learn more about <unk> and air flow.
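
The `<unk>` tokens in the hypotheses above are typical of a fixed word-level vocabulary: any word outside the most frequent N has no entry of its own. A minimal Python sketch of that effect follows. The function names are hypothetical and the toy vocabulary is far smaller than the 100k used in training.

```python
from collections import Counter

UNK = "<unk>"


def build_vocab(sentences, max_size):
    """Keep only the `max_size` most frequent whitespace-separated tokens."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return {tok for tok, _ in counts.most_common(max_size)}


def apply_vocab(sentence, vocab):
    """Replace every out-of-vocabulary token with the <unk> placeholder."""
    return " ".join(tok if tok in vocab else UNK for tok in sentence.split())
```

With a vocabulary built from a corpus where "air" and "flow" are frequent but "pneumatics" is rare, `apply_vocab("pneumatics and air flow", vocab)` keeps the frequent words and maps the rare ones to `<unk>`. Sub-word vocabularies such as BPE are a common way to reduce this.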

OpenNMT – run Docker container
Run CPU-based interactive session with command:
sudo docker run -it 2040/opennmt bash
Run GPU-based interactive session with command:
sudo nvidia-docker run -it 2040/opennmt bash

OpenNMT – split parallel corpus
# First 90% of the lines go to TRAIN, the remaining 10% to VAL
split -l $(( $(wc -l < src.txt) * 9 / 10 )) src.txt
mv xaa train-src.txt
mv xab val-src.txt
# Same split point for the target side keeps sentence pairs aligned
split -l $(( $(wc -l < tgt.txt) * 9 / 10 )) tgt.txt
mv xaa train-tgt.txt
mv xab val-tgt.txt
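
The same 90/10 split can be sketched in Python, which makes the alignment requirement explicit: both sides must be cut at the same index. The function name and the `train_percent` parameter are hypothetical. Like the shell version, this takes the first 90% as TRAIN without shuffling.

```python
def split_parallel(src_lines, tgt_lines, train_percent=90):
    """Split aligned sentence lists into train and validation parts.

    Returns (train_src, train_tgt, val_src, val_tgt). Both sides are cut at
    the same index so that sentence pairs stay aligned.
    """
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    # Integer arithmetic avoids floating-point rounding surprises at the cut point.
    cut = len(src_lines) * train_percent // 100
    return src_lines[:cut], tgt_lines[:cut], src_lines[cut:], tgt_lines[cut:]
```

Example: with 10 sentence pairs the call returns 9 pairs for training and 1 for validation.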

OpenNMT – preprocess parallel corpus
th tools/tokenize.lua -joiner_annotate -mode aggressive < train-src.txt > train-src.txt.tok
th tools/tokenize.lua -joiner_annotate -mode aggressive < train-tgt.txt > train-tgt.txt.tok
th tools/tokenize.lua -joiner_annotate -mode aggressive < val-src.txt > val-src.txt.tok
th tools/tokenize.lua -joiner_annotate -mode aggressive < val-tgt.txt > val-tgt.txt.tok
th preprocess.lua -train_src train-src.txt.tok -train_tgt train-tgt.txt.tok -valid_src val-src.txt.tok -valid_tgt val-tgt.txt.tok -save_data _data
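
One thing the preprocessing step does, according to the workflow slide, is remove sentences that are too long. A rough Python sketch of that filter is below. It is an illustration only, not OpenNMT's implementation. The function name is hypothetical and the limit of 50 tokens is an assumption chosen for the example.

```python
def filter_by_length(pairs, max_len=50):
    """Keep only sentence pairs where both sides have 1..max_len tokens.

    `pairs` is a list of (source, target) strings that are already tokenized
    and whitespace-separated. A pair is dropped as a whole so the two sides
    stay aligned.
    """
    kept = []
    for src, tgt in pairs:
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if 0 < n_src <= max_len and 0 < n_tgt <= max_len:
            kept.append((src, tgt))
    return kept
```

Dropping the pair, rather than only the long side, matters: filtering the source and target files independently would break the line-by-line alignment.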

OpenNMT – train && release && translate
th train.lua -data _data-train.t7 -layers 2 -rnn_size 500 -brnn -save_model model -gpuid 1
th tools/release_model.lua -model model.t7 -gpuid 1
th translate.lua -model model.t7 -src val-src.txt.tok -output file-tgt.tok -gpuid 1

Best hyperparams from 250k GPU hours (thx Google)
https://arxiv.org/abs/1703.03906

Other applications
1. Image to text
2. OCR (e.g. Tesseract OCR v4.0 – LSTM)
3. Lip reading
4. Simple Q&A
5. Chatbots

http://web.stanford.edu/class/cs224n/

Thanks!
Bartek Rozkrut
bartek@2040.io
