Toward Real-Time Simultaneous Translation with LLM
Presented by Siqi Ouyang (CMU)
Simultaneous Translation
• Simultaneous translation generates the translation incrementally, given only partial source input (see the read/write sketch below).
1. The =>
2. The cat => 猫
3. The cat is on => 猫在
4. The cat is on the => 猫在
5. The cat is on the mat => 猫在垫子上
• Three settings by modality:
  • Text-to-Text (SMT, simultaneous machine translation)
  • Speech-to-Text (SST, simultaneous speech translation)
  • Speech-to-Speech
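To make the read/write view above concrete, here is a toy sketch of the generic simultaneous-translation loop. This is our own illustration rather than anything from the talk; `toy_model`, `toy_policy`, and the word lexicon are hypothetical stand-ins, chosen so the loop reproduces the example above.

```python
# A toy sketch of the generic simultaneous-translation loop (our illustration,
# not the talk's system): a policy decides when to READ more source and when
# to WRITE the next target token.
def simultaneous_translate(source_stream, policy, model):
    src, hyp = [], []
    for token in source_stream:              # READ: one more source token arrives
        src.append(token)
        while policy(src, hyp):              # WRITE while the policy allows it
            hyp.append(model(src, hyp))
    while (tok := model(src, hyp)) is not None:  # source finished: flush the rest
        hyp.append(tok)
    return hyp

# Hypothetical stand-ins: a word-for-word "model" and a write-when-possible policy.
lexicon = {"The": "", "cat": "猫", "is": "", "on": "在", "the": "", "mat": "垫子上"}

def toy_model(src, hyp):
    outputs = [lexicon[w] for w in src if lexicon[w]]
    return outputs[len(hyp)] if len(hyp) < len(outputs) else None

def toy_policy(src, hyp):
    return toy_model(src, hyp) is not None

print("".join(simultaneous_translate("The cat is on the mat".split(), toy_policy, toy_model)))
# -> 猫在垫子上, matching the example above
```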
What Does Real-Time Mean?
• Real-time = extremely low latency = sub-second latency
• Two kinds of latency:
  • Computational: time taken to compute model features
  • Algorithmic: latency induced purely by the model's read/write decisions
• Why do we want sub-second latency?
  • Seamless cross-lingual conversation
Status Quo
• State-of-the-art SST works well at a 2-second algorithmic latency, but quality suffers at sub-second algorithmic latency.
Path Toward Real-Time
• Achieve better translation quality using an LLM as the backbone
  • CMU IWSLT24 SST: 1st place on En-De in terms of human rating
• Reduce the computational latency of the above system
  • Avoid feature recomputation without sacrificing too much quality
• Future anticipation with LLM
  • Ongoing work
CMU's IWSLT 2024 Simultaneous Speech Translation System
Xi Xu, Siqi Ouyang, Brian Yan, Patrick Fernandes, William Chen, Lei Li, Graham Neubig, Shinji Watanabe
SimulST Settings
• AL (Average Lagging) must be below 2 s on the MuST-C v2.0 tst-COMMON set (sketched below)
• Training data must adhere to the "constrained with LLMs" condition of the offline task
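Since the track constrains AL, a minimal sketch of (non-computation-aware) Average Lagging may help. This is our own illustration following the standard definition, not the official SimulEval implementation.

```python
# Average Lagging (AL), minimal sketch of the standard definition.
# delays[i] = amount of source (e.g., words or seconds) consumed when
# target unit i+1 was emitted.
def average_lagging(delays, src_len, tgt_len):
    gamma = tgt_len / src_len                       # target/source length ratio
    # tau = first target index whose delay already covers the whole source
    tau = next(i for i, d in enumerate(delays, start=1) if d >= src_len)
    return sum(delays[i - 1] - (i - 1) / gamma for i in range(1, tau + 1)) / tau

# A wait-3 schedule on a 5-word source with a 5-word target lags by 3 words:
print(average_lagging([3, 4, 5, 5, 5], src_len=5, tgt_len=5))  # -> 3.0
```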
Training
• We use MuST-C v2.0 as the only training set and leverage pretrained WavLM and Llama2-7B-Base models.
• The total parameter size is 7.4B.
• Training time is 38 hours.
Data Filtering
• Simply removing unnecessary names gives us a 0.8 BLEU score increase.
Other Teams
• FBK:
  o SeamlessM4T
  o Policy: AlignAtt
• HW-TSC:
  o Cascade with offline ASR & an ensemble of offline MT models
  o Policy: a modification of onlinization (Polák et al., 2022)
• NAIST:
  o HuBERT + mBART
  o Policy: local agreement
Results
• Best human rating!
• Our large model achieves competitive computation-aware latency.
Fast LLM-based Simultaneous Speech Translation
Siqi Ouyang, Xi Xu, Chinmay Dandekar, Lei Li
Motivation
• In our previous IWSLT submission, we recompute the features of the entire speech input and all previously generated text after receiving each speech segment, in order to get the best quality.
• We want a method that avoids recomputation while keeping the translation quality.
How to Avoid Recomputation?
• During inference, speech and text interleave: [speech] [text] [speech] [text] [speech] …
• If we want to avoid recomputation, we need to keep this interleaved order when caching features.
• However, this is inconsistent with how the model is trained: during training, the full speech always comes before the text.
• We need to address this training-inference mismatch (see the sketch below).
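To see what avoiding recomputation buys, here is a text-only analogy using a cached decoder state. This is our own sketch, not FASST itself: GPT-2 via Hugging Face transformers is a stand-in model, and the text chunks are hypothetical stand-ins for arriving speech segments. The real system would interleave cached speech and text states in exactly the arrival order shown above.

```python
# Contrast recomputing the whole prefix each step with reusing cached states.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

chunks = ["The cat", " is on", " the mat"]   # stand-ins for arriving segments

with torch.no_grad():
    # (a) Recomputation: re-encode the whole prefix after every chunk (O(n^2) overall).
    prefix = ""
    for chunk in chunks:
        prefix += chunk
        ids = tok(prefix, return_tensors="pt").input_ids
        out = model(ids)                     # recomputes all earlier chunks too

    # (b) Incremental: cache past states and encode only the new chunk (O(n) overall).
    past = None
    for chunk in chunks:
        ids = tok(chunk, return_tensors="pt").input_ids
        out = model(ids, past_key_values=past, use_cache=True)
        past = out.past_key_values           # the cache keeps the arrival order fixed
```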
Method
Experiment Setup
• Dataset: MuST-C En-De/Es
  • Concatenate adjacent audio clips into 30-second audio clips
• Model config
  • wav2vec 2.0 Large as the encoder, Llama2-7B-Base as the decoder
  • Train stage 1 for 500k steps and stage 2 for 1 epoch
  • Speech segment size = 1 second, roughly the block size of the speech encoder
  • Wait-K-stride-N with N=3
  • Greedy decoding during inference
Experiment Setup
• Evaluation
  • BLEU and LAAL-CA (see the definition sketch after this list)
  • LAAL-CA is the SimulEval version, but it is problematic
  • Batch size = 8 to stress test
  • Single A6000 GPU
• Baselines
  • Wait-K LST (7B): test-time wait-k of the offline speech LLM
  • EDAtt, AlignAtt (0.1B): previous state of the art
  • SeamlessStreaming
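For reference, a hedged sketch of the metric as we understand it (the SimulEval implementation may differ in details, which is part of the criticism above): LAAL is Average Lagging normalized by the longer of hypothesis and reference, and the computation-aware variant (LAAL-CA) measures the delays in wall-clock time, including model computation:

$$
\mathrm{LAAL}=\frac{1}{\tau}\sum_{i=1}^{\tau}\left[d_i-(i-1)\,\frac{|X|}{\max\left(|Y|,|Y^{*}|\right)}\right],
\qquad \tau=\min\{\,i: d_i\ge |X|\,\}
$$

Here $d_i$ is the delay of the $i$-th target token, $|X|$ the source length, $|Y|$ the hypothesis length, and $|Y^{*}|$ the reference length.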
FASST works for Wait-K on En-Es
FASST reduces computation overhead
FASST works for various combinations
FASST also “works” for Hold-N
Discussion
• FASST works for trainable policies like Wait-K
  • If wait-k works for a language direction, then FASST will work
  • Otherwise, it falls behind other methods
• Attention-based methods like AlignAtt are not compatible with FASST
  • AlignAtt is not trainable, so we can only train an offline model
  • If we avoid recomputation during inference as FASST does, the training-inference mismatch makes the attention information unreliable
• LAAL-CA is flawed
Future Anticipation with LLM
Siqi Ouyang, Oleksii Hrinchuk, Zhehuai Chen, Vitaly Lavrukhin, Jagadeesh Balam, Lei Li, Boris Ginsburg
Motivation
• One technique human interpreters frequently use to reduce latency is anticipating future speech.
• The accuracy of anticipation depends on the context, the language, the interpreter's prior knowledge, etc.
• LLMs are good at predicting future tokens!
• Prior methods tried to model the future implicitly, but not explicitly, especially with an LLM.
Example of Future Anticipation
• Context: A speaker is giving a presentation on the
importance of exercise for heart health. The interpreter is
translating from English to Chinese.
• Speaker: Regular physical activity can significantly reduce
the risk of...
• Interpreter Anticipates: Based on common phrasing in
health-related talks, the interpreter anticipates that the
speaker will mention "heart disease" next.
• Interpretation: 定期的体育活动可以显著降低心脏病的风险 ... ("Regular physical activity can significantly reduce the risk of heart disease ...")
Perfect prediction is impossible
• Prompt: The cat chased the
• LM Predictions
• The cat chased the mouse.
• The cat chased the bird.
• The cat chased the laser pointer.
• etc.
• A mechanism is required to make use of the predictions.
Method
LLM Sampling
• The sampled sentences should follow the LLM's distribution, rather than being the maximum-probability continuations.
• Why? Let x_{1:j} be the partial source and y_{1:i} the partial hypothesis. Ideally, we want the model's output distribution over continuations to look like the sketch below.
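One plausible formalization of this ideal (our assumption of the intended equation, not a quotation from the talk): the distribution over target continuations should marginalize over the unseen future source,

$$
p\left(y_{i+1:}\mid x_{1:j},\,y_{1:i}\right)
=\sum_{x_{j+1:}} p\left(x_{j+1:}\mid x_{1:j}\right)\,
p\left(y_{i+1:}\mid x_{1:j},\,x_{j+1:},\,y_{1:i}\right)
$$

Sampling from the model covers this multi-modal distribution, whereas greedy or argmax decoding commits to a single mode.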
• Thus, we use top-k/top-p sampling here (a minimal example follows).
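A minimal sketch of drawing the candidate continuations. GPT-2 here is a hypothetical stand-in for the Llama2-7B LLM used in the talk, and the prompt is borrowed from the case study later in the deck.

```python
# Sample multiple anticipated continuations with top-k sampling.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "The European Union's chief Brexit"
ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(
    ids,
    do_sample=True,           # sample from the distribution, not argmax/beam
    top_k=10,                 # top-k truncation, as in the talk's setup
    max_new_tokens=10,        # 10 new tokens per candidate, as in the setup
    num_return_sequences=10,  # 10 candidate continuations
    pad_token_id=tok.eos_token_id,
)
for seq in out:
    print(tok.decode(seq[ids.shape[1]:], skip_special_tokens=True))
```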
Relaxed Agreement Longest Common Prefix (RALCP)
• First proposed in [1].
• Given candidate translations t_1, …, t_n and a threshold γ, we find the longest common prefix shared by at least γ·n of the translations.
[1] Wang, M., Zhao, J., Vu, T., Shiri, F., Shareghi, E., & Haffari, G. (2023). Simultaneous Machine Translation with Large Language Models. arXiv:2309.06706.
• Example with γ = 0.6. Candidates:
  • Einige Wochen später, die Abteilung
  • Einige Wochen später, der Abteilung
  • Ein paar Wochen später, der Abteilung
  • Ein paar Wochen später, das Department
  • Ein paar Wochen später, das Departement
• RALCP output: Ein paar Wochen später, ("A few weeks later,")
• RALCP ensures the output translation is agreed upon by most candidates and is (in most cases) consistent with the source (a sketch follows).
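A minimal sketch of RALCP as defined above. This is our own implementation of the idea from [1], using simple whitespace tokenization: greedily extend the prefix with the majority token while at least ⌈γ·n⌉ candidates still agree with it.

```python
import math
from collections import Counter

def ralcp(candidates, gamma):
    """Longest prefix shared by at least ceil(gamma * n) candidate token lists."""
    n = len(candidates)
    threshold = math.ceil(gamma * n)
    prefix, alive = [], list(candidates)
    while True:
        pos = len(prefix)
        tokens = [c[pos] for c in alive if len(c) > pos]
        if not tokens:
            break
        token, count = Counter(tokens).most_common(1)[0]  # majority vote
        if count < threshold:
            break
        prefix.append(token)
        alive = [c for c in alive if len(c) > pos and c[pos] == token]
    return prefix

# The slide's example (commas split off as separate tokens for simplicity):
cands = [
    "Einige Wochen später , die Abteilung".split(),
    "Einige Wochen später , der Abteilung".split(),
    "Ein paar Wochen später , der Abteilung".split(),
    "Ein paar Wochen später , das Department".split(),
    "Ein paar Wochen später , das Departement".split(),
]
print(" ".join(ralcp(cands, gamma=0.6)))  # -> "Ein paar Wochen später ,"
```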
Architecture with SMT
Experiment Setup
• Dataset
  • WMT15 De-En
  • WMT20 En-Zh
• Model config
  • LLM: Llama2-7B-Base
  • MT: Transformer-Big
• Sample 10 candidates of 10 new tokens each, with top-10 sampling
Experiment Setup
• Evaluation
  • BLEU & LAAL
• Baselines
  • Wait-K-N
  • Local Agreement
  • RALCP
  • SM2 (state of the art)
Wait-K-N
• Wait for K tokens at the beginning, then generate N tokens at each step (sketched below).
• Example of Wait-5-3:
  1. The =>
  2. The cat =>
  3. The cat is =>
  4. The cat is on =>
  5. The cat is on the => 这猫在
  6. The cat is on the mat => 这猫在垫子上
• For De-En, N=1. For En-Zh, N=3.
• K = 1-10
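A toy sketch of the Wait-K-N schedule. This is our own illustration; `translate_prefix` and `copy_translator` are hypothetical stand-ins for any incremental MT model that extends a committed hypothesis given a source prefix.

```python
def wait_k_stride_n(source_tokens, k, n, translate_prefix):
    hyp = []
    for j in range(1, len(source_tokens) + 1):   # READ the j-th source token
        if j < k:
            continue                             # still waiting for k tokens
        hyp = translate_prefix(source_tokens[:j], hyp, n)   # WRITE n tokens
    return translate_prefix(source_tokens, hyp, None)       # flush the rest

# Hypothetical stand-in "model": copies source words as the target.
def copy_translator(src_prefix, hyp, n):
    target = list(src_prefix)
    return target[: len(hyp) + n] if n is not None else target

print(wait_k_stride_n("The cat is on the mat".split(), 5, 3, copy_translator))
# step 5 emits 3 tokens, step 6 emits 3 more, mirroring the Wait-5-3 example
```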
Local Agreement
• Generate a hypothesis until EOS at each step. Output the LCP (longest common prefix) of the previous N hypotheses as the current translation (sketched below).
• Example of Local Agreement with N=2 (LA-2):
  • The president will announce new
    • Hypothesis: 总统将宣布新的施政方针
    • Translation: (nothing committed yet)
  • The president will announce new policy
    • Hypothesis: 总统将宣布新的政策明天
    • Translation: 总统将宣布新的
  • The president will announce new policy today
    • Hypothesis: 总统将宣布新的政策今天
    • Translation: 总统将宣布新的政策
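A minimal sketch of Local Agreement (our own illustration): commit only the longest common prefix of the last N full hypotheses.

```python
def local_agreement(history, n):
    """LCP of the last n hypotheses (token lists); empty until n are available."""
    recent = history[-n:]
    if len(recent) < n:
        return []                        # not enough hypotheses to agree yet
    prefix = []
    for tokens in zip(*recent):          # compare position by position
        if len(set(tokens)) != 1:
            break
        prefix.append(tokens[0])
    return prefix

hyps = [
    "总统 将 宣布 新的 施政方针".split(),
    "总统 将 宣布 新的 政策 明天".split(),
]
print(" ".join(local_agreement(hyps, n=2)))  # -> "总统 将 宣布 新的"
```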
RALCP
• The original use case of RALCP:
  • The model conducts beam search at each step
  • Then outputs the RALCP of the beam-search candidates as the translation
Future Anticipation improves translation quality at extremely low latency
Future Anticipation works for different LLMs
Impact of Hyperparameter
Case Study #1: Prediction is Correct
• Wait-K
  • The European Union's chief => 欧盟的首席
  • The European Union's chief Brexit => 欧盟的首席执政者 (mistranslation: "the EU's chief executive")
  • The European Union's chief Brexit negotiator => 欧盟的首席执政者布雷克 (hallucinated transliteration "Breck")
• Wait-K + FP
  • The European Union's chief => 欧盟的首席
  • The European Union's chief Brexit => 欧盟的首席脱欧谈
  • The European Union's chief Brexit negotiator => 欧盟的首席脱欧谈判官 (correct: "the EU's chief Brexit negotiator")
Case Study #1: Prediction is Correct
• Sampled continuations of "The European Union's chief Brexit":
  • secretary said Brexit secretary John Hainther
  • officer, Brett Hollin, was born
  • adviser says there will be a general "The
  • negotiator said the UK had a "strong"
  • secretary and Brexit secretary John Berne,
  • negotiator, John Tudu, has been
  • negotiator has revealed that a majority of Brex
  • negotiator and Brexit Party leader, have
  • negotiator has suggested that the UK should
  • negotiator has criticised Brexit talks
Case Study #2: Reduce Hallucination
• RALCP
  • Sources said the action was in => 消息来源说,这一行动
  • Sources said the action was in line => 消息来源说,这一行动符合《联合国 (begins hallucinating "in line with the 'United Nations ...'")
• RALCP + FP
  • Sources said the action was in => 消息来源说,这一行动是对
  • Sources said the action was in line => 消息来源说,这一行动是对 ("the action was in response to ...")
Case Study #3: Introduce New Hallucination
• RALCP + FP
  • In August, the government => 8月,政府
  • In August, the government compulsorily => 8月,政府强制收购了 ("compulsorily acquired": "acquired" is committed before the source confirms it)
• Sampled predictions:
  • retired three generals and a lieutenant colonel and
  • acquired land in Western Sydney to make way for the
  • amended the general data protection regulation (G
  • acquired the former Daily Mirror printing plant. Then
  • acquired farms that supply the Kangaroo
  • acquired The Bluestone estate in Southern Laos
  • acquired 7,000 acres of land
  • took control of the collapsed Halkbank,
  • purchased the 54,408 shares
  • retired nine commanders and investigations revealed it spent
Discussion
• FP makes an offline model outperform the SOTA SMT model.
• LLMs and offline MT models are easy to get.
• Inference cost is higher, but the relative overhead shrinks if the offline MT model is also large.
• FP still introduces a small amount of hallucination.
Q & A
