Toward Real-Time Simultaneous Translation with LLM
Presented by Siqi Ouyang (CMU)
Simultaneous Translation
• Simultaneous translation generates the translation incrementally, given only partial source input (see the read/write sketch below).
1. The =>
2. The cat => 猫
3. The cat is on => 猫在
4. The cat is on the => 猫在
5. The cat is on the mat => 猫在垫子上
• Three settings by modality:
  • Text-to-Text (SMT, simultaneous machine translation)
  • Speech-to-Text (SST, simultaneous speech translation)
  • Speech-to-Speech
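To make the read/write view above concrete, here is a toy sketch of the generic simultaneous-translation loop. This is our own illustration rather than anything from the talk; `toy_model`, `toy_policy`, and the word lexicon are hypothetical stand-ins, chosen so the loop reproduces the example above.

```python
# A toy sketch of the generic simultaneous-translation loop (our illustration,
# not the talk's system): a policy decides when to READ more source and when
# to WRITE the next target token.
def simultaneous_translate(source_stream, policy, model):
    src, hyp = [], []
    for token in source_stream:              # READ: one more source token arrives
        src.append(token)
        while policy(src, hyp):              # WRITE while the policy allows it
            hyp.append(model(src, hyp))
    while (tok := model(src, hyp)) is not None:  # source finished: flush the rest
        hyp.append(tok)
    return hyp

# Hypothetical stand-ins: a word-for-word "model" and a write-when-possible policy.
lexicon = {"The": "", "cat": "猫", "is": "", "on": "在", "the": "", "mat": "垫子上"}

def toy_model(src, hyp):
    outputs = [lexicon[w] for w in src if lexicon[w]]
    return outputs[len(hyp)] if len(hyp) < len(outputs) else None

def toy_policy(src, hyp):
    return toy_model(src, hyp) is not None

print("".join(simultaneous_translate("The cat is on the mat".split(), toy_policy, toy_model)))
# -> 猫在垫子上, matching the example above
```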
What Does Real-Time Mean?
• Real-time = extremely low latency = sub-second latency
• Two kinds of latency:
  • Computational: time taken to compute model features
  • Algorithmic: latency induced purely by the model's read/write decisions
• Why do we want sub-second latency?
  • Seamless cross-lingual conversation
Status Quo
• State-of-the-art SST works well at a 2-second algorithmic latency, but quality suffers at sub-second algorithmic latency.
Path Toward Real-Time
• Achieve better translation quality using an LLM as the backbone
  • CMU IWSLT24 SST: 1st place on En-De in terms of human rating
• Reduce the computational latency of the above system
  • Avoid feature recomputation without sacrificing too much quality
• Future anticipation with LLM
  • Ongoing work
CMU's IWSLT 2024 Simultaneous Speech Translation System
Xi Xu, Siqi Ouyang, Brian Yan, Patrick Fernandes, William Chen, Lei Li, Graham Neubig, Shinji Watanabe
SimulST Settings
• AL (Average Lagging) must be below 2 s on the MuST-C v2.0 tst-COMMON set (sketched below)
• Training data must adhere to the "constrained with LLMs" condition of the offline task
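Since the track constrains AL, a minimal sketch of (non-computation-aware) Average Lagging may help. This is our own illustration following the standard definition, not the official SimulEval implementation.

```python
# Average Lagging (AL), minimal sketch of the standard definition.
# delays[i] = amount of source (e.g., words or seconds) consumed when
# target unit i+1 was emitted.
def average_lagging(delays, src_len, tgt_len):
    gamma = tgt_len / src_len                       # target/source length ratio
    # tau = first target index whose delay already covers the whole source
    tau = next(i for i, d in enumerate(delays, start=1) if d >= src_len)
    return sum(delays[i - 1] - (i - 1) / gamma for i in range(1, tau + 1)) / tau

# A wait-3 schedule on a 5-word source with a 5-word target lags by 3 words:
print(average_lagging([3, 4, 5, 5, 5], src_len=5, tgt_len=5))  # -> 3.0
```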
Training
• We use MuST-C v2.0 as the only training set and leverage pretrained WavLM and Llama2-7B-Base models.
• The total parameter size is 7.4B.
• Training time is 38 hours.
Data Filtering
• Simply removing unnecessary names gives us a 0.8 BLEU score increase.
Other Teams
• FBK:
  o SeamlessM4T
  o Policy: AlignAtt
• HW-TSC:
  o Cascade with offline ASR & an ensemble of offline MT models
  o Policy: a modification of onlinization (Polák et al., 2022)
• NAIST:
  o HuBERT + mBART
  o Policy: local agreement
Results
• Best human rating!
• Our large model achieves competitive computation-aware latency.
Fast LLM-based Simultaneous Speech Translation
Siqi Ouyang, Xi Xu, Chinmay Dandekar, Lei Li
Motivation
• In our previous IWSLT submission, we recompute the features of the entire speech input and all previously generated text after receiving each speech segment, in order to get the best quality.
• We want a method that avoids recomputation while keeping the translation quality.
How to Avoid Recomputation?
• During inference, speech and text interleave: [speech] [text] [speech] [text] [speech] …
• If we want to avoid recomputation, we need to keep this interleaved order when caching features.
• However, this is inconsistent with how the model is trained: during training, the full speech always comes before the text.
• We need to address this training-inference mismatch (see the sketch below).
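To see what avoiding recomputation buys, here is a text-only analogy using a cached decoder state. This is our own sketch, not FASST itself: GPT-2 via Hugging Face transformers is a stand-in model, and the text chunks are hypothetical stand-ins for arriving speech segments. The real system would interleave cached speech and text states in exactly the arrival order shown above.

```python
# Contrast recomputing the whole prefix each step with reusing cached states.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

chunks = ["The cat", " is on", " the mat"]   # stand-ins for arriving segments

with torch.no_grad():
    # (a) Recomputation: re-encode the whole prefix after every chunk (O(n^2) overall).
    prefix = ""
    for chunk in chunks:
        prefix += chunk
        ids = tok(prefix, return_tensors="pt").input_ids
        out = model(ids)                     # recomputes all earlier chunks too

    # (b) Incremental: cache past states and encode only the new chunk (O(n) overall).
    past = None
    for chunk in chunks:
        ids = tok(chunk, return_tensors="pt").input_ids
        out = model(ids, past_key_values=past, use_cache=True)
        past = out.past_key_values           # the cache keeps the arrival order fixed
```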
Method
Experiment Setup
• Dataset: MuST-C En-De/Es
  • Concatenate adjacent audio clips into 30-second audio clips
• Model config
  • wav2vec 2.0 Large as the encoder, Llama2-7B-Base as the decoder
  • Train stage 1 for 500k steps and stage 2 for 1 epoch
  • Speech segment size = 1 second, roughly the block size of the speech encoder
  • Wait-K-stride-N with N=3
  • Greedy decoding during inference
Experiment Setup
• Evaluation
  • BLEU and LAAL-CA (see the definition sketch after this list)
  • LAAL-CA is the SimulEval version, but it is problematic
  • Batch size = 8 to stress test
  • Single A6000 GPU
• Baselines
  • Wait-K LST (7B): test-time wait-k of the offline speech LLM
  • EDAtt, AlignAtt (0.1B): previous state of the art
  • SeamlessStreaming
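For reference, a hedged sketch of the metric as we understand it (the SimulEval implementation may differ in details, which is part of the criticism above): LAAL is Average Lagging normalized by the longer of hypothesis and reference, and the computation-aware variant (LAAL-CA) measures the delays in wall-clock time, including model computation:

$$
\mathrm{LAAL}=\frac{1}{\tau}\sum_{i=1}^{\tau}\left[d_i-(i-1)\,\frac{|X|}{\max\left(|Y|,|Y^{*}|\right)}\right],
\qquad \tau=\min\{\,i: d_i\ge |X|\,\}
$$

Here $d_i$ is the delay of the $i$-th target token, $|X|$ the source length, $|Y|$ the hypothesis length, and $|Y^{*}|$ the reference length.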
FASST works for Wait-K on En-Es
FASST reduces computation overhead
FASST works for various combinations
FASST also “works” for Hold-N
Discussion
• FASST works for trainable policies like Wait-K
  • If wait-k works for a language direction, then FASST will work
  • Otherwise, it falls behind other methods
• Attention-based methods like AlignAtt are not compatible with FASST
  • AlignAtt is not trainable, so we can only train an offline model
  • If we avoid recomputation during inference as FASST does, the training-inference mismatch makes the attention information unreliable
• LAAL-CA is flawed
Future Anticipation with LLM
Siqi Ouyang, Oleksii Hrinchuk, Zhehuai Chen, Vitaly Lavrukhin, Jagadeesh Balam, Lei Li, Boris Ginsburg
Motivation
• One technique human interpreters frequently use to reduce latency is anticipating future speech.
• The accuracy of anticipation depends on the context, the language, the interpreter's prior knowledge, etc.
• LLMs are good at predicting future tokens!
• Prior methods tried to model the future implicitly, but not explicitly, especially with an LLM.
Example of Future Anticipation
• Context: A speaker is giving a presentation on the
importance of exercise for heart health. The interpreter is
translating from English to Chinese.
• Speaker: Regular physical activity can significantly reduce
the risk of...
• Interpreter Anticipates: Based on common phrasing in
health-related talks, the interpreter anticipates that the
speaker will mention "heart disease" next.
• Interpretation: 定期的体育活动可以显著降低心脏病的风险 ... ("Regular physical activity can significantly reduce the risk of heart disease ...")
Perfect prediction is impossible
• Prompt: The cat chased the
• LM Predictions
• The cat chased the mouse.
• The cat chased the bird.
• The cat chased the laser pointer.
• etc.
• A mechanism is required to make use of the predictions.
Method
LLM Sampling
• The sampled sentences should follow the LLM's distribution, rather than being the maximum-probability continuations.
• Why? Let x_{1:j} be the partial source and y_{1:i} the partial hypothesis. Ideally, we want the model's output distribution over continuations to look like the sketch below.
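One plausible formalization of this ideal (our assumption of the intended equation, not a quotation from the talk): the distribution over target continuations should marginalize over the unseen future source,

$$
p\left(y_{i+1:}\mid x_{1:j},\,y_{1:i}\right)
=\sum_{x_{j+1:}} p\left(x_{j+1:}\mid x_{1:j}\right)\,
p\left(y_{i+1:}\mid x_{1:j},\,x_{j+1:},\,y_{1:i}\right)
$$

Sampling from the model covers this multi-modal distribution, whereas greedy or argmax decoding commits to a single mode.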
• Thus, we use top-k/top-p sampling here (a minimal example follows).
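A minimal sketch of drawing the candidate continuations. GPT-2 here is a hypothetical stand-in for the Llama2-7B LLM used in the talk, and the prompt is borrowed from the case study later in the deck.

```python
# Sample multiple anticipated continuations with top-k sampling.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "The European Union's chief Brexit"
ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(
    ids,
    do_sample=True,           # sample from the distribution, not argmax/beam
    top_k=10,                 # top-k truncation, as in the talk's setup
    max_new_tokens=10,        # 10 new tokens per candidate, as in the setup
    num_return_sequences=10,  # 10 candidate continuations
    pad_token_id=tok.eos_token_id,
)
for seq in out:
    print(tok.decode(seq[ids.shape[1]:], skip_special_tokens=True))
```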
Relaxed Agreement Longest Common Prefix (RALCP)
• First proposed in [1].
• Given candidate translations t_1, …, t_n and a threshold γ, we find the longest common prefix shared by at least γ·n of the translations.
[1] Wang, M., Zhao, J., Vu, T., Shiri, F., Shareghi, E., & Haffari, G. (2023). Simultaneous Machine Translation with Large Language Models. arXiv:2309.06706.
• Example with γ = 0.6. Candidates:
  • Einige Wochen später, die Abteilung
  • Einige Wochen später, der Abteilung
  • Ein paar Wochen später, der Abteilung
  • Ein paar Wochen später, das Department
  • Ein paar Wochen später, das Departement
• RALCP output: Ein paar Wochen später, ("A few weeks later,")
• RALCP ensures the output translation is agreed upon by most candidates and is (in most cases) consistent with the source (a sketch follows).
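A minimal sketch of RALCP as defined above. This is our own implementation of the idea from [1], using simple whitespace tokenization: greedily extend the prefix with the majority token while at least ⌈γ·n⌉ candidates still agree with it.

```python
import math
from collections import Counter

def ralcp(candidates, gamma):
    """Longest prefix shared by at least ceil(gamma * n) candidate token lists."""
    n = len(candidates)
    threshold = math.ceil(gamma * n)
    prefix, alive = [], list(candidates)
    while True:
        pos = len(prefix)
        tokens = [c[pos] for c in alive if len(c) > pos]
        if not tokens:
            break
        token, count = Counter(tokens).most_common(1)[0]  # majority vote
        if count < threshold:
            break
        prefix.append(token)
        alive = [c for c in alive if len(c) > pos and c[pos] == token]
    return prefix

# The slide's example (commas split off as separate tokens for simplicity):
cands = [
    "Einige Wochen später , die Abteilung".split(),
    "Einige Wochen später , der Abteilung".split(),
    "Ein paar Wochen später , der Abteilung".split(),
    "Ein paar Wochen später , das Department".split(),
    "Ein paar Wochen später , das Departement".split(),
]
print(" ".join(ralcp(cands, gamma=0.6)))  # -> "Ein paar Wochen später ,"
```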
Architecture with SMT
Experiment Setup
• Dataset
  • WMT15 De-En
  • WMT20 En-Zh
• Model config
  • LLM: Llama2-7B-Base
  • MT: Transformer-Big
• Sample 10 candidates of 10 new tokens each, with top-10 sampling
Experiment Setup
• Evaluation
  • BLEU & LAAL
• Baselines
  • Wait-K-N
  • Local Agreement
  • RALCP
  • SM2 (state of the art)
Wait-K-N
• Wait for K tokens at the beginning, then generate N tokens at each step (sketched below).
• Example of Wait-5-3:
  1. The =>
  2. The cat =>
  3. The cat is =>
  4. The cat is on =>
  5. The cat is on the => 这猫在
  6. The cat is on the mat => 这猫在垫子上
• For De-En, N=1. For En-Zh, N=3.
• K = 1-10
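A toy sketch of the Wait-K-N schedule. This is our own illustration; `translate_prefix` and `copy_translator` are hypothetical stand-ins for any incremental MT model that extends a committed hypothesis given a source prefix.

```python
def wait_k_stride_n(source_tokens, k, n, translate_prefix):
    hyp = []
    for j in range(1, len(source_tokens) + 1):   # READ the j-th source token
        if j < k:
            continue                             # still waiting for k tokens
        hyp = translate_prefix(source_tokens[:j], hyp, n)   # WRITE n tokens
    return translate_prefix(source_tokens, hyp, None)       # flush the rest

# Hypothetical stand-in "model": copies source words as the target.
def copy_translator(src_prefix, hyp, n):
    target = list(src_prefix)
    return target[: len(hyp) + n] if n is not None else target

print(wait_k_stride_n("The cat is on the mat".split(), 5, 3, copy_translator))
# step 5 emits 3 tokens, step 6 emits 3 more, mirroring the Wait-5-3 example
```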
Local Agreement
• Generate a hypothesis until EOS at each step. Output the LCP (longest common prefix) of the previous N hypotheses as the current translation (sketched below).
• Example of Local Agreement with N=2 (LA-2):
  • The president will announce new
    • Hypothesis: 总统将宣布新的施政方针
    • Translation: (nothing committed yet)
  • The president will announce new policy
    • Hypothesis: 总统将宣布新的政策明天
    • Translation: 总统将宣布新的
  • The president will announce new policy today
    • Hypothesis: 总统将宣布新的政策今天
    • Translation: 总统将宣布新的政策
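A minimal sketch of Local Agreement (our own illustration): commit only the longest common prefix of the last N full hypotheses.

```python
def local_agreement(history, n):
    """LCP of the last n hypotheses (token lists); empty until n are available."""
    recent = history[-n:]
    if len(recent) < n:
        return []                        # not enough hypotheses to agree yet
    prefix = []
    for tokens in zip(*recent):          # compare position by position
        if len(set(tokens)) != 1:
            break
        prefix.append(tokens[0])
    return prefix

hyps = [
    "总统 将 宣布 新的 施政方针".split(),
    "总统 将 宣布 新的 政策 明天".split(),
]
print(" ".join(local_agreement(hyps, n=2)))  # -> "总统 将 宣布 新的"
```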
RALCP
• The original use case of RALCP:
  • The model conducts beam search at each step
  • Then outputs the RALCP of the beam-search candidates as the translation
Future Anticipation improves translation quality at extremely low latency
Future Anticipation works for different LLMs
Impact of Hyperparameter
Case Study #1: Prediction is Correct
• Wait-K
  • The European Union's chief => 欧盟的首席
  • The European Union's chief Brexit => 欧盟的首席执政者 (mistranslation: "the EU's chief executive")
  • The European Union's chief Brexit negotiator => 欧盟的首席执政者布雷克 (hallucinated transliteration "Breck")
• Wait-K + FP
  • The European Union's chief => 欧盟的首席
  • The European Union's chief Brexit => 欧盟的首席脱欧谈
  • The European Union's chief Brexit negotiator => 欧盟的首席脱欧谈判官 (correct: "the EU's chief Brexit negotiator")
Case Study #1: Prediction is Correct
• Sampled continuations of "The European Union's chief Brexit":
  • secretary said Brexit secretary John Hainther
  • officer, Brett Hollin, was born
  • adviser says there will be a general "The
  • negotiator said the UK had a "strong"
  • secretary and Brexit secretary John Berne,
  • negotiator, John Tudu, has been
  • negotiator has revealed that a majority of Brex
  • negotiator and Brexit Party leader, have
  • negotiator has suggested that the UK should
  • negotiator has criticised Brexit talks
Case Study #2: Reduce Hallucination
• RALCP
  • Sources said the action was in => 消息来源说,这一行动
  • Sources said the action was in line => 消息来源说,这一行动符合《联合国 (begins hallucinating "in line with the 'United Nations ...'")
• RALCP + FP
  • Sources said the action was in => 消息来源说,这一行动是对
  • Sources said the action was in line => 消息来源说,这一行动是对 ("the action was in response to ...")
Case Study #3: Introduce New Hallucination
• RALCP + FP
  • In August, the government => 8月,政府
  • In August, the government compulsorily => 8月,政府强制收购了 ("compulsorily acquired": "acquired" is committed before the source confirms it)
• Sampled predictions:
  • retired three generals and a lieutenant colonel and
  • acquired land in Western Sydney to make way for the
  • amended the general data protection regulation (G
  • acquired the former Daily Mirror printing plant. Then
  • acquired farms that supply the Kangaroo
  • acquired The Bluestone estate in Southern Laos
  • acquired 7,000 acres of land
  • took control of the collapsed Halkbank,
  • purchased the 54,408 shares
  • retired nine commanders and investigations revealed it spent
Discussion
• FP makes an offline model outperform the SOTA SMT model.
• LLMs and offline MT models are easy to get.
• Inference cost is higher, but the relative overhead shrinks if the offline MT model is also large.
• FP still introduces a small amount of hallucination.
Q & A
