deep encoder, shallow decoder reevaluating non-autoregressive machine translation ppt

Deep Encoder, Shallow Decoder:
Reevaluating Non-autoregressive Machine Translation
NLP team: 김수빈, 신문종, 박희수, 조진욱, 진명훈, 황경진 (presenter)

Index
• Introduction
• Background
• Problem Statement
• Objectives
• Methodology
• Technical Contribution
• Reevaluating NMT
• Experiments
• Results and discussion
• Further analysis
• Conclusion

Background
• Existing SOTA NMT systems: autoregressive
- autoregressive: each word is predicted one by one, conditioned on the words before it
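- Given an input $X = (x_1, \dots, x_{T_x})$, the probability of a candidate output $Y = (y_1, \dots, y_{T'})$ is computed as a chain of conditional probabilities, where $y_0$ and $y_{T'+1}$ denote the begin- and end-of-sentence tokens:
$$P_{AR}(Y \mid X; \theta) = \prod_{t=1}^{T'+1} p(y_t \mid y_{0:t-1}, x_{1:T_x}; \theta)$$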
- Maximum likelihood training: maximum likelihood is applied at every decoding step
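
As a minimal sketch of the chain-rule factorization on this slide, the snippet below multiplies per-step probabilities into a sequence probability and forms the negative log-likelihood that maximum likelihood training minimizes. The per-step probabilities and the function name `sequence_log_prob` are made up for illustration; they do not come from the paper.

```python
import math

def sequence_log_prob(step_probs):
    """Sum of log p(y_t | y_{0:t-1}, x) over all decoding steps."""
    return sum(math.log(p) for p in step_probs)

# Hypothetical probabilities an AR decoder assigned to three reference
# tokens (two words and the end-of-sentence token).
step_probs = [0.5, 0.9, 0.95]

log_p = sequence_log_prob(step_probs)
nll = -log_p                    # quantity minimized by maximum likelihood training
sequence_prob = math.exp(log_p)  # product of the per-step probabilities
```

Each factor depends on all earlier target tokens, which is why decoding has to proceed one step at a time.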

Background
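• Existing SOTA NMT systems: autoregressive
- Problem: in an autoregressive model, the decoder depends on the tokens it has already generated when producing each new token, so decoding cannot be parallelized and inference time becomes long.
- To address this, much recent work has studied non-autoregressive NMT, which parallelizes the decoder.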

Background
Source: Wei, Bingzhen, et al. "Imitation learning for non-autoregressive neural machine translation." arXiv preprint arXiv:1906.02041 (2019).

Problem Statement
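• Non-autoregressive machine translation (hereafter NAR)
- Multimodality problem: parallel decoding in NAR assumes conditional independence between output tokens. This prevents the model from properly capturing the multimodal distribution of target translations, so translation quality tends to drop.
- That is, when a word can be translated in several ways, a conditionally independent distribution cannot condition one position on the word chosen at another, so it cannot produce a translation that accounts for all the possibilities.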
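
The multimodality problem can be illustrated with a toy example. The sketch below assumes a target distribution with two equally likely two-word translations (an illustrative example, not data from the slides) and shows that a model restricted to independent per-position distributions puts half of its probability mass on mixed, invalid outputs.

```python
from itertools import product

# Toy target distribution: two equally likely, valid translations of "Thank you".
true_dist = {("Danke", "schön"): 0.5, ("Vielen", "Dank"): 0.5}

def position_marginals(dist):
    """Per-position marginals: all that a fully independent decoder can model."""
    length = len(next(iter(dist)))
    marginals = [{} for _ in range(length)]
    for sentence, prob in dist.items():
        for i, token in enumerate(sentence):
            marginals[i][token] = marginals[i].get(token, 0.0) + prob
    return marginals

def independent_dist(marginals):
    """Joint distribution obtained by multiplying the per-position marginals."""
    joint = {}
    for combo in product(*(m.items() for m in marginals)):
        tokens = tuple(tok for tok, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        joint[tokens] = prob
    return joint

nar_dist = independent_dist(position_marginals(true_dist))
# Probability mass assigned to outputs such as ("Danke", "Dank").
invalid_mass = sum(p for sent, p in nar_dist.items() if sent not in true_dist)
```

Here `invalid_mass` is the share of probability given to outputs that mix the two valid translations.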

Problem Statement
• Non-autoregressive machine translation (hereafter NAR)
- Speed-quality tradeoff: the focus of the solution this study proposes.
- Kim et al. (2019) proposed a better speed-quality tradeoff by giving the encoder and decoder different depths.
- Accordingly, this study proposes the deep encoder, shallow decoder configuration.

Objectives
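• Reexamine how NAR performance has been measured so far, and introduce additional concepts to measure the performance of NAR and AR models more effectively.
• Vary the encoder and decoder depths in both NAR and AR models to find which configuration performs best.
• Use this analysis to suggest which aspects future research should focus on to improve NAR and AR models respectively.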

Methodology
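• Provide a speed-quality comparison between NAR and AR models under two scenarios.
- 1st scenario: designed to simulate instantaneous machine translation of a user's text (or speech) input. Parallelism across decoding positions on the GPU can be fully exploited here.
- 2nd scenario: aims to translate a large amount of text as quickly as possible. In this case AR models also exploit GPU parallelism, so they generally run faster than NAR models by a large margin.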

Technical Contribution
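• This study revisits the validity of three conventional assumptions in NAR evaluation: suboptimal layer allocation, lack of distillation for AR baselines, and insufficiently general speed measures.
• This study provides a complexity analysis and identifies an optimal layer allocation strategy that leads to a better speed-quality tradeoff, namely the deep-shallow configuration.
• This study performs extensive analyses and head-to-head comparisons of AR and NAR models on seven standard translation directions.
• This study demonstrates that the accuracy gap between the two model families is much wider than previously thought, and that NAR models cannot capture target word order well without sufficiently deep decoders.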

REEVALUATING NMT
• Speed Measures
- Existing approach: focused solely on the setting of translating one sentence at a time, where full parallelization is trivial with a single GPU.
- S1 measures speed when translating one sentence at a time. It applies to applications such as instantaneous machine translation, which translates a user's text input immediately.
- Smax measures speed when translating in mini-batches as large as the hardware allows. It applies to scenarios where a large amount of given text must be translated, as with a service like Google Cloud.
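
The two speed measures can be sketched as a small timing harness. This is only an illustration under stated assumptions: `dummy_translate_batch` is a hypothetical stand-in for a real model, throughput is reported in sentences per second, and a batch size of 32 plays the role of "as large as the hardware allows".

```python
import time

def sentences_per_second(translate_batch, sentences, batch_size):
    """Translate `sentences` in chunks of `batch_size` and return the throughput."""
    start = time.perf_counter()
    for i in range(0, len(sentences), batch_size):
        translate_batch(sentences[i:i + batch_size])
    elapsed = time.perf_counter() - start
    return len(sentences) / elapsed

def dummy_translate_batch(batch):
    """Hypothetical model: a fixed cost per call plus a small cost per sentence."""
    time.sleep(0.002 + 0.0001 * len(batch))
    return [s.upper() for s in batch]

test_set = [f"sentence {i}" for i in range(64)]

# S1: one sentence at a time.
s1 = sentences_per_second(dummy_translate_batch, test_set, batch_size=1)
# Smax: the largest batch the (assumed) hardware allows.
s_max = sentences_per_second(dummy_translate_batch, test_set, batch_size=32)
```

With a real model, the same harness would be run once per system, and the two numbers reported separately rather than collapsed into one speed figure.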

• Deep encoder and Shallow decoder
- Why “Deep encoder and Shallow decoder”, not “Shallow encoder and Deep
decoder”?
- “an AR model with a deep-shallow configuration retains translation accuracy,
but can substantially reduce decoding time. This is because at inference time, the
encoder accounts for a smaller part of the overhead since its computation can be
easily parallelized over source positions; on the other hand, the speed up gains
from a lightweight decoder are substantial.”
REEVALUATING NMT

• Complexity analysis
- Focus on two main key properties: (1) the total amount of operations and (2)
time complexity when full parallelization is assumed (Harris, 2007)
- N: text length / T: the number of iterations in an iterative NAR method
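(typically T < N) / E: number of encoder layers / D: number of decoder layers
- AR and NAR models use the same encoder structure and differ only in the decoder. First, the total amount of operations is quadratic in sequence length for both, but an NAR decoder with T decoding iterations needs T times more computation. Second, the AR decoder has time complexity quadratic in sequence length.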
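
The bullets above can be turned into a rough cost model. In the sketch below, the quadratic total operations, the factor T for iterative NAR decoders, and the quadratic AR decoder time follow the slide; the encoder time term `N * E` and the NAR decoder time term `N * T * D` are assumptions chosen to be consistent with those bullets. Constant factors are ignored, and N = 25 is an arbitrary sentence length.

```python
def total_ops(N, E, D, T=None):
    """Total operations up to a constant factor; T=None means an AR decoder."""
    decoder_passes = 1 if T is None else T
    return N * N * (E + decoder_passes * D)

def parallel_time(N, E, D, T=None):
    """Time under full parallelization, up to a constant factor."""
    # AR: one sequential step per target token; NAR: one step per iteration.
    sequential_steps = N if T is None else T
    return N * (E + sequential_steps * D)

N = 25  # assumed sentence length in tokens

ar_6_6 = parallel_time(N, E=6, D=6)
ar_12_1 = parallel_time(N, E=12, D=1)
nar_6_6 = parallel_time(N, E=6, D=6, T=10)

ops_ar_6_6 = total_ops(N, E=6, D=6)
ops_nar_6_6 = total_ops(N, E=6, D=6, T=10)
```

Under these assumptions the NAR 6-6 model is faster than AR 6-6 when fully parallelized (T < N) but needs more total operations, while moving layers from the decoder to the encoder (AR 12-1) shrinks the dominant AR term.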
REEVALUATING NMT

• Complexity analysis
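• As the table above shows, the time complexity of both AR and NAR models is determined by the decoder. NAR models have an advantage over AR models when T < N.
➢ Reducing the number of decoder layers speeds up S1 considerably, while adding encoder layers slows it down only moderately.
• T is the most important factor for total operations; empirically, at least 4 iterations are needed to reach performance comparable to AR.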
REEVALUATING NMT

• Knowledge Distillation
- Most NAR models rely on knowledge distillation to reach a reasonable speed-quality tradeoff, and prior work has assumed that AR baselines do not need knowledge distillation.
- This study applies knowledge distillation to both AR and NAR models for comparison. The results show that AR models also benefit from distillation, and that the accuracy gap between AR and NAR models is wider than previously established.
REEVALUATING NMT
The goal of knowledge distillation is "to transfer the knowledge of a large, well-trained network (teacher network) to the small network (student network) that will actually be used".
Source: https://light-tree.tistory.com/196
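
A minimal sketch of how distillation data is typically built for translation, assuming the sequence-level variant in which the student is trained on the teacher's translations instead of the human references. `toy_teacher` and its tiny lexicon are hypothetical stand-ins for a trained AR teacher model.

```python
def distill_corpus(teacher_translate, sources):
    """Pair each source sentence with the teacher's output instead of the reference."""
    return [(src, teacher_translate(src)) for src in sources]

def toy_teacher(src):
    """Hypothetical stand-in for a trained AR teacher model."""
    lexicon = {"thank": "danke", "you": "dir"}
    return " ".join(lexicon.get(tok, tok) for tok in src.split())

raw_sources = ["thank you", "thank you all"]
distilled_pairs = distill_corpus(toy_teacher, raw_sources)
# A student (AR or NAR) would then be trained on `distilled_pairs`
# rather than on the original (source, human reference) pairs.
```

Applying the same procedure to the AR baseline is what puts the AR and NAR models on an equal footing in the comparison.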

Experiments
• This study compares NAR and AR models with different layer allocations on standard translation datasets.
- This study compares two existing NAR models.
✓ CMLM (Ghazvininejad et al., 2019) predicts randomly masked target tokens given observed ones as
well as the source. At inference time, it first predicts all target words nonautoregressively, and then
iteratively masks and predicts the words that the model is least confident about. Following previous
practice (Ghazvininejad et al., 2019; 2020b), we decode 5 candidate lengths in parallel (length beam)
with T = 4 or T = 10 iterations.
✓ DisCo (Kasai et al., 2020) predicts every target token given an arbitrary subset of the rest of the target
tokens. Following Kasai et al. (2020), we use their parallel easy-first inference, and set the maximum
number of iterations to 10 and the length beam size to 5.
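
The CMLM inference procedure described above (predict everything, then repeatedly mask and re-predict the least confident tokens) can be sketched as follows. This is a simplified illustration: the linear schedule for the number of re-masked tokens is an assumption, the length beam is omitted, and `toy_predict` is a hypothetical model that becomes confident about a position once a neighboring token is observed.

```python
def mask_predict(predict, length, iterations, mask="<mask>"):
    """Toy mask-predict loop over a target of fixed, known length."""
    tokens = [mask] * length
    probs = [0.0] * length
    for t in range(iterations):
        new_tokens, new_probs = predict(tokens)
        for i, tok in enumerate(tokens):
            if tok == mask:  # only masked positions are overwritten
                tokens[i], probs[i] = new_tokens[i], new_probs[i]
        # Assumed schedule: linearly fewer positions are re-masked each pass.
        n_mask = int(length * (iterations - t - 1) / iterations)
        if n_mask == 0:
            break
        least_confident = sorted(range(length), key=lambda i: probs[i])[:n_mask]
        for i in least_confident:
            tokens[i] = mask
    return tokens

def toy_predict(tokens):
    """Hypothetical model: correct and confident only next to an observed token."""
    target = ["many", "thanks", "to", "all"]
    out_tokens, out_probs = [], []
    for i in range(len(tokens)):
        neighbors = tokens[max(0, i - 1):i] + tokens[i + 1:i + 2]
        observed = any(tok != "<mask>" for tok in neighbors)
        confident = observed or i == 0
        out_tokens.append(target[i] if confident else "thanks")
        out_probs.append(0.9 if confident else 0.4)
    return out_tokens, out_probs

result = mask_predict(toy_predict, length=4, iterations=4)
```

Each pass runs the decoder over all positions in parallel, so T iterations cost T full decoder passes.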

Experiments
• Experimental Setup
- Dataset >> WMT14 EN-DE (4.5M pairs, Bojar et al., 2014), WMT16 EN-RO
(610K, Bojar et al., 2016), WMT17 EN-ZH (20M, Bojar et al., 2017), and WMT14
EN-FR (36M, EN-> FR only).
- Preprocessing: This study follows the preprocessing and data splits of previous
work.
- Evaluation: SacreBLEU (EN – ZH), BLEU (for others)
- Hyperparameters: Base-sized transformer (8 attention heads, 512 model
dimensions, 2048 hidden dimensions for both encoder and decoder). BLEU is
measured after each epoch, and this study averages the 5 best checkpoints to
obtain the final model.
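
Checkpoint averaging as described in the hyperparameter bullet can be sketched in plain Python. The checkpoint layout (a BLEU score plus a dict of flat parameter lists) is a hypothetical format for illustration, and the BLEU scores are assumed to come from held-out validation data.

```python
def average_best_checkpoints(checkpoints, k=5):
    """Average the parameters of the k checkpoints with the highest BLEU."""
    best = sorted(checkpoints, key=lambda c: c["bleu"], reverse=True)[:k]
    names = best[0]["params"].keys()
    return {
        name: [sum(vals) / len(best) for vals in zip(*(c["params"][name] for c in best))]
        for name in names
    }

# Hypothetical checkpoints: one per epoch, each with a BLEU score and parameters.
checkpoints = [
    {"bleu": 20.0 + epoch, "params": {"w": [float(epoch), 2.0 * epoch]}}
    for epoch in range(1, 8)
]

averaged = average_best_checkpoints(checkpoints, k=5)
```

In this toy data the five best checkpoints are epochs 3 to 7, so each averaged parameter is the mean over those epochs.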

Results and Discussion
BLEU and speed comparisons with varying numbers of encoder and decoder layers on the
test data. 12-1 denotes 12 encoder layers and 1 decoder layer.

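• Interestingly, all NAR models achieve slower Smax than the AR 6-6 baseline.
- This is consistent with the complexity analysis in Section 2.2.2: with the same layer allocation, iterative NAR models require more total computation than AR models.
- AR 12-1 still gains a considerable speedup over AR 6-6 (2.0x on RO->EN).
- These results suggest that current NAR models bring little benefit when translating a large amount of text given in advance, and the study argues that this distinction should be made clear when discussing translation speed.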

These results illustrate that the strategy of having a deep encoder and shallow decoder
remains effective in large bitext settings, when the model has to learn potentially more
complex distributions from more samples.

Overall, our AR deep-shallow models outperform most NAR models, with the only
exception being EN->RO where it underperforms Imputer by 0.6 BLEU points.

• In this section, this study presents two controlled experiments to
compare NAR and AR models thoroughly.
- S1 Speed Constraint
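* To confirm the results of Section 4.1 above, the AR deep-shallow model is further compared with the two NAR models under a controlled S1 speed. Specifically, NAR models with varying encoder depths are tested, each paired with as many decoder layers as possible until it reaches the S1 speed of AR 12-1.
* All NAR models improve as the encoder becomes deeper and surpass the scores of the 6-6 baselines (shown as squares along x = 6). Nonetheless, a large BLEU gap from AR 12-1 remains. This shows that the two NAR models cannot match the accuracy of AR deep-shallow within the same S1 speed budget.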

• In this section, this study presents two controlled experiments to
compare NAR and AR models thoroughly.
- Layer Constraint
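* NAR models perform well when the decoder and encoder are balanced, whereas AR models perform consistently well with 4 or more encoder layers.
* This shows that using a deep encoder and a shallow decoder is more effective for AR models than for NAR models. Note that a decoder layer contains 30% more parameters than an encoder layer because of cross attention, so the layer allocations differ in their total number of parameters.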
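
The parameter difference between encoder and decoder layers can be checked with a quick count. The sketch below uses the base transformer sizes from the experimental setup and, as a simplifying assumption, ignores biases and layer normalization parameters.

```python
d_model, d_ff = 512, 2048  # base transformer sizes from the experimental setup

attention = 4 * d_model * d_model  # query, key, value and output projections
feed_forward = 2 * d_model * d_ff  # two linear maps: d_model -> d_ff -> d_model

encoder_layer = attention + feed_forward      # self-attention + feed-forward
decoder_layer = 2 * attention + feed_forward  # adds one cross-attention block

extra = decoder_layer / encoder_layer - 1.0   # relative increase per decoder layer
```

Under these assumptions a decoder layer has about one third more weights than an encoder layer, in line with the roughly 30% figure above. A 12-1 model therefore does not have the same parameter count as a 6-6 model even though it has a similar number of layers.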

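• Decoder Depth and Reordering Words
- The previous results show that NAR models need deeper decoders than AR models to perform well.
- This study conjectures that one reason is that NAR decoders must learn to adapt to diverse word orders between the source and the target. That is, an AR decoder takes all preceding tokens as input and explicitly learns the conditional distribution, whereas an NAR decoder must learn to order the target words from scratch.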
Further analysis
Figure 4: WMT14 EN->DE test results in
BLEU using reordered English input.

• Decoder Depth and Reordering Words
- AR gains the same improvement regardless of
the layer configuration; in contrast, NAR 12-1
benefits more than NAR 6-6. This result supports
our hypothesis that word reordering is one reason
why NAR models need a deeper decoder.
Further analysis
Figure 4: WMT14 EN->DE test results in BLEU using reordered English input.

• Effect of Distillation
- AR models with distillation can be an additional
baseline for future NAR research. AR deep-shallow
deteriorates much less on the raw data compared
to the iterative NAR methods, suggesting that the
strategy of speeding up AR models is better suited
to modeling raw, complex data than the NAR
methods.
Further analysis
BLEU that analyze the effects of distillation in
fast translation methods.

• Breakdown by Sentence Length
- In the results below, nearly identical patterns are observed between AR 6-6 and the deep-shallow models, showing that they perform similarly regardless of translation length.
Further analysis

• Can we reduce the decoder further?
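- This study found that an AR model with a single-layer decoder and a deep encoder can retain the accuracy of the baseline with 6 layers each.
- This raises the question of whether the decoder can be made even more compact.
- Preliminary experiments in this study showed that the feed-forward module can be removed from the decoder without hurting performance, which increases S1 speed by 10%. A more detailed investigation is left for future work.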
Further analysis

• Non-autoregressive NMT
- In addition to the work already discussed in this study, several other works
proposed to iteratively refine (or insert) output predictions.
- Other approaches include adding a light autoregressive module to parallel
decoding, partially decoding autoregressively, rescoring output candidates
autoregressively, mimicking hidden states of an autoregressive teacher, training
with different objectives than vanilla cross-entropy, reordering input sentences,
training on additional data from an autoregressive model, and modeling with
latent variables.
Further related work

• Optimizing autoregressive NMT
- Various methods have been proposed to improve the performance of AR models.
- Kim et al. (2019) considered shallow decoders and layer tying (Dabre & Fujita,
2019; Dehghani et al., 2019) on the transformer decoder and found that it sped
up inference on CPUs, but not on a GPU, which was our focus.
- Shi & Knight (2017) proposed a vocabulary reduction method to speed up the
last softmax computation. Senellart et al. (2018) also adopted vocabulary
reduction and explored “fat decoder, thin encoder” on RNN-based models.
- Zhang et al. (2018) used dynamic programming in an average attention network
to accelerate inference. Wu et al. (2019) developed a model with dynamic
convolutions and compared its speed and accuracy with non-autoregressive
models.
Further related work

• This study presents theoretical and empirical evidence that autoregressive NMT can be sped up dramatically by a simple layer allocation strategy (deep encoder, shallow decoder).
• Compared with non-autoregressive models, deep-shallow autoregressive models achieve substantially better translation quality at comparable inference speed.
• The results therefore suggest that knowledge distillation and speed measurement are important aspects to consider in future work if NAR NMT is to outperform AR.
• More generally, the deep encoder, shallow decoder model can be used for any sequence-to-sequence task, including large-scale pretraining.
Conclusion
