AdaMix
AdaMix: Mixture-of-Adaptations for
Parameter-efficient Model Tuning
EMNLP, 2022
Yaqing Wang, Sahaj Agarwal, Subhabrata Mukherjee et al.
Speaker: Po-Chuan Chen
Jul 27, 2023
1 / 37
AdaMix
Table of contents
1 Abstract
2 Introduction
3 Background
4 Mixture-of-Adaptations
5 Experiments
6 Conclusions / Limitations
2 / 37
AdaMix
Abstract
Abstract
This paper proposes AdaMix¹, a general parameter-efficient fine-tuning
(PEFT) technique that tunes a mixture of adaptation modules.
By only tuning 0.1–0.2% of PLM parameters, they show that AdaMix
outperforms SOTA parameter-efficient fine-tuning and full model
fine-tuning for both NLU and NLG tasks.
¹ https://github.com/microsoft/AdaMix
3 / 37
AdaMix
Introduction
Table of contents
1 Abstract
2 Introduction
3 Background
4 Mixture-of-Adaptations
5 Experiments
6 Conclusions / Limitations
4 / 37
AdaMix
Introduction
Introduction
Standard fine-tuning of large pre-trained language models (PLMs) for
downstream tasks requires updating all model parameters, which becomes
increasingly expensive as model size grows.
To address this challenge, recent works have developed
parameter-efficient fine-tuning (PEFT) techniques. These approaches
typically underperform standard full model fine-tuning but
significantly reduce the number of trainable parameters.
5 / 37
AdaMix
Introduction
PEFT technique
Figure 1: Performance of different PEFT methods on GLUE [7].
6 / 37
AdaMix
Introduction
Contribution
Unlike traditional PEFT methods that use a single adaptation
module in every Transformer layer, AdaMix uses several
adaptation modules that learn multiple views of the given task.
AdaMix is trained with stochastic routing and adaptation module
merging to retain the same computational cost and benefits of the
underlying PEFT method.
By tuning only 0.1–0.2% of a pre-trained language model’s
parameters, AdaMix outperforms full model fine-tuning for all
NLU tasks on GLUE, and outperforms other competing methods
for NLG and few-shot NLU tasks.
7 / 37
AdaMix
Background
Background
Mixture-of-Experts. A sparse MoE layer uses N feed-forward
networks (FFN), namely “experts” {E_i}_{i=1}^N, each with its own
set of learnable weights that compute different representations of an
input token x_s based on context.
The output of expert E_i can be formulated as

    E_i(x_s) = w_i^out · GeLU(w_i^in · x_s)    (1)

The output of the sparse MoE layer is given by:

    h(x_s) = Σ_i G(x_s)_i · E_i(x_s)    (2)

where G(x_s) is the output of the gating network and Σ_i G(x_s)_i = 1.
8 / 37
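To make the background concrete, here is a minimal sketch of a sparse MoE layer in the sense of Equations (1)–(2), assuming PyTorch and top-1 gating; the class name SparseMoE and all sizes are illustrative and not taken from the AdaMix codebase.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Sparse mixture-of-experts layer: N expert FFNs plus a gating network."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.w_in = nn.ModuleList(nn.Linear(d_model, d_ff) for _ in range(n_experts))
        self.w_out = nn.ModuleList(nn.Linear(d_ff, d_model) for _ in range(n_experts))
        self.gate = nn.Linear(d_model, n_experts)  # gating network G

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model); each token x_s is routed to its top-1 expert.
        gate_probs = F.softmax(self.gate(x), dim=-1)   # G(x_s), rows sum to 1
        top1 = gate_probs.argmax(dim=-1)               # chosen expert per token
        out = torch.zeros_like(x)
        for i in range(len(self.w_in)):
            mask = top1 == i
            if mask.any():
                # E_i(x_s) = w_i^out . GeLU(w_i^in . x_s), weighted by G(x_s)_i
                expert_out = self.w_out[i](F.gelu(self.w_in[i](x[mask])))
                out[mask] = gate_probs[mask, i].unsqueeze(-1) * expert_out
        return out
```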
AdaMix
Background
Background
Adapters. The adapter tuning strategy judiciously introduces new
parameters into the original PLMs. During fine-tuning, only the
adapter parameters are updated while keeping the remaining
parameters of the PLM frozen.
The adapter layer uses a down-projection W_down ∈ R^{d×r} to project the
input representation x to a low-dimensional space of dimension r (with d
being the model dimension), followed by a nonlinear activation function
f(·), and then an up-projection W_up ∈ R^{r×d}.
9 / 37
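The adapter description above maps to a small bottleneck module. Below is a minimal sketch, assuming PyTorch and GeLU as the nonlinearity f(·); the class name Adapter is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Adapter(nn.Module):
    """Bottleneck adapter: down-projection, nonlinearity, up-projection, residual."""

    def __init__(self, d_model: int, r: int):
        super().__init__()
        self.down = nn.Linear(d_model, r)   # W_down in R^{d x r}
        self.up = nn.Linear(r, d_model)     # W_up in R^{r x d}

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x <- x + f(x . W_down) . W_up, with f = GeLU here
        return x + self.up(F.gelu(self.down(x)))
```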
AdaMix
Background
Figure 2: Conventional adapter design in standard Transformer architecture.
Given the above adapter design with parameters ψ, a dataset D_K, and a
pre-trained language model encoder enc with parameters Θ_PLM, where
|ψ| ≪ |Θ_PLM|, the optimization objective is

    ψ ← argmin_ψ L(D_K; Θ_PLM, ψ)    (3)
10 / 37
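Equation (3) amounts to freezing Θ_PLM and optimizing only the adapter parameters ψ. A hedged sketch of that setup, assuming PyTorch; the helper name setup_adapter_tuning and the learning rate are illustrative.

```python
import torch
from torch.optim import AdamW

def setup_adapter_tuning(plm, adapter_modules, lr: float = 1e-4):
    """Freeze all PLM parameters (Theta_PLM) and return an optimizer over psi only."""
    for p in plm.parameters():
        p.requires_grad = False            # Theta_PLM stays frozen
    trainable = []
    for m in adapter_modules:              # e.g., Adapter instances injected per layer
        for p in m.parameters():
            p.requires_grad = True         # psi is the only trainable set
            trainable.append(p)
    return AdamW(trainable, lr=lr)
```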
AdaMix
Mixture-of-Adaptations
Table of contents I
1 Abstract
2 Introduction
3 Background
4 Mixture-of-Adaptations
Routing Policy
Consistency regularization
Adaptation module merging
Adaptation module sharing
11 / 37
AdaMix
Mixture-of-Adaptations
Table of contents II
Connection to Bayesian Neural Networks and Model
Ensembling
5 Experiments
6 Conclusions / Limitations
12 / 37
AdaMix
Mixture-of-Adaptations
Mixture-of-Adaptations
A set of M adaptation modules is injected into each Transformer layer,
where A_ij, i ∈ {1, ..., L}, j ∈ {1, ..., M}, denotes the j-th adaptation
module in the i-th Transformer layer.
The Transformer consists of L repeated blocks, where each block contains
a self-attention sub-layer, a fully connected feed-forward network (FFN),
and residual connections around the sub-layers followed by layer
normalization.
13 / 37
AdaMix
Mixture-of-Adaptations
Figure 3: Mixture-of-Adaptations (AdaMix) with M = 4.
14 / 37
AdaMix
Mixture-of-Adaptations
Routing Policy
Routing Policy
They use a stochastic routing policy.
At any training step, they randomly select a pair of feedforward-up and
feedforward-down projection matrices in the i-th Transformer layer,
denoted A_i = {W_ij^up, W_ik^down} and B_i = {W_ij'^up, W_ik'^down}
respectively.
Given an input representation x in a given Transformer layer, such a
pair of modules performs the following transformation:

    x ← x + f(x · W^down) · W^up    (4)
15 / 37
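A minimal sketch of this stochastic routing, assuming PyTorch: at every forward pass during training, one project-up and one project-down matrix are drawn at random from the M candidates and applied as in Equation (4). The class name MixtureOfAdapters is illustrative.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfAdapters(nn.Module):
    """M project-down and M project-up matrices with stochastic routing (training)."""

    def __init__(self, d_model: int, r: int, M: int = 4):
        super().__init__()
        self.down = nn.ModuleList(nn.Linear(d_model, r) for _ in range(M))
        self.up = nn.ModuleList(nn.Linear(r, d_model) for _ in range(M))
        self.M = M

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Randomly pick one project-up (j) and one project-down (k) per step.
        j = random.randrange(self.M)
        k = random.randrange(self.M)
        # Eq. (4): x <- x + f(x . W^down) . W^up
        return x + self.up[j](F.gelu(self.down[k](x)))
```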
AdaMix
Mixture-of-Adaptations
Routing Policy
Routing Policy
Stochastic routing enables adaptation modules to learn different
transformations during training and obtain multiple views of the task.
However, this also creates the challenge of deciding which modules to
use at inference time, since the routing is random during training.
To overcome this issue, they provide two techniques that further allow
them to collapse adaptation modules and obtain the same
computational cost as that of a single module.
16 / 37
AdaMix
Mixture-of-Adaptations
Consistency regularization
Consistency regularization
Let A = {A_i}_{i=1}^L and B = {B_i}_{i=1}^L be the two sets of adaptation
modules. They add the following consistency loss as a regularizer to the
task-specific optimization loss:

    L = -( Σ_{c=1}^C I(x, c) log softmax(z_c^A(x))
           + (1/2) (KL(z^A(x) ‖ z^B(x)) + KL(z^B(x) ‖ z^A(x))) )    (5)

where I(x, c) is a binary indicator (0 or 1) of whether class label c is
the correct classification for x, and z^A(x) and z^B(x) are the predicted
logits.
17 / 37
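A hedged sketch of the consistency regularizer in Equation (5), assuming PyTorch: logits_a and logits_b come from two forward passes with different random routing (module sets A and B), and the KL terms are computed between the corresponding softmax distributions.

```python
import torch
import torch.nn.functional as F

def consistency_loss(logits_a: torch.Tensor,
                     logits_b: torch.Tensor,
                     labels: torch.Tensor,
                     kl_weight: float = 0.5) -> torch.Tensor:
    """Cross-entropy on pass A plus symmetric KL between the two routed passes."""
    ce = F.cross_entropy(logits_a, labels)          # -sum_c I(x, c) log softmax(z_c^A)
    log_pa = F.log_softmax(logits_a, dim=-1)
    log_pb = F.log_softmax(logits_b, dim=-1)
    # KL(p_A || p_B) + KL(p_B || p_A), each averaged over the batch
    kl_ab = F.kl_div(log_pb, log_pa, log_target=True, reduction="batchmean")
    kl_ba = F.kl_div(log_pa, log_pb, log_target=True, reduction="batchmean")
    return ce + kl_weight * (kl_ab + kl_ba)
```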
AdaMix
Mixture-of-Adaptations
Adaptation module merging
Adaptation module merging
While the above regularization mitigates inconsistency in random
module selection during inference, it still incurs increased
serving cost from hosting several adaptation modules.
They employ adaptation merging only during inference. Given a set of
adaptation modules W_ij^up and W_ik^down for i ∈ {1, ..., L} and
j, k ∈ {1, ..., M}, they simply average the weights of all the
corresponding modules in every Transformer layer to collapse them into a
single module {W_i'^up, W_i'^down}, where:

    W_i'^up ← (1/M) Σ_{j=1}^M W_ij^up,
    W_i'^down ← (1/M) Σ_{j=1}^M W_ij^down    (6)
18 / 37
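A minimal sketch of the merging step in Equation (6), assuming PyTorch and the MixtureOfAdapters sketch shown earlier; the helper name merge_adapters is illustrative and averages the M up- and down-projection weights (and biases) into a single adapter for inference.

```python
import torch

@torch.no_grad()
def merge_adapters(mix: "MixtureOfAdapters") -> "MixtureOfAdapters":
    """Average the M up/down projections into a single adapter for inference."""
    d_model, r = mix.down[0].in_features, mix.down[0].out_features
    merged = MixtureOfAdapters(d_model, r, M=1)
    # W'_up <- (1/M) sum_j W_up_j ; W'_down <- (1/M) sum_j W_down_j (biases likewise)
    merged.up[0].weight.copy_(torch.stack([m.weight for m in mix.up]).mean(dim=0))
    merged.up[0].bias.copy_(torch.stack([m.bias for m in mix.up]).mean(dim=0))
    merged.down[0].weight.copy_(torch.stack([m.weight for m in mix.down]).mean(dim=0))
    merged.down[0].bias.copy_(torch.stack([m.bias for m in mix.down]).mean(dim=0))
    return merged
```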
AdaMix
Mixture-of-Adaptations
Adaptation module merging
Adaptation module merging
Figure 4: Merging weights of the adaptation modules.
19 / 37
AdaMix
Mixture-of-Adaptations
Adaptation module sharing
Adaptation module sharing
While stochastic routing to multi-view adaptation modules increases the
model capacity, it can also hurt downstream tasks that have limited
labeled data, since several sets of adaptation modules must be tuned.
Here, they share some of the adaptation modules (e.g., the project-down
or the project-up operations) to improve training efficiency.
In their setting, they share only the feedforward projection-up matrices,
i.e., W_ij^up = W_i^up.
20 / 37
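A hedged sketch of this sharing scheme, assuming PyTorch: M separate project-down matrices with a single shared project-up matrix, so that W_ij^up = W_i^up for all j. The class name SharedUpMixture is illustrative.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedUpMixture(nn.Module):
    """M project-down matrices sharing a single project-up matrix (W_ij^up = W_i^up)."""

    def __init__(self, d_model: int, r: int, M: int = 4):
        super().__init__()
        self.down = nn.ModuleList(nn.Linear(d_model, r) for _ in range(M))
        self.up = nn.Linear(r, d_model)     # shared across all M modules
        self.M = M

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        k = random.randrange(self.M)        # stochastic routing over project-down only
        return x + self.up(F.gelu(self.down[k](x)))
```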
AdaMix
Mixture-of-Adaptations
Connection to Bayesian Neural Networks and Model Ensembling
Bayesian Neural Networks
A Bayesian Neural Network (BNN) [2] replaces a deterministic model’s
weight parameters with a distribution over the parameters. For inference,
a BNN averages over all possible weights, a process referred to as
marginalization.
Figure 5: A Bayesian neural network with one hidden layer.
21 / 37
AdaMix
Mixture-of-Adaptations
Connection to Bayesian Neural Networks and Model Ensembling
Bayesian Neural Networks (cont.)
Consider f^W(x) ∈ R^d to be the d-dimensional output of such a neural
network, where the model likelihood is given by p(y | f^W(x)). In their
setting, W = {W^up, W^down} along with the frozen PLM parameters.
For classification, the output is P(y = c | x, W) = softmax(f^W(x)).
Given an instance x, the probability distribution over the classes is
obtained by marginalizing over the posterior distribution:

    p(y = c | x) = ∫_W p(y = c | f^W(x)) p(W | X, Y) dW
22 / 37
AdaMix
Mixture-of-Adaptations
Connection to Bayesian Neural Networks and Model Ensembling
Define the notation:

    {W̃_t}_{t=1}^T ∼ q_θ(W),    W̄ = (1/T) Σ_t W̃_t

    L_W^AM denotes the loss of AdaMix (with merged adaptation weights)

    L_W^Ens denotes the loss of AdaMix-Ensemble (logit-level ensembling)
23 / 37
AdaMix
Mixture-of-Adaptations
Connection to Bayesian Neural Networks and Model Ensembling
Monte-Carlo integration
Here, the objective is to find a surrogate distribution q_θ(W) in a
tractable family of distributions that can replace the true model
posterior, which is hard to compute.
Given q_θ(W), the predictive distribution for classification tasks can
be approximated by Monte-Carlo integration [3] as:

    p(y = c | x) ≈ ∫ p(y = c | f^W(x)) q_θ(W) dW
                 ≈ (1/T) Σ_{t=1}^T p(y = c | f^{W̃_t}(x))
                 = (1/T) Σ_{t=1}^T softmax(f^{W̃_t}(x))    (7)
24 / 37
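A minimal sketch of the Monte-Carlo approximation in Equation (7), assuming PyTorch and a model whose forward pass samples a different routing W̃_t on each call (as in the stochastic routing sketch above); the helper name mc_ensemble_predict is illustrative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mc_ensemble_predict(model, x: torch.Tensor, T: int = 8) -> torch.Tensor:
    """Approximate p(y = c | x) by averaging softmax over T stochastic forward passes."""
    # Each call to model(x) is assumed to sample a different routing W~_t.
    probs = torch.stack([F.softmax(model(x), dim=-1) for _ in range(T)])
    return probs.mean(dim=0)               # (1/T) sum_t softmax(f^{W~_t}(x))
```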
AdaMix
Mixture-of-Adaptations
Connection to Bayesian Neural Networks and Model Ensembling
Analyzing both methods
Prior work [8] shows that averaging the weights of multiple models
fine-tuned with different hyper-parameters improves model
performance.
Let L_W̄^AM = E_{x,y} L(softmax(f^{W̄}(x)), y) be the loss with merging of
the stochastic adaptation weights (from Equation 6), and let

    L_W̃^Ens = E_{x,y} L((1/T) Σ_{t=1}^T softmax(f^{W̃_t}(x)), y)

denote the expected loss from logit-level stochastic model ensembling
(from Equation 7).
They analytically show the similarity between L^AM and L^Ens as a
function of the flatness of the loss and the confidence of the
predictions.
25 / 37
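As an illustration only (the paper's argument is analytical), one could compare the two quantities empirically with a sketch like the following, assuming PyTorch, the merging and ensembling helpers sketched earlier, and classification with cross-entropy as the loss L.

```python
import torch
import torch.nn.functional as F

def compare_losses(merged_model, stochastic_model, x, y, T: int = 8):
    """Return (L^AM, L^Ens) for one batch under a cross-entropy loss L."""
    loss_am = F.cross_entropy(merged_model(x), y)            # merged weights W-bar
    ens_probs = mc_ensemble_predict(stochastic_model, x, T)  # averaged softmax outputs
    loss_ens = F.nll_loss(ens_probs.clamp_min(1e-12).log(), y)
    return loss_am.item(), loss_ens.item()
```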
AdaMix
Experiments
Table of contents
1 Abstract
2 Introduction
3 Background
4 Mixture-of-Adaptations
5 Experiments
6 Conclusions / Limitations
26 / 37
AdaMix
Experiments
Experiments
Dataset. For NLU, they use GLUE. For NLG, they use three different
tasks, namely E2E [6], WebNLG [4], and DART [5].
Baselines. They compare AdaMix to full model fine-tuning and
several state-of-the-art parameter-efficient fine-tuning (PEFT)
methods.
AdaMix implementation details. The number of adaptation modules
in AdaMix is set to 4 for all tasks. AdaMix uses BERT-base and
RoBERTa-large encoders for the NLU tasks, and GPT-2 medium [1] for the
NLG tasks.
27 / 37
AdaMix
Experiments
Table 1: GLUE development set with RoBERTa-large encoder.
Table 2: GLUE development set with BERT-base encoder and AdaMix with
a mixture-of-adapters.
28 / 37
AdaMix
Experiments
NLG Tasks
Table 3: Results on E2E NLG Challenge with GPT-2 medium backbone.
They also run experiments on the DART and WebNLG datasets, along with
several ablation studies; those results are reported in the original paper.
29 / 37
AdaMix
Conclusions / Limitations
Table of contents
1 Abstract
2 Introduction
3 Background
4 Mixture-of-Adaptations
5 Experiments
6 Conclusions / Limitations
30 / 37
AdaMix
Conclusions / Limitations
Conclusions
This paper develops a new framework AdaMix for parameter-efficient
fine-tuning (PEFT) of large pre-trained language models (PLM).
It improves downstream task performance without increasing the
computational cost of the underlying adaptation method.
31 / 37
AdaMix
Conclusions / Limitations
Limitations
The proposed AdaMix method is somewhat compute-intensive as it
involves fine-tuning large-scale language models.
Based on their empirical observations, the number of training iterations
for AdaMix is usually 1–2× that of standard PEFT methods.
32 / 37
AdaMix
Conclusions / Limitations
References I
[1] Tom Brown et al. “Language Models are Few-Shot Learners”.
In: Advances in Neural Information Processing Systems. Ed. by
H. Larochelle et al. Vol. 33. Curran Associates, Inc., 2020,
pp. 1877–1901. url: https:
//proceedings.neurips.cc/paper_files/paper/2020/
file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
[2] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian
Approximation: Representing Model Uncertainty in Deep
Learning. 2016. arXiv: 1506.02142 [stat.ML].
33 / 37
AdaMix
Conclusions / Limitations
References II
[3] Yarin Gal, Riashat Islam, and Zoubin Ghahramani. “Deep
Bayesian Active Learning with Image Data”. In: Proceedings of
the 34th International Conference on Machine Learning. Ed. by
Doina Precup and Yee Whye Teh. Vol. 70. Proceedings of
Machine Learning Research. PMLR, Aug. 2017, pp. 1183–1192.
url:
https://proceedings.mlr.press/v70/gal17a.html.
[4] Claire Gardent et al. “The WebNLG Challenge: Generating Text
from RDF Data”. In: Proceedings of the 10th International
Conference on Natural Language Generation. Santiago de
Compostela, Spain: Association for Computational Linguistics,
Sept. 2017, pp. 124–133. doi: 10.18653/v1/W17-3518. url:
https://aclanthology.org/W17-3518.
34 / 37
AdaMix
Conclusions / Limitations
References III
[5] Linyong Nan et al. “DART: Open-Domain Structured Data
Record to Text Generation”. In: Proceedings of the 2021
Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies.
Online: Association for Computational Linguistics, June 2021,
pp. 432–447. doi: 10.18653/v1/2021.naacl-main.37. url:
https://aclanthology.org/2021.naacl-main.37.
35 / 37
AdaMix
Conclusions / Limitations
References IV
[6] Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. “The
E2E Dataset: New Challenges For End-to-End Generation”. In:
Proceedings of the 18th Annual SIGdial Meeting on Discourse
and Dialogue. Saarbrücken, Germany: Association for
Computational Linguistics, Aug. 2017, pp. 201–206. doi:
10.18653/v1/W17-5525. url:
https://aclanthology.org/W17-5525.
36 / 37
AdaMix
Conclusions / Limitations
References V
[7] Alex Wang et al. “GLUE: A Multi-Task Benchmark and
Analysis Platform for Natural Language Understanding”. In:
Proceedings of the 2018 EMNLP Workshop BlackboxNLP:
Analyzing and Interpreting Neural Networks for NLP. Brussels,
Belgium: Association for Computational Linguistics, Nov. 2018,
pp. 353–355. doi: 10.18653/v1/W18-5446. url:
https://aclanthology.org/W18-5446.
[8] Mitchell Wortsman et al. Model soups: averaging weights of
multiple fine-tuned models improves accuracy without
increasing inference time. 2022. arXiv: 2203.05482 [cs.LG].
37 / 37