AdaMix
AdaMix: Mixture-of-Adaptations for
Parameter-efficient Model Tuning
EMNLP, 2022
Yaqing Wang, Sahaj Agarwal, Subhabrata Mukherjee et al.
Speaker: Po-Chuan Chen
Jul 27, 2023
1 / 37
AdaMix
Table of contents
1 Abstract
2 Introduction
3 Background
4 Mixture-of-Adaptations
5 Experiments
6 Conclusions / Limitations
2 / 37
AdaMix
Abstract
Abstract
This paper proposes AdaMix¹, a general parameter-efficient fine-tuning
(PEFT) technique that tunes a mixture of adaptation modules.
By only tuning 0.1–0.2% of PLM parameters, they show that AdaMix
outperforms SOTA parameter-efficient fine-tuning and full model
fine-tuning for both NLU and NLG tasks.
¹ https://github.com/microsoft/AdaMix
3 / 37
AdaMix
Introduction
Table of contents
1 Abstract
2 Introduction
3 Background
4 Mixture-of-Adaptations
5 Experiments
6 Conclusions / Limitations
4 / 37
AdaMix
Introduction
Introduction
Standard fine-tuning of large pre-trained language models (PLMs) for
downstream tasks requires updating all model parameters, which becomes
increasingly expensive as model size grows.
To address this challenge, recent works have developed
parameter-efficient fine-tuning (PEFT) techniques. These approaches
typically underperform standard full model fine-tuning but
significantly reduce the number of trainable parameters.
5 / 37
AdaMix
Introduction
PEFT technique
Figure 1: Performance of different PEFT methods on GLUE [7].
6 / 37
AdaMix
Introduction
Contribution
Unlike traditional PEFT methods that use a single adaptation
module in every Transformer layer, AdaMix uses several
adaptation modules that learn multiple views of the given task.
AdaMix is trained with stochastic routing and adaptation module
merging to retain the same computational cost and benefits of the
underlying PEFT method.
By tuning only 0.1–0.2% of a pre-trained language model’s
parameters, AdaMix outperforms full model fine-tuning for all
NLU tasks on GLUE, and outperforms other competing methods
for NLG and few-shot NLU tasks.
7 / 37
AdaMix
Background
Background
Mixture-of-Experts. A sparse MoE layer uses N feed-forward
networks (FFN), namely “experts” {E_i}_{i=1}^N, each with its own
set of learnable weights that compute different representations of an
input token x_s based on context.
The output of expert E_i can be formulated as

    E_i(x_s) = w_i^out · GeLU(w_i^in · x_s)    (1)

The output of the sparse MoE layer is given by:

    h(x_s) = Σ_i G(x_s)_i · E_i(x_s)    (2)

where G(x_s) is the output of the gating network and Σ_i G(x_s)_i = 1.
8 / 37
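To make the background concrete, here is a minimal sketch of a sparse MoE layer in the sense of Equations (1)–(2), assuming PyTorch and top-1 gating; the class name SparseMoE and all sizes are illustrative and not taken from the AdaMix codebase.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Sparse mixture-of-experts layer: N expert FFNs plus a gating network."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.w_in = nn.ModuleList(nn.Linear(d_model, d_ff) for _ in range(n_experts))
        self.w_out = nn.ModuleList(nn.Linear(d_ff, d_model) for _ in range(n_experts))
        self.gate = nn.Linear(d_model, n_experts)  # gating network G

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model); each token x_s is routed to its top-1 expert.
        gate_probs = F.softmax(self.gate(x), dim=-1)   # G(x_s), rows sum to 1
        top1 = gate_probs.argmax(dim=-1)               # chosen expert per token
        out = torch.zeros_like(x)
        for i in range(len(self.w_in)):
            mask = top1 == i
            if mask.any():
                # E_i(x_s) = w_i^out . GeLU(w_i^in . x_s), weighted by G(x_s)_i
                expert_out = self.w_out[i](F.gelu(self.w_in[i](x[mask])))
                out[mask] = gate_probs[mask, i].unsqueeze(-1) * expert_out
        return out
```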
AdaMix
Background
Background
Adapters. The adapter tuning strategy judiciously introduces new
parameters into the original PLMs. During fine-tuning, only the
adapter parameters are updated while keeping the remaining
parameters of the PLM frozen.
The adapter layer uses a down-projection W_down ∈ R^{d×r} to project the
input representation x to a low-dimensional space of dimension r (with d
being the model dimension), followed by a nonlinear activation function
f(·), and then an up-projection W_up ∈ R^{r×d}.
9 / 37
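The adapter description above maps to a small bottleneck module. Below is a minimal sketch, assuming PyTorch and GeLU as the nonlinearity f(·); the class name Adapter is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Adapter(nn.Module):
    """Bottleneck adapter: down-projection, nonlinearity, up-projection, residual."""

    def __init__(self, d_model: int, r: int):
        super().__init__()
        self.down = nn.Linear(d_model, r)   # W_down in R^{d x r}
        self.up = nn.Linear(r, d_model)     # W_up in R^{r x d}

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x <- x + f(x . W_down) . W_up, with f = GeLU here
        return x + self.up(F.gelu(self.down(x)))
```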
AdaMix
Background
Figure 2: Conventional adapter design in standard Transformer architecture.
Given the above adapter design with parameters ψ, a dataset D_K, and a
pre-trained language model encoder enc with parameters Θ_PLM, where
|ψ| ≪ |Θ_PLM|, the optimization objective is

    ψ ← argmin_ψ L(D_K; Θ_PLM, ψ)    (3)
10 / 37
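Equation (3) amounts to freezing Θ_PLM and optimizing only the adapter parameters ψ. A hedged sketch of that setup, assuming PyTorch; the helper name setup_adapter_tuning and the learning rate are illustrative.

```python
import torch
from torch.optim import AdamW

def setup_adapter_tuning(plm, adapter_modules, lr: float = 1e-4):
    """Freeze all PLM parameters (Theta_PLM) and return an optimizer over psi only."""
    for p in plm.parameters():
        p.requires_grad = False            # Theta_PLM stays frozen
    trainable = []
    for m in adapter_modules:              # e.g., Adapter instances injected per layer
        for p in m.parameters():
            p.requires_grad = True         # psi is the only trainable set
            trainable.append(p)
    return AdamW(trainable, lr=lr)
```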
AdaMix
Mixture-of-Adaptations
Table of contents I
1 Abstract
2 Introduction
3 Background
4 Mixture-of-Adaptations
Routing Policy
Consistency regularization
Adaptation module merging
Adaptation module sharing
11 / 37
AdaMix
Mixture-of-Adaptations
Table of contents II
Connection to Bayesian Neural Networks and Model
Ensembling
5 Experiments
6 Conclusions / Limitations
12 / 37
AdaMix
Mixture-of-Adaptations
Mixture-of-Adaptations
A set of M adaptation modules is injected into each Transformer layer,
where A_ij, i ∈ {1, ..., L}, j ∈ {1, ..., M}, denotes the j-th adaptation
module in the i-th Transformer layer.
The Transformer consists of L repeated blocks, where each block contains
a self-attention sub-layer, a fully connected feed-forward network (FFN),
and residual connections around the sub-layers followed by layer
normalization.
13 / 37
AdaMix
Mixture-of-Adaptations
Figure 3: Mixture-of-Adaptations (AdaMix) with M = 4.
14 / 37
AdaMix
Mixture-of-Adaptations
Routing Policy
Routing Policy
They use a stochastic routing policy.
At any training step, they randomly select a pair of feedforward-up and
feedforward-down projection matrices in the i-th Transformer layer,
denoted A_i = {W_ij^up, W_ik^down} and B_i = {W_ij'^up, W_ik'^down}
respectively.
Given an input representation x in a given Transformer layer, such a
pair of modules performs the following transformation:

    x ← x + f(x · W^down) · W^up    (4)
15 / 37
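A minimal sketch of this stochastic routing, assuming PyTorch: at every forward pass during training, one project-up and one project-down matrix are drawn at random from the M candidates and applied as in Equation (4). The class name MixtureOfAdapters is illustrative.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfAdapters(nn.Module):
    """M project-down and M project-up matrices with stochastic routing (training)."""

    def __init__(self, d_model: int, r: int, M: int = 4):
        super().__init__()
        self.down = nn.ModuleList(nn.Linear(d_model, r) for _ in range(M))
        self.up = nn.ModuleList(nn.Linear(r, d_model) for _ in range(M))
        self.M = M

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Randomly pick one project-up (j) and one project-down (k) per step.
        j = random.randrange(self.M)
        k = random.randrange(self.M)
        # Eq. (4): x <- x + f(x . W^down) . W^up
        return x + self.up[j](F.gelu(self.down[k](x)))
```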
AdaMix
Mixture-of-Adaptations
Routing Policy
Routing Policy
Stochastic routing enables adaptation modules to learn different
transformations during training and obtain multiple views of the task.
However, this also creates the challenge of deciding which modules to
use at inference time, since the routing is random during training.
To overcome this issue, they provide two techniques that further allow
them to collapse adaptation modules and obtain the same
computational cost as that of a single module.
16 / 37
AdaMix
Mixture-of-Adaptations
Consistency regularization
Consistency regularization
Let A = {A_i}_{i=1}^L and B = {B_i}_{i=1}^L be the two sets of adaptation
modules. They add the following consistency loss as a regularizer to the
task-specific optimization loss:

    L = -( Σ_{c=1}^C I(x, c) log softmax(z_c^A(x))
           + (1/2) (KL(z^A(x) ‖ z^B(x)) + KL(z^B(x) ‖ z^A(x))) )    (5)

where I(x, c) is a binary indicator (0 or 1) of whether class label c is
the correct classification for x, and z^A(x) and z^B(x) are the predicted
logits.
17 / 37
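A hedged sketch of the consistency regularizer in Equation (5), assuming PyTorch: logits_a and logits_b come from two forward passes with different random routing (module sets A and B), and the KL terms are computed between the corresponding softmax distributions.

```python
import torch
import torch.nn.functional as F

def consistency_loss(logits_a: torch.Tensor,
                     logits_b: torch.Tensor,
                     labels: torch.Tensor,
                     kl_weight: float = 0.5) -> torch.Tensor:
    """Cross-entropy on pass A plus symmetric KL between the two routed passes."""
    ce = F.cross_entropy(logits_a, labels)          # -sum_c I(x, c) log softmax(z_c^A)
    log_pa = F.log_softmax(logits_a, dim=-1)
    log_pb = F.log_softmax(logits_b, dim=-1)
    # KL(p_A || p_B) + KL(p_B || p_A), each averaged over the batch
    kl_ab = F.kl_div(log_pb, log_pa, log_target=True, reduction="batchmean")
    kl_ba = F.kl_div(log_pa, log_pb, log_target=True, reduction="batchmean")
    return ce + kl_weight * (kl_ab + kl_ba)
```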
AdaMix
Mixture-of-Adaptations
Adaptation module merging
Adaptation module merging
While the above regularization mitigates inconsistency in random
module selection during inference, it still incurs increased
serving cost from hosting several adaptation modules.
They employ adaptation merging only during inference. Given a set of
adaptation modules W_ij^up and W_ik^down for i ∈ {1, ..., L} and
j, k ∈ {1, ..., M}, they simply average the weights of all the
corresponding modules in every Transformer layer to collapse them into a
single module {W_i'^up, W_i'^down}, where:

    W_i'^up ← (1/M) Σ_{j=1}^M W_ij^up,
    W_i'^down ← (1/M) Σ_{j=1}^M W_ij^down    (6)
18 / 37
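A minimal sketch of the merging step in Equation (6), assuming PyTorch and the MixtureOfAdapters sketch shown earlier; the helper name merge_adapters is illustrative and averages the M up- and down-projection weights (and biases) into a single adapter for inference.

```python
import torch

@torch.no_grad()
def merge_adapters(mix: "MixtureOfAdapters") -> "MixtureOfAdapters":
    """Average the M up/down projections into a single adapter for inference."""
    d_model, r = mix.down[0].in_features, mix.down[0].out_features
    merged = MixtureOfAdapters(d_model, r, M=1)
    # W'_up <- (1/M) sum_j W_up_j ; W'_down <- (1/M) sum_j W_down_j (biases likewise)
    merged.up[0].weight.copy_(torch.stack([m.weight for m in mix.up]).mean(dim=0))
    merged.up[0].bias.copy_(torch.stack([m.bias for m in mix.up]).mean(dim=0))
    merged.down[0].weight.copy_(torch.stack([m.weight for m in mix.down]).mean(dim=0))
    merged.down[0].bias.copy_(torch.stack([m.bias for m in mix.down]).mean(dim=0))
    return merged
```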
AdaMix
Mixture-of-Adaptations
Adaptation module merging
Adaptation module merging
Figure 4: Merging weights of the adaptation modules.
19 / 37
AdaMix
Mixture-of-Adaptations
Adaptation module sharing
Adaptation module sharing
While stochastic routing to multi-view adaptation modules increases the
model capacity, it can also hurt downstream tasks that have limited
labeled data, since several sets of adaptation modules must be tuned.
Here, they share some of the adaptation modules (e.g., the project-down
or the project-up operations) to improve training efficiency.
In their setting, they share only the feedforward projection-up matrices,
i.e., W_ij^up = W_i^up.
20 / 37
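A hedged sketch of this sharing scheme, assuming PyTorch: M separate project-down matrices with a single shared project-up matrix, so that W_ij^up = W_i^up for all j. The class name SharedUpMixture is illustrative.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedUpMixture(nn.Module):
    """M project-down matrices sharing a single project-up matrix (W_ij^up = W_i^up)."""

    def __init__(self, d_model: int, r: int, M: int = 4):
        super().__init__()
        self.down = nn.ModuleList(nn.Linear(d_model, r) for _ in range(M))
        self.up = nn.Linear(r, d_model)     # shared across all M modules
        self.M = M

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        k = random.randrange(self.M)        # stochastic routing over project-down only
        return x + self.up(F.gelu(self.down[k](x)))
```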
AdaMix
Mixture-of-Adaptations
Connection to Bayesian Neural Networks and Model Ensembling
Bayesian Neural Networks
A Bayesian Neural Network (BNN) [2] replaces a deterministic model’s
weight parameters with a distribution over the parameters. For inference,
a BNN averages over all possible weights, a process referred to as
marginalization.
Figure 5: A Bayesian neural network with one hidden layer.
21 / 37
AdaMix
Mixture-of-Adaptations
Connection to Bayesian Neural Networks and Model Ensembling
Bayesian Neural Networks (cont.)
Consider f^W(x) ∈ R^d to be the d-dimensional output of such a neural
network, where the model likelihood is given by p(y | f^W(x)). In their
setting, W = {W^up, W^down} along with the frozen PLM parameters.
For classification, the output is P(y = c | x, W) = softmax(f^W(x)).
Given an instance x, the probability distribution over the classes is
obtained by marginalizing over the posterior distribution:

    p(y = c | x) = ∫_W p(y = c | f^W(x)) p(W | X, Y) dW
22 / 37
AdaMix
Mixture-of-Adaptations
Connection to Bayesian Neural Networks and Model Ensembling
Define the notation:

    {W̃_t}_{t=1}^T ∼ q_θ(W),    W̄ = (1/T) Σ_t W̃_t

    L_W^AM denotes the loss of AdaMix (with merged adaptation weights)

    L_W^Ens denotes the loss of AdaMix-Ensemble (logit-level ensembling)
23 / 37
AdaMix
Mixture-of-Adaptations
Connection to Bayesian Neural Networks and Model Ensembling
Monte-Carlo integration
Here, the objective is to find a surrogate distribution q_θ(W) in a
tractable family of distributions that can replace the true model
posterior, which is hard to compute.
Given q_θ(W), the predictive distribution for classification tasks can
be approximated by Monte-Carlo integration [3] as:

    p(y = c | x) ≈ ∫ p(y = c | f^W(x)) q_θ(W) dW
                 ≈ (1/T) Σ_{t=1}^T p(y = c | f^{W̃_t}(x))
                 = (1/T) Σ_{t=1}^T softmax(f^{W̃_t}(x))    (7)
24 / 37
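A minimal sketch of the Monte-Carlo approximation in Equation (7), assuming PyTorch and a model whose forward pass samples a different routing W̃_t on each call (as in the stochastic routing sketch above); the helper name mc_ensemble_predict is illustrative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mc_ensemble_predict(model, x: torch.Tensor, T: int = 8) -> torch.Tensor:
    """Approximate p(y = c | x) by averaging softmax over T stochastic forward passes."""
    # Each call to model(x) is assumed to sample a different routing W~_t.
    probs = torch.stack([F.softmax(model(x), dim=-1) for _ in range(T)])
    return probs.mean(dim=0)               # (1/T) sum_t softmax(f^{W~_t}(x))
```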
AdaMix
Mixture-of-Adaptations
Connection to Bayesian Neural Networks and Model Ensembling
Analyzing both methods
Prior work [8] shows that averaging the weights of multiple models
fine-tuned with different hyper-parameters improves model
performance.
Let L_W̄^AM = E_{x,y} L(softmax(f^{W̄}(x)), y) be the loss with merging of
the stochastic adaptation weights (from Equation 6), and let

    L_W̃^Ens = E_{x,y} L((1/T) Σ_{t=1}^T softmax(f^{W̃_t}(x)), y)

denote the expected loss from logit-level stochastic model ensembling
(from Equation 7).
They analytically show the similarity between L^AM and L^Ens as a
function of the flatness of the loss and the confidence of the
predictions.
25 / 37
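As an illustration only (the paper's argument is analytical), one could compare the two quantities empirically with a sketch like the following, assuming PyTorch, the merging and ensembling helpers sketched earlier, and classification with cross-entropy as the loss L.

```python
import torch
import torch.nn.functional as F

def compare_losses(merged_model, stochastic_model, x, y, T: int = 8):
    """Return (L^AM, L^Ens) for one batch under a cross-entropy loss L."""
    loss_am = F.cross_entropy(merged_model(x), y)            # merged weights W-bar
    ens_probs = mc_ensemble_predict(stochastic_model, x, T)  # averaged softmax outputs
    loss_ens = F.nll_loss(ens_probs.clamp_min(1e-12).log(), y)
    return loss_am.item(), loss_ens.item()
```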
AdaMix
Experiments
Table of contents
1 Abstract
2 Introduction
3 Background
4 Mixture-of-Adaptations
5 Experiments
6 Conclusions / Limitations
26 / 37
AdaMix
Experiments
Experiments
Dataset. For NLU, they use GLUE. For NLG, they use three different
tasks, namely E2E [6], WebNLG [4], and DART [5].
Baselines. They compare AdaMix to full model fine-tuning and
several state-of-the-art parameter-efficient fine-tuning (PEFT)
methods.
AdaMix implementation details. The number of adaptation modules
in AdaMix is set to 4 for all tasks. AdaMix uses BERT-base and
RoBERTa-large encoders for the NLU tasks, and GPT-2 medium [1] for the
NLG tasks.
27 / 37
AdaMix
Experiments
Table 1: GLUE development set with RoBERTa-large encoder.
Table 2: GLUE development set with BERT-base encoder and AdaMix with
a mixture-of-adapters.
28 / 37
AdaMix
Experiments
NLG Tasks
Table 3: Results on E2E NLG Challenge with GPT-2 medium backbone.
They also run experiments on the DART and WebNLG datasets, along with
several ablation studies; those results are reported in the original paper.
29 / 37
AdaMix
Conclusions / Limitations
Table of contents
1 Abstract
2 Introduction
3 Background
4 Mixture-of-Adaptations
5 Experiments
6 Conclusions / Limitations
30 / 37
AdaMix
Conclusions / Limitations
Conclusions
This paper develops a new framework AdaMix for parameter-efficient
fine-tuning (PEFT) of large pre-trained language models (PLM).
It improves downstream task performance without increasing the
computational cost of the underlying adaptation method.
31 / 37
AdaMix
Conclusions / Limitations
Limitations
The proposed AdaMix method is somewhat compute-intensive as it
involves fine-tuning large-scale language models.
Based on their empirical observations, the number of training iterations
for AdaMix is usually 1–2× that of standard PEFT methods.
32 / 37
AdaMix
Conclusions / Limitations
References I
[1] Tom Brown et al. “Language Models are Few-Shot Learners”.
In: Advances in Neural Information Processing Systems. Ed. by
H. Larochelle et al. Vol. 33. Curran Associates, Inc., 2020,
pp. 1877–1901. url: https:
//proceedings.neurips.cc/paper_files/paper/2020/
file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
[2] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian
Approximation: Representing Model Uncertainty in Deep
Learning. 2016. arXiv: 1506.02142 [stat.ML].
33 / 37
AdaMix
Conclusions / Limitations
References II
[3] Yarin Gal, Riashat Islam, and Zoubin Ghahramani. “Deep
Bayesian Active Learning with Image Data”. In: Proceedings of
the 34th International Conference on Machine Learning. Ed. by
Doina Precup and Yee Whye Teh. Vol. 70. Proceedings of
Machine Learning Research. PMLR, Aug. 2017, pp. 1183–1192.
url:
https://proceedings.mlr.press/v70/gal17a.html.
[4] Claire Gardent et al. “The WebNLG Challenge: Generating Text
from RDF Data”. In: Proceedings of the 10th International
Conference on Natural Language Generation. Santiago de
Compostela, Spain: Association for Computational Linguistics,
Sept. 2017, pp. 124–133. doi: 10.18653/v1/W17-3518. url:
https://aclanthology.org/W17-3518.
34 / 37
AdaMix
Conclusions / Limitations
References III
[5] Linyong Nan et al. “DART: Open-Domain Structured Data
Record to Text Generation”. In: Proceedings of the 2021
Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies.
Online: Association for Computational Linguistics, June 2021,
pp. 432–447. doi: 10.18653/v1/2021.naacl-main.37. url:
https://aclanthology.org/2021.naacl-main.37.
35 / 37
AdaMix
Conclusions / Limitations
References IV
[6] Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. “The
E2E Dataset: New Challenges For End-to-End Generation”. In:
Proceedings of the 18th Annual SIGdial Meeting on Discourse
and Dialogue. Saarbrücken, Germany: Association for
Computational Linguistics, Aug. 2017, pp. 201–206. doi:
10.18653/v1/W17-5525. url:
https://aclanthology.org/W17-5525.
36 / 37
AdaMix
Conclusions / Limitations
References V
[7] Alex Wang et al. “GLUE: A Multi-Task Benchmark and
Analysis Platform for Natural Language Understanding”. In:
Proceedings of the 2018 EMNLP Workshop BlackboxNLP:
Analyzing and Interpreting Neural Networks for NLP. Brussels,
Belgium: Association for Computational Linguistics, Nov. 2018,
pp. 353–355. doi: 10.18653/v1/W18-5446. url:
https://aclanthology.org/W18-5446.
[8] Mitchell Wortsman et al. Model soups: averaging weights of
multiple fine-tuned models improves accuracy without
increasing inference time. 2022. arXiv: 2203.05482 [cs.LG].
37 / 37