Parameter-efficient Fine-tuning (PEFT): Overview,
benefits, techniques and model training
leewayhertz.com/parameter-efficient-fine-tuning
Transfer learning plays a crucial role in the development of large language models such
as GPT-3 and BERT. It is an ML technique in which a model trained on a certain task is
used as a starting point for a distinct but similar task. The idea behind transfer learning is
that the knowledge gained by a model from solving one problem can be leveraged to help
solve another problem.
One of the earliest examples of transfer learning was using pre-trained word embeddings,
such as Word2Vec, to improve the performance of NLP-based models. More recently,
with the emergence of large pre-trained language models such as BERT and GPT-3, the
scope of transfer learning has extended remarkably. Fine-tuning is one of the most
popular methods used in transfer learning. It involves adapting a pre-trained model to a
particular task by training it on a smaller set of task-specific labeled data.
However, with the parameter count of large language models reaching trillions, fine-tuning
the entire model has become computationally expensive and often impractical. In
response, the focus has shifted towards in-context learning, where the model is provided
with prompts for a given task and returns in-context updates. However, inefficiencies like
processing the prompt each time the model makes a prediction and its poor performance
at times make it a less favorable choice. This is where Parameter-efficient Fine-tuning
(PEFT) comes in as an alternative paradigm to prompting. PEFT aims to fine-tune only a
small subset of the model’s parameters, achieving comparable performance to full fine-
tuning while significantly reducing computational requirements. This article will discuss
the PEFT method in detail, exploring its benefits and how it has become an efficient way
to fine-tune LLMs on downstream tasks.
A glossary of important terms
What is PEFT?
What is the difference between fine-tuning and parameter-efficient fine-tuning?
Benefits of PEFT
PEFT: A better alternative to standard fine-tuning
Parameter-efficient fine-tuning techniques
Training your model using PEFT
Few-shot In-context Learning (ICL) vs. Parameter-efficient Fine-tuning (PEFT)
Is PEFT more efficient than ICL?
The process of parameter-efficient fine-tuning
A glossary of important terms
LLMs: Large Language Models (LLMs) are a type of machine learning model that can learn the underlying structure and semantics of text data for NLP tasks. They do this by learning a set of latent variables representing the text’s high-level concepts and features. Essentially, LLMs try to capture what the text is about rather than solely focusing on which words are used.
Pre-trained models: Pre-trained models are machine learning models that have been
trained on large amounts of data to facilitate a specific task, such as image classification,
speech recognition, or natural language processing. Having already learned an effective set of weights and parameters for that task, they can be used as a starting point for further training on new data or for use in other applications.
Parameters: Parameters are the values/variables that a model learns during training to
make predictions or classifications on new data. Parameters are usually represented as
weights and biases in neural networks, and they control how the input data is transformed
into output predictions.
Transfer learning: Transfer learning refers to taking a pre-trained model developed for a
specific task and reusing it as a starting point for a new, related task. This involves using
the pre-trained model’s learned feature representations as a starting point for a new
model, which is then trained on a smaller dataset specific to the new task.
Fine-tuning: Fine-tuning is a specific type of transfer learning where the pre-trained
model’s weights are adjusted or fine-tuned on a new task-specific dataset. The pre-
trained model is used as a starting point in this process, but the weights are adjusted
during training to fit the new data better. The amount of fine-tuning can vary depending on
the amount of available data and the similarity between the original and new tasks.
Padding: Padding is a common technique used during fine-tuning language models to
handle variable-length input sequences. It is the process of adding special tokens
(typically a “padding” token) to the input sequence to bring it up to a fixed length.
Hidden representations: Hidden representations are the internal representations of the
input data learned by the pre-trained model’s layers. These representations capture
different levels of abstraction of the input data and can be used as features to train a new
model for the task at hand.
Few-shot learning: Few-shot learning is a machine learning technique that aims to train
models on a limited amount of labeled data, typically in the range of a few dozen to a few
hundred examples, and then generalize to new tasks with only a few or even a single
labeled example. Few-shot learning algorithms can learn to recognize novel objects,
categories, or concepts with very few examples by leveraging prior knowledge from
related tasks or domains.
What is PEFT?
Parameter-efficient Fine-tuning (PEFT) is a technique used in Natural Language
Processing (NLP) to improve the performance of pre-trained language models on specific
downstream tasks. It involves reusing the pre-trained model’s parameters and fine-tuning
them on a smaller dataset, which saves computational resources and time compared to
training the entire model from scratch.
PEFT achieves this efficiency by freezing some of the layers of the pre-trained model and
only fine-tuning the last few layers that are specific to the downstream task. This way, the
model can be adapted to new tasks with less computational overhead and fewer labeled
examples. Although PEFT is a relatively novel concept, updating only the last layer of a model has been standard practice in computer vision since the introduction of transfer learning. Even in NLP, experiments with static and non-static word embeddings were carried out early on.
Parameter-efficient fine-tuning aims to improve the performance of pre-trained models,
such as BERT and RoBERTa, on various downstream tasks, including sentiment
analysis, named entity recognition, and question answering. It achieves this even in low-resource settings with limited data and computational resources and, because it modifies only a small subset of model parameters, is less prone to overfitting.
What is the difference between fine-tuning and parameter-efficient
fine-tuning?
Fine-tuning and parameter-efficient fine-tuning are two approaches used in machine
learning to improve the performance of pre-trained models on a specific task.
Fine-tuning takes a pre-trained model and trains it further on a new task with new data. In standard fine-tuning, the entire pre-trained model is usually trained, including all its layers
and parameters. This process can be computationally expensive and time-consuming,
especially for large models.
On the other hand, parameter-efficient fine-tuning is a method of fine-tuning that focuses
on training only a subset of the pre-trained model’s parameters. This approach involves
identifying the most important parameters for the new task and only updating those
parameters during training. By doing so, PEFT can significantly reduce the computation required for fine-tuning.
| | Parameter-efficient Fine-tuning | Standard Fine-tuning |
|---|---|---|
| Goal | Improve the performance of a pre-trained model on a specific task with limited data and computation | Improve the performance of a pre-trained model on a specific task with ample data and computation |
| Training Data | Small dataset (fewer examples) | Large dataset (many examples) |
| Training Time | Faster training time as compared to fine-tuning | Longer training time as compared to PEFT |
| Computational Resources | Uses fewer computational resources | Requires larger computational resources |
| Model Parameters | Modifies only a small subset of model parameters | Re-trains the entire model |
| Overfitting | Less prone to overfitting as the model is not excessively modified | More prone to overfitting as the model is extensively modified |
| Training Performance | Not as good as fine-tuning, but still good enough | Typically results in better performance than PEFT |
| Use Cases | Ideal for low-resource settings or where large amounts of training data are not available | Ideal for high-resource settings with ample training data and computational resources |
Parameter-efficient fine-tuning can be particularly useful in scenarios where
computational resources are limited or where large pre-trained models are involved. In
such cases, PEFT can provide a more efficient way of fine-tuning the model without
sacrificing performance. However, it’s important to note that PEFT may sometimes fall short of full fine-tuning, especially in cases where the pre-trained model requires significant modification to perform well on the new task.
Benefits of PEFT
Here are the key benefits of PEFT in relation to traditional full fine-tuning:
1. Decreased computational and storage costs: PEFT involves fine-tuning only a
small number of extra model parameters while freezing most parameters of the pre-
trained LLMs, thereby reducing computational and storage costs significantly.
2. Overcoming catastrophic forgetting: During full fine-tuning of LLMs, catastrophic
forgetting can occur where the model forgets the knowledge it learned during
pretraining. PEFT overcomes this issue by updating only a few parameters.
3. Better performance in low-data regimes: PEFT approaches have been shown to
perform better than full fine-tuning in low-data regimes and generalize better to out-
of-domain scenarios.
4. Portability: PEFT methods produce tiny checkpoints of just a few MB, compared to the large checkpoints of full fine-tuning. This makes the trained
weights from PEFT approaches easy to deploy and use for multiple tasks without
replacing the entire model.
5. Performance comparable to full fine-tuning: PEFT achieves performance comparable to full fine-tuning with only a small number of trainable parameters.
PEFT: A better alternative to standard fine-tuning
A standard fine-tuning process involves adjusting the hidden representations (h)
extracted by transformer models to enhance their performance in downstream tasks.
These hidden representations refer to any features the transformer architecture extracts,
such as the output of a transformer layer or a self-attention layer.
[Figure: Before fine-tuning — the input sentence “This is a total waste of money” passes through the embedding layer and transformer layers 1 to N, producing the hidden representation h at the [CLS] token.]
To illustrate, suppose we have an input sentence, “This is a total waste of money.” Before
fine-tuning, the transformer model computes the hidden representations (h) of each token
in the sentence. After fine-tuning, the model’s parameters are updated, and the updated
parameters will generate a different set of hidden representations, denoted by h’. Thus,
the hidden representations generated by the pre-trained and fine-tuned models will differ
even for the same sentence.
[Figure: After fine-tuning — the same sentence passes through the updated layers, producing a different hidden representation h’ at the [CLS] token, which feeds a classifier head.]
In essence, fine-tuning is a process that modifies the pre-trained language model’s
hidden representations to make them more suitable for downstream tasks. However, fine-
tuning all the parameters in the model is not necessary to achieve this goal. Fine-tuning only a small fraction of the parameters is often sufficient to change the hidden representations from h to h’.
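The common thread across the techniques discussed next can be written compactly (the function f and the parameter set θ_PEFT are our notation, introduced here for illustration): a small trainable module produces an additive correction to the frozen model’s representation, h’ = h + Δh, where Δh = f(h; θ_PEFT) and the number of trainable parameters |θ_PEFT| is far smaller than the total parameter count of the model.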
Parameter-efficient fine-tuning techniques
The following are some of the most widely used PEFT methods; research to explore and develop new methods is ongoing.
Adapter
Adapters are a special type of submodule that can be added to pre-trained language
models to modify their hidden representation during fine-tuning. By inserting adapters
after the multi-head attention and feed-forward layers in the transformer architecture, we
can update only the parameters in the adapters during fine-tuning while keeping the rest
of the model parameters frozen.
Adopting adapters can be a straightforward process. All that is required is to add adapters
into each transformer layer and place a classifier layer on top of the pre-trained model. By
updating the parameters of the adapters and the classifier head, we can improve the
performance of the pre-trained model on a particular task without updating the entire
model. This approach can save time and computational resources while still producing
impressive results.
How does fine-tuning using an adapter work?
The adapter module comprises two feed-forward projection layers connected with a non-
linear activation layer. There is also a skip connection that bypasses the feed-forward
layers.
If we take the adapter placed right after the multi-head attention layer, then the input to
the adapter layer is the hidden representation h calculated by the multi-head attention
layer. Here, h takes two different paths through the adapter: the skip-connection, which leaves the input unchanged, and the feed-forward path.
[Figure: A transformer layer with adapters inserted after the multi-headed attention and feed-forward sublayers (each followed by layer norm); only the adapters are updated. Inside each adapter, a feed-forward down-projection, a nonlinearity, and a feed-forward up-projection produce Δh, which a skip connection adds to the input: h’ = h + Δh.]
First, a feed-forward layer projects h down into a space of lower dimensionality than h. The result is passed through a non-linear activation function, and a second feed-forward layer then projects it back up to the dimensionality of h. The outputs of the two paths are summed to obtain the final output of the adapter module.
The skip-connection preserves the adapter’s original input h, while the feed-forward path generates an incremental change, Δh, based on that input. By adding Δh to the original h from the previous layer, the adapter modifies the hidden representation calculated by the pre-trained model, thereby changing the model’s output for a specific task.
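A minimal PyTorch sketch of such a bottleneck adapter follows; the hidden and bottleneck dimensions are illustrative, and real implementations differ in details such as initialization and placement.

import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, skip connection."""

    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down_project = nn.Linear(hidden_dim, bottleneck_dim)  # h -> low-dim space
        self.nonlinearity = nn.GELU()
        self.up_project = nn.Linear(bottleneck_dim, hidden_dim)    # back to dim of h

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        delta_h = self.up_project(self.nonlinearity(self.down_project(h)))
        return h + delta_h  # skip connection: h' = h + Δh

# Usage: apply to the output of a frozen transformer sublayer
adapter = Adapter(hidden_dim=768)
h = torch.randn(1, 16, 768)   # (batch, sequence length, hidden size)
h_prime = adapter(h)          # modified hidden representation h'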
LoRA
Low-Rank Adaptation (LoRA) of large language models is another approach in the area
of fine-tuning models for specific tasks or domains. Similar to the adapters, LoRA is also
a small trainable submodule that can be inserted into the transformer architecture. It
involves freezing the pre-trained model weights and injecting trainable rank
decomposition matrices into each layer of the transformer architecture, greatly
diminishing the number of trainable parameters for downstream tasks. This method can reduce the number of trainable parameters by up to 10,000 times and the GPU memory requirement by 3 times while still performing on par with or better than full fine-tuning on various tasks. LoRA also allows for more efficient task switching, lowers the hardware barrier to entry, and, once its weights are merged, adds no additional inference latency.
How does it work?
LoRA is inserted in parallel to the modules in the pre-trained transformer model,
specifically in parallel to the feed-forward layers. A feed-forward layer has two projection
layers and a non-linear layer in between them, where the input vector is projected into an
output vector with a different dimensionality using an affine transformation. The LoRA
layers are inserted next to each of the two feed-forward layers.
[Figure: LoRA layers inserted in parallel with the feed-forward down-project and up-project layers; each LoRA branch’s output is added to the corresponding feed-forward output.]
Now, let us consider the feed-forward up-project layer and the LoRA module next to it. The original feed-forward layer takes the output of the previous layer, of dimension d_model, and projects it into d_FFW (where FFW abbreviates feed-forward). The LoRA module placed next to it consists of two feed-forward layers. LoRA’s first feed-forward layer takes the same input as the feed-forward up-project layer and projects it into an r-dimensional vector, with r far smaller than d_model. The second feed-forward layer then projects that vector up to dimensionality d_FFW. Finally, the two vectors are added together to form the final representation.
[Figure: The feed-forward up-project layer maps its input from dimension d_model to d_FFW; the parallel LoRA branch maps the same input through an r-dimensional bottleneck to d_FFW, and the two outputs are summed: h’ = h + Δh.]
As we have discussed earlier, fine-tuning is changing the hidden representation h
calculated by the original transformer model. Hence, in this case, the hidden
representation calculated by the feed-forward up-project layer of the original transformer
is h. Meanwhile, the vector calculated by LoRA is the incremental change Δh that is used
to modify the original h. Thus, the sum of the original representation and the incremental
change is the updated hidden representation h’.
By inserting LoRA modules next to the feed-forward layers and a classifier head on top of
the pre-trained model, task-specific parameters for each task are kept to a minimum.
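The following PyTorch sketch illustrates the idea of a low-rank bypass around a frozen linear layer; the rank, scaling, and initialization are illustrative choices, not the exact recipe of any particular library.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank branch (in -> r -> out)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():    # freeze the pre-trained weights
            p.requires_grad = False
        self.lora_A = nn.Linear(base.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)  # start with Δh = 0
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.base(x)                                      # original output h
        delta_h = self.lora_B(self.lora_A(x)) * self.scaling  # low-rank Δh
        return h + delta_h                                    # h' = h + Δh

layer = LoRALinear(nn.Linear(768, 3072), r=8)  # e.g., an up-project layer
out = layer(torch.randn(1, 16, 768))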
Prefix tuning
Prefix-tuning is a lightweight alternative to fine-tuning large pre-trained language models
for natural language generation tasks. Fine-tuning requires updating and storing all the
model parameters for each task, which can be very expensive given the large size of
current models. Prefix-tuning keeps the language model parameters frozen and optimizes
a small continuous task-specific vector called the prefix. In prefix-tuning, the prefix is a set
of free parameters that are trained along with the language model. The goal of prefix-
tuning is to find a context that steers the language model toward generating text that
solves a particular task.
[Figure: Prefix tuning — trainable prefix vectors are prepended at each transformer layer (1 to N) ahead of the token sequence “This is a total waste of money”, while the model’s own parameters remain frozen.]
The prefix can be seen as a sequence of “virtual tokens” that subsequent tokens can
attend to. By learning only 0.1% of the parameters, prefix-tuning obtains comparable
performance to fine-tuning in the full data setting, outperforms fine-tuning in low-data
settings, and extrapolates better to examples with topics unseen during training.
Similar to all previously mentioned PEFT techniques, the end goal of prefix tuning is to
reach h’. Prefix tuning uses prefixes to modify the hidden representations extracted by the
original pre-trained language models. When the incremental change Δh is added to the
original hidden representation h, we get the modified representation, i.e., h’.
When using prefix tuning, only the prefixes are updated, while the rest of the layers are
fixed and not updated.
Prompt tuning
Prompt tuning is another PEFT technique for adapting pre-trained language models to
specific downstream tasks. Unlike the traditional “model tuning” approach, where all the
pre-trained model parameters are tuned for each task, prompt tuning involves learning
soft prompts through backpropagation that can be fine-tuned for specific tasks by
incorporating labeled examples. Prompt tuning outperforms GPT-3’s few-shot learning and becomes more competitive as model size increases. It also improves robustness to domain shift and enables efficient prompt ensembling. It requires storing only a small
task-specific prompt for each task, making it easier to reuse a single frozen model for
multiple downstream tasks, unlike model tuning, which requires making a task-specific
copy of the entire pre-trained model for each task.
How does it work?
Prompt tuning is a simpler variant of prefix tuning: trainable vectors are prepended to the sequence only at the input layer. When presented with an input sentence,
the embedding layer converts each token into its corresponding word embedding, and the
prefix embeddings are prepended to the sequence of token embeddings. Next, the pre-
trained transformer layers will process the embedding sequence like a transformer model
does to a normal sequence. Only the prefix embeddings are adjusted during the fine-
tuning process, while the rest of the transformer model is kept frozen and unchanged.
[Figure: Prompt tuning — trainable prefix embeddings are prepended to the input sequence’s token embeddings after the embedding layer; the frozen transformer layers 1 to N then process the combined sequence.]
This technique has several advantages over traditional fine-tuning methods, including
improved efficiency and reduced computational overhead. Additionally, the fact that only
the prefix embeddings are fine-tuned means that there is a lower risk of overfitting to the
training data, thereby producing more robust and generalizable models.
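The mechanism can be sketched in a few lines of PyTorch; the number of virtual tokens, the embedding size, and the initialization are illustrative assumptions.

import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Trainable prompt vectors prepended to frozen token embeddings."""

    def __init__(self, num_virtual_tokens: int, embed_dim: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(num_virtual_tokens, embed_dim) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        batch_size = token_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch_size, -1, -1)
        return torch.cat([prompt, token_embeds], dim=1)  # prepend virtual tokens

soft_prompt = SoftPrompt(num_virtual_tokens=20, embed_dim=768)
token_embeds = torch.randn(4, 32, 768)  # output of the frozen embedding layer
augmented = soft_prompt(token_embeds)   # shape: (4, 52, 768)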
P-tuning
P-tuning can improve the performance of language models such as GPTs in Natural
Language Understanding (NLU) tasks. Traditional fine-tuning techniques have not been
effective for GPTs, but P-tuning uses trainable continuous prompt embeddings to improve
their performance. This method has been tested on two NLU benchmarks, LAMA and
SuperGLUE, and has shown significant improvements in precision and world knowledge
recovery. P-tuning also reduces the need for prompt engineering and outperforms state-
of-the-art approaches on the few-shot SuperGLUE benchmark.
P-tuning can be used to improve pre-trained language models for various tasks, including
sentence classification and predicting a country’s capital. The technique involves augmenting the input embeddings of the pre-trained language model with continuous, differentiable prompt embeddings. These prompts are optimized with a downstream loss function and a prompt encoder, which helps address the discreteness and association challenges of manual prompts.
IA3
IA3, short for Infused Adapter by Inhibiting and Amplifying Inner Activations, is another
parameter-efficient fine-tuning technique designed to improve upon the LoRA technique.
It focuses on making the fine-tuning process more efficient by reducing the number of
trainable parameters in a model.
Both LoRA and IA3 share some similarities in their core objective of improving fine-tuning
efficiency. They achieve this by introducing learned components, reducing the number of
trainable parameters, and keeping the original pre-trained weights frozen. These shared
characteristics make both techniques valuable tools for adapting large pre-trained models
to specific tasks while minimizing computational demands. Additionally, both LoRA and
IA3 prioritize maintaining model performance, ensuring that fine-tuned models remain
competitive with fully fine-tuned ones. Furthermore, their capacity to merge adapter
weights without adding inference latency contributes to their versatility and practicality for
real-time applications and various downstream tasks.
How does IA3 work?
IA3 optimizes the fine-tuning process by rescaling the inner activations of a pre-trained
model using learned vectors. These learned vectors are incorporated into the attention
and feedforward modules within a standard transformer-based architecture. The key
innovation of IA3 is that it freezes the original pre-trained weights of the model, making
only the introduced learned vectors trainable during fine-tuning. This drastic reduction in
the number of trainable parameters significantly improves the efficiency of fine-tuning
without compromising model performance. IA3 is compatible with various downstream
tasks, maintains inference speed, and can be applied to specific layers of a neural
network, making it a valuable tool for efficient model adaptation and deployment.
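At its core, the rescaling is just an elementwise multiplication by a learned vector, as in this minimal sketch; here the vector is applied to a feed-forward hidden activation, while actual implementations insert such vectors at specific points in both the attention and feed-forward blocks.

import torch
import torch.nn as nn

class IA3Scale(nn.Module):
    """Learned vector that elementwise-rescales an inner activation."""

    def __init__(self, dim: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(dim))  # init at 1: no change at first

    def forward(self, activation: torch.Tensor) -> torch.Tensor:
        return activation * self.scale  # inhibit (<1) or amplify (>1) activations

ia3 = IA3Scale(dim=3072)
ffn_hidden = torch.randn(1, 16, 3072)  # a frozen feed-forward hidden activation
rescaled = ia3(ffn_hidden)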
Training your model using PEFT
In our example, we will use LoRA to fine-tune a pre-trained sequence-to-sequence
language model to generate text for a specific task, in this case, for Twitter complaints.
Import the dependencies and define the variables
First, import the necessary libraries, modules and other dependencies, such as AutoModelForSeq2SeqLM, PeftModel, torch, datasets and AutoTokenizer:
import os

import torch
from datasets import load_dataset
from peft import PeftConfig, PeftModel
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    default_data_collator,
    get_linear_schedule_with_warmup,
)
Next, we need to define the name of the dataset, the text column name, the label column
name, and the batch size for training the model.
dataset_name = "twitter_complaints"
text_column = "Tweet text"
label_column = "text_label"
batch_size = 8
Now, run the following commands to define the pre-trained PEFT model and load its
configuration.
peft_model_id = "smangrul/twitter_complaints_bigscience_T0_3B_LORA_SEQ_2_SEQ_LM"
config = PeftConfig.from_pretrained(peft_model_id)
In the above code, the ‘peft_model_id’ variable holds the ID of the pre-trained PEFT model, and the ‘config’ variable is set to that model’s configuration.
Now, set the maximum memory allowed for each device; here, the GPU is allowed up to 6GB of memory and the CPU up to 30GB.
max_memory = {0: "6GIB", 1: "0GIB", 2: "0GIB", 3: "0GIB", 4: "0GIB", "cpu": "30GB"}
Load the base model of the pre-trained PEFT model specified by peft_model_id.
model = AutoModelForSeq2SeqLM.from_pretrained(
    config.base_model_name_or_path, device_map="auto", max_memory=max_memory
)
In the above command, the ‘AutoModelForSeq2SeqLM’ class is used to load the base
model and the ‘from_pretrained’ function is used to load the weights of the pre-trained
model. The ‘device_map’ argument specifies the mapping between devices and model
components, and the ‘max_memory’ argument specifies the maximum memory allowed
for each device.
Next, load the full PEFT model specified by ‘peft_model_id’ using the following command:
model = PeftModel.from_pretrained(
    model, peft_model_id, device_map="auto", max_memory=max_memory
)
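The preprocessing steps below refer to a dataset object, so the dataset needs to be loaded first. One plausible way to do this, assuming the twitter_complaints subset of the RAFT benchmark that this example appears to follow, is:

dataset = load_dataset("ought/raft", dataset_name)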
Preprocess the data
Map the dataset labels to human-readable class names:
The first step in preprocessing the data is to map the dataset labels to human-readable
class names. For this, you need to replace all the underscores with spaces in the label
names of the training set.
classes = [k.replace("_", " ") for k in dataset["train"].features["Label"].names]
print(classes)
Then, run the following code to map the labels to human-readable class names.
dataset = dataset.map(
    lambda x: {"text_label": [classes[label] for label in x["Label"]]},
    batched=True,
    num_proc=1,
)
print(dataset)
dataset["train"][0]
Tokenization:
First, we need to load a pre-trained tokenizer from the transformers library for
tokenization. We also need to set the maximum length of the target labels by tokenizing
each class label and taking the length of the resulting list of token IDs. This can be used
later to pad all labels to a consistent length. For this, run the following:
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
target_max_length = max(
    len(tokenizer(class_label)["input_ids"]) for class_label in classes
)
Run the following code to extract the text and target labels from the input examples,
tokenize the text using the pre-trained tokenizer, and pad the labels to a consistent
length.
def preprocess_function(examples):
    inputs = examples[text_column]
    targets = examples[label_column]
    model_inputs = tokenizer(inputs, truncation=True)
    labels = tokenizer(
        targets,
        max_length=target_max_length,
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    )
    labels = labels["input_ids"]
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss
    model_inputs["labels"] = labels
    return model_inputs
Specify the steps needed to preprocess the dataset and prepare it for fine-tuning the
model.
processed_datasets = dataset.map(
    preprocess_function,
    batched=True,
    num_proc=1,
    remove_columns=dataset["train"].column_names,
    load_from_cache_file=True,
    desc="Running tokenizer on dataset",
)
Now, split the preprocessed dataset into separate training, evaluation, and test sets.
train_dataset = processed_datasets["train"]
eval_dataset = processed_datasets["eval"]
test_dataset = processed_datasets["test"]
Define a collate function:
Next, we need to define a collate function to gather and combine the preprocessed
examples into batches.
def collate_fn(examples):
    return tokenizer.pad(examples, padding="longest", return_tensors="pt")
Next, define the data loaders for the training, evaluation, and test datasets.
train_dataloader = DataLoader(
    train_dataset,
    shuffle=True,
    collate_fn=collate_fn,
    batch_size=batch_size,
    pin_memory=True,
)
eval_dataloader = DataLoader(
    eval_dataset, collate_fn=collate_fn, batch_size=batch_size, pin_memory=True
)
test_dataloader = DataLoader(
    test_dataset, collate_fn=collate_fn, batch_size=batch_size, pin_memory=True
)
Model training and evaluation
To train the model using the preprocessed dataset, first define the training specifications, such as the number of epochs, the optimizer, and the learning rate schedule, as sketched below.
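A minimal training loop along these lines might look like the following; the optimizer choice, learning rate, and epoch count are illustrative, not prescribed by the original example.

from torch.optim import AdamW

num_epochs = 3  # illustrative
optimizer = AdamW((p for p in model.parameters() if p.requires_grad), lr=3e-4)
lr_scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=len(train_dataloader) * num_epochs,
)

model.train()
for epoch in range(num_epochs):
    for batch in tqdm(train_dataloader):
        batch = {k: v.to("cuda") for k, v in batch.items()}
        outputs = model(**batch)  # the seq2seq model computes the loss from "labels"
        outputs.loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()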
Once trained, evaluate the model on its intended purpose.
model.eval()
i = 15
inputs = tokenizer(
    f'{text_column} : {dataset["test"][i]["Tweet text"]} Label : ',
    return_tensors="pt",
)
print(inputs)

with torch.no_grad():
    outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=10)

print(outputs)
print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))
Assessing the performance of a fine-tuned machine learning model is an essential step.
One common way to evaluate a model’s performance is by checking its accuracy on an
evaluation dataset. You can refer to this GitHub repository to view the entire evaluation
process, including the code for calculating these metrics.
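For reference, a simple accuracy loop over the evaluation split could be sketched as follows; the exact decoding and comparison details in the repository may differ.

model.eval()
correct = total = 0
with torch.no_grad():
    for batch in tqdm(eval_dataloader):
        labels = batch.pop("labels")
        batch = {k: v.to("cuda") for k, v in batch.items()}
        outputs = model.generate(**batch, max_new_tokens=10)
        preds = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        labels[labels == -100] = tokenizer.pad_token_id  # undo the loss mask
        refs = tokenizer.batch_decode(labels, skip_special_tokens=True)
        correct += sum(p.strip() == r.strip() for p, r in zip(preds, refs))
        total += len(refs)

print(f"accuracy = {correct / total:.2%}")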
Few-shot In-context Learning (ICL) vs. Parameter-efficient Fine-
tuning (PEFT)
Few-shot in-context learning and parameter-efficient fine-tuning are two approaches for adapting natural language processing models. Although both enable pre-trained language models to perform new tasks without extensive training, they work in technically different ways. The first approach, ICL, allows the model to perform a new task simply by conditioning on prompted examples,
without requiring gradient-based training. However, ICL incurs significant computational,
memory, and storage costs. The second approach, PEFT, involves training a small
number of added or selected parameters to enable a model to perform a new task with
minimal updates.
ICL aims to improve the few-shot performance of pre-trained language models by incorporating contextual information at inference time rather than through any fine-tuning. The prompt supplied to the frozen model carries this contextual information, typically additional sentences or worked input-output examples that describe the task at hand. ICL uses this context to enhance the model’s ability to generalize to new tasks, even with very few examples.
On the other hand, parameter-efficient fine-tuning aims to make fine-tuning pre-trained language models on downstream tasks more efficient by updating only a small set of task-relevant parameters while freezing the rest. Fine-tuning on a small amount of data with most parameters frozen also helps prevent overfitting: by selectively freezing parameters, the model retains more of its pre-trained knowledge, improving its performance on downstream tasks with limited training data.
Is PEFT more efficient than ICL?
Few-shot learning is an important challenge for natural language processing applications, where models must quickly adapt to new tasks with limited training examples. Various approaches have been put forward to tackle it, with ICL being one of the most popular. However, a research paper published in 2022, “Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning,” introduced a parameter-efficient few-shot learning approach that outperforms ICL in accuracy while requiring significantly fewer computational resources.
One of the main reasons this PEFT approach outperforms ICL is its use of a novel scaling method called (IA)^3, which rescales inner activations with learned vectors. This technique performs better than fine-tuning the full model while introducing only a few additional parameters. ICL, in contrast, leaves the model unchanged and must re-process its prompted examples for every prediction, which is computationally costly and can yield less reliable accuracy.
Another reason why PEFT is better than ICL is due to its use of two additional loss terms
that encourage the model to output lower probabilities for incorrect choices and account
for the length of different answer choices. These loss terms help the model to better
generalize to new tasks and avoid overfitting.
In addition to its superior performance, parameter-efficient fine-tuning is also more
computationally efficient than ICL. The research paper found that PEFT uses over 1,000x
fewer floating-point operations (FLOPs) during inference than few-shot ICL with GPT-3
and only requires 30 minutes to train on a single NVIDIA A100 GPU. This makes PEFT a
more practical and scalable solution for real-world NLP applications.
Overall, the introduction of PEFT represents a significant advancement in the field of few-
shot learning for NLP applications. Its use of (IA)^3 scaling, additional loss terms, and
superior computational efficiency make it a better alternative to ICL for tasks that require
rapid adaptation to new few-shot learning scenarios.
The process of parameter-efficient fine-tuning
The steps involved in parameter-efficient fine-tuning can vary depending on the specific
implementation and the pre-trained model being used. However, here is a general outline
of the steps involved in PEFT:
Pre-training: Initially, a large-scale model is pre-trained on a large dataset using a
general task such as image classification or language modeling. This pre-training phase
helps the model learn meaningful representations and features from the data.
Task-specific dataset: Gather or create a dataset that is specific to the target task you
want to fine-tune the pre-trained model for. This dataset should be labeled and
representative of the target task.
Parameter identification: Identify or estimate the importance or relevance of parameters
in the pre-trained model for the target task. This step helps in determining which
parameters should be prioritized during fine-tuning. Various techniques, such as
importance estimation, sensitivity analysis, or gradient-based methods, can be used to
identify important parameters.
Subset selection: Select a subset of the pre-trained model’s parameters based on their
importance or relevance to the target task. The subset can be determined by setting
certain criteria, such as a threshold on the importance scores or selecting the top-k most
important parameters.
Fine-tuning: Initialize the selected subset of parameters with the values from the pre-
trained model and freeze the remaining parameters. Fine-tune the selected parameters
using the task-specific dataset. This involves training the model on the target task data,
typically using techniques like Stochastic Gradient Descent (SGD) or Adam optimization.
Evaluation: Evaluate the performance of the fine-tuned model on a validation set or
through other evaluation metrics relevant to the target task. This step helps assess the
effectiveness of PEFT in achieving the desired performance while using fewer
parameters.
Iterative refinement (optional): Depending on the performance and requirements, you
may choose to iterate and refine the PEFT process by adjusting the criteria for parameter
selection, exploring different subsets, or fine-tuning for additional epochs to optimize the
model’s performance further.
However, it’s important to note that the specific implementation details and techniques
used in PEFT can vary across research papers as well as applications.
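As a quick sanity check on the “fewer parameters” claim after step 5, it is common to compare trainable and total parameter counts; a minimal sketch for any PyTorch model is:

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / total: {total:,} ({100 * trainable / total:.4f}%)")
# With the Hugging Face peft library, model.print_trainable_parameters() reports the same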
Endnote
PEFT, or Parameter-efficient Fine-tuning, is a natural language processing technique
used to improve the performance of pre-trained language models on specific downstream
tasks. It involves freezing some of the layers of the pre-trained model and only fine-tuning
the last few layers that are specific to the downstream task. This technique is more
beneficial than traditional fine-tuning in several ways, such as decreased computational
and storage costs, overcoming catastrophic forgetting, and comparable performance to
full fine-tuning with a small number of trainable parameters. Overall, PEFT is a promising
approach to improving the efficiency and effectiveness of NLP models in various
applications.
Start a conversation by filling the form

More Related Content

PDF
LLM Cheatsheet and it's brief introduction
PPTX
How to fine-tune and develop your own large language model.pptx
PDF
Parameter-Efficient Fine-Tuning Explained in Detail.pdf
PDF
Parameter-Efficient Fine-Tuning Explained in Detail.pdf
PPTX
Parameter-Efficient Fine-Tuning (PEFT) techniques across language, vision, ge...
PDF
W4L2_11-667: LARGE LANGUAGE MODELS: METHODS AND APPLICATIONS - Parameter Effi...
PDF
W4L2_11-667: LARGE LANGUAGE MODELS: METHODS AND APPLICATIONS PETM Parameter E...
PPTX
Transfer Learning: Breve introducción a modelos pre-entrenados.
LLM Cheatsheet and it's brief introduction
How to fine-tune and develop your own large language model.pptx
Parameter-Efficient Fine-Tuning Explained in Detail.pdf
Parameter-Efficient Fine-Tuning Explained in Detail.pdf
Parameter-Efficient Fine-Tuning (PEFT) techniques across language, vision, ge...
W4L2_11-667: LARGE LANGUAGE MODELS: METHODS AND APPLICATIONS - Parameter Effi...
W4L2_11-667: LARGE LANGUAGE MODELS: METHODS AND APPLICATIONS PETM Parameter E...
Transfer Learning: Breve introducción a modelos pre-entrenados.

Similar to leewayhertz.com-Parameter-efficient Fine-tuning PEFT Overview benefits techniques and model training.pdf (20)

PDF
Lecture 11 - Advance Learning Techniques
PDF
How to use transfer learning to bootstrap image classification and question a...
PDF
Fine-tuning Pre-Trained Models for Generative AI Applications
PDF
Tailoring Small Language Models for Enterprise Use Cases
PPTX
OReilly AI Transfer Learning
PPTX
17_00-Dima-Panchenko-cnn-tips-and-tricks.pptx
PDF
Evaluating Parameter Efficient Learning for Generation.pdf
PDF
Introduction to Few shot learning
PPTX
Fine_Tuning_Datasets_and_Techniques.pptx
PPTX
Deep Learning Intoductions along with Examples.pptx
PPTX
[DSC Europe 24] Gabriel Preda - Fine-tune LLMs from Kaggle Models using (Q)LoRA
PDF
lenet -lenet-lenet-lenet-lenetlenetlenet.pptx.pdf
PDF
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
PDF
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for D...
PDF
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for De...
PPTX
Nuts and Bolts of Transfer Learning.pptx
PPTX
MODULE 4 AAI_______________________.pptx
PDF
ACL-2022_tutorial_part_AB_V8 (1).pdf
PPTX
Transfer Leaning Using Pytorch synopsis Minor project pptx
PPTX
transferlearning.pptx
Lecture 11 - Advance Learning Techniques
How to use transfer learning to bootstrap image classification and question a...
Fine-tuning Pre-Trained Models for Generative AI Applications
Tailoring Small Language Models for Enterprise Use Cases
OReilly AI Transfer Learning
17_00-Dima-Panchenko-cnn-tips-and-tricks.pptx
Evaluating Parameter Efficient Learning for Generation.pdf
Introduction to Few shot learning
Fine_Tuning_Datasets_and_Techniques.pptx
Deep Learning Intoductions along with Examples.pptx
[DSC Europe 24] Gabriel Preda - Fine-tune LLMs from Kaggle Models using (Q)LoRA
lenet -lenet-lenet-lenet-lenetlenetlenet.pptx.pdf
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for D...
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for De...
Nuts and Bolts of Transfer Learning.pptx
MODULE 4 AAI_______________________.pptx
ACL-2022_tutorial_part_AB_V8 (1).pdf
Transfer Leaning Using Pytorch synopsis Minor project pptx
transferlearning.pptx
Ad

More from alexjohnson7307 (20)

PDF
The adoption AI_in_account_to_report.pdf
PDF
zbrain_ai_ai_in_the_project_and_capital_expenditure_manageme.pdf
PDF
zbrain_ai_computer using agents_models.pdf
PDF
zbrain_platform_ai_in_procure_to_pay.pdf
PDF
zbrain_ai_generative_ai_for_manufacturing.pdf
PDF
Scope Integration Use Cases Challenges and Best Practices.pdf
PDF
zbrain.ai-Scope Adoption Use Cases Challenges and Trends.pdf
PDF
zbrain.ai-Generative AI in logistics Use cases integration approaches develop...
PDF
Zbrain- Generative AI in Hospitality.pdf
PDF
zbrain.ai-Accelerating Enterprise AI Development with Retrieval-augmented Gen...
PDF
zbrain_ai_generative_ai_for_internal_audit.pdf
PDF
zbrain.ai-Scope integration strategies use cases and future trends.pdf
PDF
leewayhertz.com-Cloud AI services A comprehensive guide.pdf
PDF
leewayhertz.com-AI agents for real estate Applications benefits and implement...
PDF
leewayhertz.com-Use cases solution and implementation.pdf
PDF
leewayhertz.com-AI agents for real estate Applications benefits and implement...
PDF
leewayhertz.com-Use cases implementation and development (1).pdf
PDF
leewayhertz.com-How to build a private LLM (1).pdf
PDF
leewayhertz.com-AI Use Cases amp Applications Across Major industries.pdf
PDF
leewayhertz.com-Use cases solution AI agents and implementation.pdf
The adoption AI_in_account_to_report.pdf
zbrain_ai_ai_in_the_project_and_capital_expenditure_manageme.pdf
zbrain_ai_computer using agents_models.pdf
zbrain_platform_ai_in_procure_to_pay.pdf
zbrain_ai_generative_ai_for_manufacturing.pdf
Scope Integration Use Cases Challenges and Best Practices.pdf
zbrain.ai-Scope Adoption Use Cases Challenges and Trends.pdf
zbrain.ai-Generative AI in logistics Use cases integration approaches develop...
Zbrain- Generative AI in Hospitality.pdf
zbrain.ai-Accelerating Enterprise AI Development with Retrieval-augmented Gen...
zbrain_ai_generative_ai_for_internal_audit.pdf
zbrain.ai-Scope integration strategies use cases and future trends.pdf
leewayhertz.com-Cloud AI services A comprehensive guide.pdf
leewayhertz.com-AI agents for real estate Applications benefits and implement...
leewayhertz.com-Use cases solution and implementation.pdf
leewayhertz.com-AI agents for real estate Applications benefits and implement...
leewayhertz.com-Use cases implementation and development (1).pdf
leewayhertz.com-How to build a private LLM (1).pdf
leewayhertz.com-AI Use Cases amp Applications Across Major industries.pdf
leewayhertz.com-Use cases solution AI agents and implementation.pdf
Ad

Recently uploaded (20)

PDF
cuic standard and advanced reporting.pdf
PDF
NewMind AI Monthly Chronicles - July 2025
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
Empathic Computing: Creating Shared Understanding
PDF
Electronic commerce courselecture one. Pdf
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PDF
Unlocking AI with Model Context Protocol (MCP)
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PPTX
A Presentation on Artificial Intelligence
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PPTX
Big Data Technologies - Introduction.pptx
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PDF
Encapsulation theory and applications.pdf
PDF
NewMind AI Weekly Chronicles - August'25 Week I
PDF
Modernizing your data center with Dell and AMD
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PPTX
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
cuic standard and advanced reporting.pdf
NewMind AI Monthly Chronicles - July 2025
Spectral efficient network and resource selection model in 5G networks
Empathic Computing: Creating Shared Understanding
Electronic commerce courselecture one. Pdf
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
“AI and Expert System Decision Support & Business Intelligence Systems”
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
Unlocking AI with Model Context Protocol (MCP)
Diabetes mellitus diagnosis method based random forest with bat algorithm
A Presentation on Artificial Intelligence
Dropbox Q2 2025 Financial Results & Investor Presentation
Big Data Technologies - Introduction.pptx
Building Integrated photovoltaic BIPV_UPV.pdf
Encapsulation theory and applications.pdf
NewMind AI Weekly Chronicles - August'25 Week I
Modernizing your data center with Dell and AMD
20250228 LYD VKU AI Blended-Learning.pptx
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication

leewayhertz.com-Parameter-efficient Fine-tuning PEFT Overview benefits techniques and model training.pdf

  • 1. 1/19 Parameter-efficient Fine-tuning (PEFT): Overview, benefits, techniques and model training leewayhertz.com/parameter-efficient-fine-tuning Transfer learning plays a crucial role in the development of large language models such as GPT-3 and BERT. It is an ML technique in which a model trained on a certain task is used as a starting point for a distinct but similar task. The idea behind transfer learning is that the knowledge gained by a model from solving one problem can be leveraged to help solve another problem. One of the earliest examples of transfer learning was using pre-trained word embeddings, such as Word2Vec, to improve the performance of NLP-based models. More recently, with the emergence of large pre-trained language models such as BERT and GPT-3, the scope of transfer learning has extended remarkably. Fine-tuning is one of the most popular methods used in transfer learning. It involves adapting a pre-trained model to a particular task by training it on a smaller set of task-specific labeled data. However, with the parameter count of large language models reaching trillions, fine-tuning the entire model has become computationally expensive and often impractical. In response, the focus has shifted towards in-context learning, where the model is provided with prompts for a given task and returns in-context updates. However, inefficiencies like processing the prompt each time the model makes a prediction and its poor performance at times make it a less favorable choice. This is where Parameter-efficient Fine-tuning (PEFT) comes in as an alternative paradigm to prompting. PEFT aims to fine-tune only a small subset of the model’s parameters, achieving comparable performance to full fine-
  • 2. 2/19 tuning while significantly reducing computational requirements. This article will discuss the PEFT method in detail, exploring its benefits and how it has become an efficient way to fine-tune LLMs on downstream tasks. A glossary of important terms What is PEFT? What is the difference between fine-tuning and parameter-efficient fine-tuning? Benefits of PEFT PEFT: A better alternative to standard fine-tuning Parameter-efficient fine-tuning techniques Training your model using PEFT Few-shot In-context Learning (ICL) vs. Parameter-efficient Fine-tuning (PEFT) Is PEFT more efficient than ICL? The process of parameter-efficient fine-tuning A glossary of important terms LLM models: Large Language Models or LLMs are a type of machine learning models that can learn the underlying structure and semantics of text data for NLP tasks. They do this by learning a set of latent variables representing the text’s high-level concepts and features. Essentially, LLM models try to capture what the text is about, without solely focusing on what words are used. Pre-trained models: Pre-trained models are machine learning models that have been trained on large amounts of data to facilitate a specific task, such as image classification, speech recognition, or natural language processing. These models have already learned the optimal set of weights and parameters needed to perform the task effectively so that they can be used as a starting point for further training on new data or for use in other applications. Parameters: Parameters are the values/variables that a model learns during training to make predictions or classifications on new data. Parameters are usually represented as weights and biases in neural networks, and they control how the input data is transformed into output predictions. Transfer learning: Transfer learning refers to taking a pre-trained model developed for a specific task and reusing it as a starting point for a new, related task. This involves using the pre-trained model’s learned feature representations as a starting point for a new model, which is then trained on a smaller dataset specific to the new task. Fine-tuning: Fine-tuning is a specific type of transfer learning where the pre-trained model’s weights are adjusted or fine-tuned on a new task-specific dataset. The pre- trained model is used as a starting point in this process, but the weights are adjusted during training to fit the new data better. The amount of fine-tuning can vary depending on the amount of available data and the similarity between the original and new tasks.
  • 3. 3/19 Padding: Padding is a common technique used during fine-tuning language models to handle variable-length input sequences. It is the process of adding special tokens (typically a “padding” token) to the input sequence to bring it up to a fixed length. Hidden representations: Hidden representations are the internal representations of the input data learned by the pre-trained model’s layers. These representations capture different levels of abstraction of the input data and can be used as features to train a new model for the task at hand. Few-shot learning: Few-shot learning is a machine learning technique that aims to train models on a limited amount of labeled data, typically in the range of a few dozen to a few hundred examples, and then generalize to new tasks with only a few or even a single labeled example. Few-shot learning algorithms can learn to recognize novel objects, categories, or concepts with very few examples by leveraging prior knowledge from related tasks or domains. What is PEFT? Parameter-efficient Fine-tuning (PEFT) is a technique used in Natural Language Processing (NLP) to improve the performance of pre-trained language models on specific downstream tasks. It involves reusing the pre-trained model’s parameters and fine-tuning them on a smaller dataset, which saves computational resources and time compared to training the entire model from scratch. PEFT achieves this efficiency by freezing some of the layers of the pre-trained model and only fine-tuning the last few layers that are specific to the downstream task. This way, the model can be adapted to new tasks with less computational overhead and fewer labeled examples. Although PEFT has been a relatively novel concept, updating the last layer of models has been in practice in the field of computer vision since the introduction of transfer learning. Even in NLP, experiments with static and non-static word embeddings were carried out early on. Parameter-efficient fine-tuning aims to improve the performance of pre-trained models, such as BERT and RoBERTa, on various downstream tasks, including sentiment analysis, named entity recognition, and question-answering. It achieves this in low- resource settings with limited data and computational resources. It modifies only a small subset of model parameters and is less prone to overfitting. What is the difference between fine-tuning and parameter-efficient fine-tuning? Fine-tuning and parameter-efficient fine-tuning are two approaches used in machine learning to improve the performance of pre-trained models on a specific task.
  • 4. 4/19 Fine-tuning is taking a pre-trained model and training it further on a new task with new data. The entire pre-trained model is usually trained in fine-tuning, including all its layers and parameters. This process can be computationally expensive and time-consuming, especially for large models. On the other hand, parameter-efficient fine-tuning is a method of fine-tuning that focuses on training only a subset of the pre-trained model’s parameters. This approach involves identifying the most important parameters for the new task and only updating those parameters during training. Doing so, PEFT can significantly reduce the computation required for fine-tuning. Contact LeewayHertz for AI consultancy and development Optimize AI model performance on any task with PEFT, without the need for extensive retraining or large-scale parameter updates Learn More Parameter-efficient Fine- tuning Standard Fine-tuning Goal Improve the performance of a pre-trained model on a specific task with limited data and computation Improve the performance of a pre-trained model on a specific task with ample data and computation Training Data Small dataset (fewer examples) Large dataset (many examples) Training Time Faster training time as compared to fine-tuning Longer training time as compared to PEFT Computational Resources Uses fewer computational resources Requires larger computational resources Model Parameters Modifies only a small subset of model parameters Re-trains the entire model Overfitting Less prone to overfitting as the model is not excessively modified More prone to overfitting as the model is extensively modified Training Performance Not as good as fine-tuning, but still good enough Typically results in better performance than PEFT Use Cases Ideal for low-resource settings or where large amounts of training data are not available Ideal for high-resource settings with ample training data and computational resources Parameter-efficient fine-tuning can be particularly useful in scenarios where computational resources are limited or where large pre-trained models are involved. In such cases, PEFT can provide a more efficient way of fine-tuning the model without
  • 5. 5/19 sacrificing performance. However, it’s important to note that PEFT may sometimes achieve a different level of performance than full fine-tuning, especially in cases where the pre-trained model requires significant modification to perform well on the new task. Benefits of PEFT Here we will discuss the benefits of PEFT in relation to traditional fine-tuning. So, let us understand why parameter-efficient fine-tuning is more beneficial than fine-tuning. 1. Decreased computational and storage costs: PEFT involves fine-tuning only a small number of extra model parameters while freezing most parameters of the pre- trained LLMs, thereby reducing computational and storage costs significantly. 2. Overcoming catastrophic forgetting: During full fine-tuning of LLMs, catastrophic forgetting can occur where the model forgets the knowledge it learned during pretraining. PEFT stands to overcome this issue by only updating a few parameters. 3. Better performance in low-data regimes: PEFT approaches have been shown to perform better than full fine-tuning in low-data regimes and generalize better to out- of-domain scenarios. 4. Portability: PEFT methods enable users to obtain tiny checkpoints worth a few MBs compared to the large checkpoints of full fine-tuning. This makes the trained weights from PEFT approaches easy to deploy and use for multiple tasks without replacing the entire model. 5. Performance comparable to full fine-tuning: PEFT enables achieving comparable performance to full fine-tuning with only small number of trainable parameters. PEFT: A better alternative to standard fine-tuning A standard fine-tuning process involves adjusting the hidden representations (h) extracted by transformer models to enhance their performance in downstream tasks. These hidden representations refer to any features the transformer architecture extracts, such as the output of a transformer layer or a self-attention layer. Before Fine-Tuning This is a total waste of money Embedding Layer Transformer Layer 1 Transformer Layer 2 Transformer Layer N [CLS] h LeewayHertz
To illustrate, suppose we have the input sentence “This is a total waste of money.” Before fine-tuning, the transformer model computes the hidden representations (h) of each token in the sentence. After fine-tuning, the model’s parameters are updated, and the updated parameters generate a different set of hidden representations, denoted h’. Thus, even for the same sentence, the hidden representations produced by the pre-trained and fine-tuned models differ.

[Figure: After fine-tuning. The same sentence passes through the embedding layer and transformer layers 1 through N, now producing the updated representation h’ at the [CLS] token, which feeds a classifier head.]

In essence, fine-tuning modifies the pre-trained language model’s hidden representations to make them more suitable for downstream tasks. However, fine-tuning all the parameters in the model is not necessary to achieve this: fine-tuning only a small fraction of the parameters is often sufficient to change the hidden representations from h to h’.

Parameter-efficient fine-tuning techniques

Presently, the following PEFT methods are the ones commonly employed, though research into new methods is ongoing.

Adapter

Adapters are small submodules added to pre-trained language models to modify their hidden representations during fine-tuning. By inserting adapters after the multi-head attention and feed-forward layers in the transformer architecture, we can update only the adapter parameters during fine-tuning while keeping the rest of the model frozen.

Adopting adapters is straightforward: add an adapter to each transformer layer and place a classifier head on top of the pre-trained model. By updating only the parameters of the adapters and the classifier head, we can improve the pre-trained model’s performance on a particular task without updating the entire model, saving time and computational resources while still producing strong results.
How does fine-tuning using an adapter work?

The adapter module comprises two feed-forward projection layers connected by a non-linear activation layer, plus a skip connection that bypasses the feed-forward layers.

If we take the adapter placed right after the multi-head attention layer, the input to the adapter is the hidden representation h calculated by the multi-head attention layer. Inside the adapter, h takes two different paths: the skip connection, which leaves the input unchanged, and the feed-forward path.

[Figure: Adapter module. A feed-forward down-projection, a nonlinearity, and a feed-forward up-projection, with a skip connection around them; adapters sit after the multi-head attention and feed-forward sublayers (each followed by layer norm), and only the adapters are updated, giving h’ = h + Adapter(h).]

First, one feed-forward layer projects h into a lower-dimensional space, with dimension smaller than that of h. The input then passes through a non-linear activation function, and the second feed-forward layer projects it back up to the dimensionality of h. The results of the two paths are summed to obtain the final output of the adapter module.

The skip connection preserves the adapter’s original input h, while the feed-forward path generates an incremental change, Δh, based on h. By adding the incremental change Δh to the original h from the previous layer, the adapter modifies the hidden representation calculated by the pre-trained model, thereby changing the model’s output for a specific task.
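A minimal PyTorch sketch of such a bottleneck adapter is shown below; the hidden size and bottleneck size are illustrative assumptions rather than values prescribed by the method:

import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, skip connection."""
    def __init__(self, hidden_size=768, bottleneck_size=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)  # project h into a low-dimensional space
        self.up = nn.Linear(bottleneck_size, hidden_size)    # project back up to the dimension of h
        self.act = nn.ReLU()

    def forward(self, h):
        delta_h = self.up(self.act(self.down(h)))  # incremental change computed from h
        return h + delta_h                         # skip connection: h' = h + delta_h

During fine-tuning, only the adapter parameters (and the classifier head) receive gradients; the surrounding transformer weights stay frozen.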
LoRA

Low-Rank Adaptation (LoRA) of large language models is another approach to fine-tuning models for specific tasks or domains. Like adapters, LoRA adds small trainable submodules to the transformer architecture. It freezes the pre-trained model weights and injects trainable rank-decomposition matrices into each layer, greatly reducing the number of trainable parameters for downstream tasks. This method can reduce the number of trainable parameters by up to 10,000 times and the GPU memory requirement by 3 times, while performing on par with or better than full fine-tuning on various tasks. LoRA also allows more efficient task switching, lowers the hardware barrier to entry, and, unlike some other methods, adds no extra inference latency, since the learned matrices can be merged into the frozen weights.

How does it work?

LoRA modules are inserted in parallel to modules in the pre-trained transformer model, in this illustration in parallel to the feed-forward layers. A feed-forward layer consists of two projection layers with a non-linear layer between them, where the input vector is projected into an output vector of a different dimensionality via an affine transformation. The LoRA layers sit next to each of the two feed-forward projection layers.

[Figure: LoRA layers placed in parallel to the feed-forward down-project and up-project layers, with the nonlinearity in between; each LoRA output is added to the corresponding projection’s output.]
Now, consider the feed-forward up-project layer and the LoRA module next to it. The original feed-forward layer takes the output of the previous layer, a vector of dimension d_FFW, and projects it into a vector of dimension d_model (FFW here abbreviates feed-forward). The LoRA module placed next to it consists of two feed-forward layers of its own. LoRA’s first feed-forward layer takes the same input as the feed-forward up-project layer and projects it into an r-dimensional vector, with r far smaller than d_FFW. The second feed-forward layer then projects this vector into a vector of dimension d_model. Finally, the two d_model-dimensional vectors are added together to form the final representation.

[Figure: The frozen feed-forward up-project layer maps a d_FFW-dimensional input to d_model dimensions; the parallel LoRA path maps the same input down to r dimensions and back up to d_model, and the two outputs are summed: h’ = h + Δh.]

As discussed earlier, fine-tuning means changing the hidden representation h calculated by the original transformer model. In this case, the hidden representation calculated by the feed-forward up-project layer of the original transformer is h, while the vector calculated by LoRA is the incremental change Δh used to modify it. The sum of the original representation and the incremental change is the updated hidden representation h’. By inserting LoRA modules next to the feed-forward layers and a classifier head on top of the pre-trained model, the task-specific parameters for each task are kept to a minimum.
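The following PyTorch sketch shows this parallel low-rank path wrapping a frozen linear layer; the rank r, the scaling factor, and the zero initialization of the second matrix are conventional choices, used here as illustrative assumptions:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer with a trainable low-rank update in parallel."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                              # pre-trained weights stay frozen
        self.lora_a = nn.Linear(base.in_features, r, bias=False)   # d_FFW -> r
        self.lora_b = nn.Linear(r, base.out_features, bias=False)  # r -> d_model
        nn.init.zeros_(self.lora_b.weight)                       # start with delta_h = 0
        self.scale = alpha / r

    def forward(self, x):
        h = self.base(x)                                         # original representation h
        delta_h = self.lora_b(self.lora_a(x)) * self.scale       # low-rank increment
        return h + delta_h                                       # h' = h + delta_h

# e.g., wrapping an up-project layer: layer = LoRALinear(nn.Linear(4096, 1024))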
Prefix tuning

Prefix-tuning is a lightweight alternative to fine-tuning large pre-trained language models for natural language generation tasks. Fine-tuning requires updating and storing all the model parameters for each task, which can be very expensive given the size of current models. Prefix-tuning instead keeps the language model parameters frozen and optimizes a small continuous task-specific vector called the prefix. The prefix is a set of free parameters trained along with the language model; the goal is to find a context that steers the language model toward generating text that solves a particular task.

[Figure: Trainable prefix vectors prepended at each transformer layer, ahead of the [BOS] token and the input sentence, which pass through the embedding layer and transformer layers 1 through N.]

The prefix can be seen as a sequence of “virtual tokens” that subsequent tokens can attend to. By learning only 0.1% of the parameters, prefix-tuning obtains performance comparable to fine-tuning in the full-data setting, outperforms fine-tuning in low-data settings, and extrapolates better to examples with topics unseen during training.

As with the previously mentioned PEFT techniques, the end goal of prefix tuning is to reach h’. Prefix tuning uses prefixes to modify the hidden representations extracted by the original pre-trained language model: when the incremental change Δh is added to the original hidden representation h, we get the modified representation h’. During prefix tuning, only the prefixes are updated, while the rest of the layers are frozen.

Prompt tuning

Prompt tuning is another PEFT technique for adapting pre-trained language models to specific downstream tasks. Unlike the traditional “model tuning” approach, in which all the pre-trained model parameters are tuned for each task, prompt tuning learns soft prompts through backpropagation, fine-tuned for specific tasks by incorporating labeled examples. Prompt tuning outperforms GPT-3’s few-shot learning and becomes more competitive as model size increases. It also improves robustness to domain transfer and enables efficient prompt ensembling. Because it stores only a small task-specific prompt for each task, a single frozen model can be reused across many downstream tasks, unlike model tuning, which requires a task-specific copy of the entire pre-trained model for each task.
How does it work?

Prompt tuning is a simpler variant of prefix tuning: vectors are prepended to the sequence only at the input layer. Given an input sentence, the embedding layer converts each token into its word embedding, and the prefix embeddings are prepended to the sequence of token embeddings. The pre-trained transformer layers then process the embedding sequence just as they would a normal sequence. During fine-tuning, only the prefix embeddings are adjusted, while the rest of the transformer model is kept frozen.

[Figure: Trainable prefix embeddings prepended to the [BOS] token and input-sequence embeddings at the embedding layer only; transformer layers 1 through N remain unchanged.]

This technique has several advantages over traditional fine-tuning, including improved efficiency and reduced computational overhead. Moreover, because only the prefix embeddings are fine-tuned, the risk of overfitting to the training data is lower, producing more robust and generalizable models. A sketch of this shared soft-prompt mechanism appears after the next subsection.

P-tuning

P-tuning can improve the performance of language models such as GPTs on Natural Language Understanding (NLU) tasks. Traditional fine-tuning techniques have not been effective for GPTs on such tasks, but P-tuning uses trainable continuous prompt embeddings to improve their performance. The method has been tested on two NLU benchmarks, LAMA and SuperGLUE, and has shown significant improvements in precision and world-knowledge recovery. P-tuning also reduces the need for prompt engineering and outperforms state-of-the-art approaches on the few-shot SuperGLUE benchmark.

P-tuning can be used to improve pre-trained language models for various tasks, from sentence classification to predicting a country’s capital. The technique augments the input embeddings of the pre-trained language model with differentiable prompt embeddings generated from a prompt. These continuous prompts are optimized via a downstream loss function and a prompt encoder, which helps solve the discreteness and association challenges of purely discrete prompts.
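Prefix tuning, prompt tuning, and P-tuning all share the same underlying mechanism: trainable continuous vectors are prepended to the sequence while the model itself stays frozen. Here is a minimal PyTorch sketch of that mechanism at the input layer; the prompt length and embedding size are illustrative assumptions:

import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable prompt embeddings prepended to the input token embeddings."""
    def __init__(self, num_virtual_tokens=20, embed_dim=768):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(num_virtual_tokens, embed_dim) * 0.02)

    def forward(self, token_embeds):  # token_embeds: (batch, seq_len, embed_dim)
        batch = token_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, token_embeds], dim=1)  # virtual tokens come first

The frozen transformer then processes the extended sequence as usual; only self.prompt receives gradient updates. Prefix tuning additionally injects such vectors at every layer, and P-tuning generates them with a small prompt encoder, but the prepend-and-freeze pattern is the same.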
IA3

IA3, short for Infused Adapter by Inhibiting and Amplifying Inner Activations, is a parameter-efficient fine-tuning technique designed to improve upon LoRA. It makes fine-tuning more efficient by further reducing the number of trainable parameters in a model.

LoRA and IA3 share the same core objective of improving fine-tuning efficiency: both introduce learned components, reduce the number of trainable parameters, and keep the original pre-trained weights frozen. These shared characteristics make both techniques valuable for adapting large pre-trained models to specific tasks while minimizing computational demands. Both also prioritize maintaining model performance, ensuring that fine-tuned models remain competitive with fully fine-tuned ones, and both can merge their learned weights into the base model without adding inference latency, which makes them practical for real-time applications and a range of downstream tasks.

How does IA3 work?

IA3 rescales the inner activations of a pre-trained model using learned vectors, which are incorporated into the attention and feed-forward modules of a standard transformer architecture. The key point is that the original pre-trained weights stay frozen; only the introduced learned vectors are trainable during fine-tuning. This drastic reduction in trainable parameters significantly improves fine-tuning efficiency without compromising model performance. IA3 is compatible with various downstream tasks, maintains inference speed, and can be applied to specific layers of a neural network, making it a valuable tool for efficient model adaptation and deployment.
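The building block behind this is very small. Below is a minimal PyTorch sketch of the rescaling idea; the module is an illustrative stand-in, not the library implementation, and in a real transformer such vectors would be applied to the attention keys and values and to the intermediate feed-forward activations:

import torch
import torch.nn as nn

class IA3Scale(nn.Module):
    """(IA)^3-style rescaling: multiply an activation elementwise by a learned vector."""
    def __init__(self, dim):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(dim))  # initialized to ones, so the model starts unchanged

    def forward(self, activations):  # activations: (batch, seq_len, dim)
        return activations * self.scale

Because each vector has only as many entries as the dimension it rescales, IA3 adds far fewer trainable parameters than even LoRA.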
Training your model using PEFT

In this example, we use LoRA to fine-tune a pre-trained sequence-to-sequence language model to generate text for a specific task, in this case classifying Twitter complaints.

Import the dependencies and define the variables

First, import the necessary libraries, modules and other dependencies, such as AutoModelForSeq2SeqLM, PeftModel, torch, datasets and AutoTokenizer. The code looks like this:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, default_data_collator, get_linear_schedule_with_warmup
from peft import PeftModel, PeftConfig
import torch
from datasets import load_dataset
import os
from torch.utils.data import DataLoader
from tqdm import tqdm

Next, define the name of the dataset, the text column name, the label column name, and the batch size for training the model.

dataset_name = "twitter_complaints"
text_column = "Tweet text"
label_column = "text_label"
batch_size = 8

Now, define the pre-trained PEFT model and load its configuration:

peft_model_id = "smangrul/twitter_complaints_bigscience_T0_3B_LORA_SEQ_2_SEQ_LM"
config = PeftConfig.from_pretrained(peft_model_id)

Here, the peft_model_id variable contains the ID of the pre-trained model, and config is set to the model’s configuration. Next, set the maximum memory allowed for each device; say, the GPU is allowed up to 6GB of memory and the CPU up to 30GB:

max_memory = {0: "6GIB", 1: "0GIB", 2: "0GIB", 3: "0GIB", 4: "0GIB", "cpu": "30GB"}

Load the base model of the pre-trained PEFT model specified by peft_model_id:

model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path, device_map="auto", max_memory=max_memory)
In the command above, the AutoModelForSeq2SeqLM class loads the base model, and the from_pretrained function loads the pre-trained weights. The device_map argument specifies the mapping between devices and model components, and max_memory specifies the maximum memory allowed for each device.

Next, load the full PEFT model specified by peft_model_id:

model = PeftModel.from_pretrained(model, peft_model_id, device_map="auto", max_memory=max_memory)

Preprocess the data

Map the dataset labels to human-readable class names: The first preprocessing step is to map the dataset labels to human-readable class names by replacing the underscores in the label names of the training set with spaces. (This assumes the dataset has already been loaded, e.g. with dataset = load_dataset("ought/raft", dataset_name), the RAFT benchmark task this example is based on.)

classes = [k.replace("_", " ") for k in dataset["train"].features["Label"].names]
print(classes)

Then, map the labels to the human-readable class names:

dataset = dataset.map(
    lambda x: {"text_label": [classes[label] for label in x["Label"]]},
    batched=True,
    num_proc=1,
)
print(dataset)
dataset["train"][0]

Tokenization: First, load a pre-trained tokenizer from the transformers library. We also set the maximum length of the target labels by tokenizing each class label and taking the length of the resulting list of token IDs; this is used later to pad all labels to a consistent length.

tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
target_max_length = max([len(tokenizer(class_label)["input_ids"]) for class_label in classes])
The following function extracts the text and target labels from the input examples, tokenizes the text with the pre-trained tokenizer, and pads the labels to a consistent length:

def preprocess_function(examples):
    inputs = examples[text_column]
    targets = examples[label_column]
    model_inputs = tokenizer(inputs, truncation=True)
    labels = tokenizer(
        targets, max_length=target_max_length, padding="max_length", truncation=True, return_tensors="pt"
    )
    labels = labels["input_ids"]
    labels[labels == tokenizer.pad_token_id] = -100
    model_inputs["labels"] = labels
    return model_inputs

Apply the preprocessing to the whole dataset to prepare it for fine-tuning:

processed_datasets = dataset.map(
    preprocess_function,
    batched=True,
    num_proc=1,
    remove_columns=dataset["train"].column_names,
    load_from_cache_file=True,
    desc="Running tokenizer on dataset",
)

Now, split the preprocessed dataset into separate training, evaluation, and test sets:

train_dataset = processed_datasets["train"]
eval_dataset = processed_datasets["eval"]
test_dataset = processed_datasets["test"]

Define a collate function: Next, define a collate function to gather and combine the preprocessed examples into batches.

def collate_fn(examples):
    return tokenizer.pad(examples, padding="longest", return_tensors="pt")

Then define the data loaders for the training, evaluation, and test datasets:

train_dataloader = DataLoader(
    train_dataset, shuffle=True, collate_fn=collate_fn, batch_size=batch_size, pin_memory=True
)
eval_dataloader = DataLoader(eval_dataset, collate_fn=collate_fn, batch_size=batch_size, pin_memory=True)
test_dataloader = DataLoader(test_dataset, collate_fn=collate_fn, batch_size=batch_size, pin_memory=True)

Model training and evaluation

To train the model on the preprocessed dataset, first define the training specifications, such as the optimizer, learning-rate schedule and number of epochs (a minimal training-loop sketch follows the evaluation snippet below). Once trained, evaluate the model on its intended task:

model.eval()
i = 15
inputs = tokenizer(f'{text_column} : {dataset["test"][i]["Tweet text"]} Label : ', return_tensors="pt")
print(inputs)

with torch.no_grad():
    outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=10)
    print(outputs)
    print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))
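For completeness, here is a minimal sketch of the training loop the prose above alludes to, using the scheduler utility imported earlier. The learning rate, the epoch count, and the use of model.device are illustrative assumptions rather than values from the original example:

from torch.optim import AdamW

num_epochs = 3  # illustrative value
optimizer = AdamW(filter(lambda p: p.requires_grad, model.parameters()), lr=3e-4)
lr_scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=len(train_dataloader) * num_epochs
)

model.train()
for epoch in range(num_epochs):
    total_loss = 0.0
    for batch in tqdm(train_dataloader):
        batch = {k: v.to(model.device) for k, v in batch.items()}
        outputs = model(**batch)  # the seq2seq loss is computed from the -100-masked labels
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        total_loss += loss.item()
    print(f"epoch {epoch}: mean loss {total_loss / len(train_dataloader):.4f}")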
Assessing the performance of a fine-tuned machine learning model is an essential step, and a common way to do so is to check its accuracy on an evaluation dataset. You can refer to this GitHub repository to view the entire evaluation process, including the code for calculating these metrics.

Few-shot In-context Learning (ICL) vs. Parameter-efficient Fine-tuning (PEFT)

Few-shot in-context learning and parameter-efficient fine-tuning are both approaches used to adapt natural language processing models. Although both enable pre-trained language models to perform new tasks without extensive training, the two approaches are technically quite different. The first, ICL, lets the model perform a new task simply by feeding it prompted examples, without any gradient-based training; however, processing those prompted examples at every prediction incurs significant computational, memory, and storage costs. The second, PEFT, trains a small number of added or selected parameters so that a model can perform a new task with minimal updates.

ICL aims to improve the few-shot performance of pre-trained language models by providing contextual information at inference time rather than through gradient updates. This contextual information can take the form of additional sentences or paragraphs, typically demonstrations of the task, included in the input. The idea is to use this context to enhance the model’s ability to generalize to new tasks even with very few examples.

Parameter-efficient fine-tuning, on the other hand, improves the efficiency of adapting pre-trained language models to downstream tasks by freezing most of the model’s parameters and fine-tuning only a small, carefully chosen subset on a small amount of data. By keeping most parameters frozen, the model retains more of its pre-trained knowledge and is less likely to overfit, improving its performance on downstream tasks with limited training data.

Is PEFT more efficient than ICL?

Few-shot learning, where models must quickly adapt to new tasks from a handful of training examples, is an important challenge for natural language processing applications, and various approaches have been proposed to tackle it, with ICL among the most popular. However, a 2022 research paper introduced a parameter-efficient few-shot learning approach that outperforms ICL in accuracy while requiring significantly fewer computational resources.
One of the main reasons this PEFT recipe outperforms ICL is its use of a novel scaling method, (IA)^3, which rescales inner activations with learned vectors. This technique performs better than fine-tuning the full model while introducing only a handful of additional parameters. ICL, by contrast, performs no training at all and must re-process the prompted examples for every prediction, which is costly and can yield unreliable accuracy.

Another reason PEFT does better is its use of two additional loss terms: one encourages the model to assign lower probabilities to incorrect answer choices, and the other accounts for the differing lengths of answer choices. These loss terms help the model generalize to new tasks and avoid overfitting. A rough sketch of these two terms is given below.

In addition to its superior accuracy, parameter-efficient fine-tuning is also far more computationally efficient than ICL. The paper reports that PEFT uses over 1,000x fewer floating-point operations (FLOPs) during inference than few-shot ICL with GPT-3 and requires only 30 minutes of training on a single NVIDIA A100 GPU. This makes PEFT a more practical and scalable solution for real-world NLP applications.

Overall, this line of work represents a significant advancement in few-shot learning for NLP. Its use of (IA)^3 scaling, additional loss terms, and superior computational efficiency make PEFT a better alternative to ICL for tasks that require rapid adaptation to new few-shot scenarios.
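To make those two loss terms concrete, here is a rough PyTorch sketch of how they can be computed from per-choice token log-probabilities. The function and tensor layout are hypothetical, and the paper’s exact formulation differs in its details; treat this as a sketch of the idea, not the paper’s implementation:

import torch
import torch.nn.functional as F

def extra_losses(logprobs_per_choice, correct_idx):
    # logprobs_per_choice: list of 1-D tensors; entry i holds the model's
    # token log-probabilities for answer choice i (hypothetical inputs).

    # Length-normalized loss: score each choice by its mean token log-probability,
    # then apply cross-entropy so short and long answers compete fairly.
    scores = torch.stack([lp.mean() for lp in logprobs_per_choice])
    ln_loss = F.cross_entropy(scores.unsqueeze(0), torch.tensor([correct_idx]))

    # Unlikelihood loss: push down the probability of every token
    # belonging to an incorrect answer choice.
    incorrect = [lp for i, lp in enumerate(logprobs_per_choice) if i != correct_idx]
    n_tokens = sum(lp.numel() for lp in incorrect)
    ul_loss = -sum(torch.log1p(-lp.exp()).sum() for lp in incorrect) / n_tokens
    return ln_loss, ul_loss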
The process of parameter-efficient fine-tuning

The exact steps involved in parameter-efficient fine-tuning vary with the specific implementation and the pre-trained model being used, but a general outline looks like this:

Pre-training: A large-scale model is first pre-trained on a large dataset using a general task such as image classification or language modeling. This pre-training phase helps the model learn meaningful representations and features from the data.

Task-specific dataset: Gather or create a dataset specific to the target task you want to fine-tune the pre-trained model for. The dataset should be labeled and representative of the target task.

Parameter identification: Identify or estimate the importance or relevance of the pre-trained model’s parameters for the target task, which determines which parameters to prioritize during fine-tuning. Techniques such as importance estimation, sensitivity analysis, or gradient-based methods can be used here.

Subset selection: Select a subset of the pre-trained model’s parameters based on their importance or relevance to the target task, for example by thresholding the importance scores or choosing the top-k most important parameters.

Fine-tuning: Initialize the selected subset of parameters with the values from the pre-trained model and freeze the remaining parameters. Fine-tune the selected parameters on the task-specific dataset, typically using optimizers such as Stochastic Gradient Descent (SGD) or Adam.

Evaluation: Evaluate the fine-tuned model on a validation set, or with other evaluation metrics relevant to the target task, to assess how effectively PEFT achieves the desired performance while using fewer parameters.

Iterative refinement (optional): Depending on the performance and requirements, you may iterate on the PEFT process by adjusting the parameter-selection criteria, exploring different subsets, or fine-tuning for additional epochs to optimize the model further.

Note that the specific implementation details and techniques used in PEFT vary across research papers as well as applications.

Endnote

PEFT, or Parameter-efficient Fine-tuning, is a technique for improving the performance of pre-trained language models on specific downstream tasks without retraining them wholesale: most of the pre-trained model is frozen, and only a small set of task-specific parameters is fine-tuned. This makes it more practical than traditional fine-tuning in several ways, including decreased computational and storage costs, resistance to catastrophic forgetting, and performance comparable to full fine-tuning with a small number of trainable parameters. Overall, PEFT is a promising approach to improving the efficiency and effectiveness of NLP models across a wide range of applications.