A step-by-step guide for LLM fine-tuning using PEFT and bitsandbytes
The article details the process of fine-tuning a Large Language Model (LLM), Falcon 7b, to convert data from an unstructured to a structured format. It uses a synthetic dataset generated with LangChain in our previous article, "Using LangChain to Generate Cost-Effective Datasets for Fine-Tuning Large Language Models." The article is structured into the following steps:
- Loading Data: The dataset, containing both unstructured and structured text, is loaded and preprocessed.
- Setup: Necessary modules are installed, including bitsandbytes, transformers, and accelerate. Afterward, packages are imported and a Hugging Face login is prepared so the fine-tuned model can be saved to the Hub.
- Model Configuration: QLoRA config: Introduces QLoRA, which simplifies weight storage by categorizing neural network weights into 'bins' for efficient storage and back-propagation. LoRA config: Details the use of LoRA, specifying the placement of the adaptation matrices, the rank of the adaptation matrix, a scaling parameter, and the task designation. Generation config: Provides guidance for the model during text-sequence generation, focusing on controlling randomness and sequence limits.
- Data Preprocessing: A template is created using unstructured text (user_input) and structured text (user_plan).
- Training: Training parameters are set up, followed by model training using the Transformers library.
- Saving and Loading: The trained model can be saved locally or pushed to the Hugging Face Hub. Once loaded, the generation configuration is established and a prompt template is set.
- Inference: Demonstrates querying the fine-tuned model using torch's inference mode to generate structured text from unstructured input.
In essence, the article provides a comprehensive guide to transforming unstructured text into structured data using Falcon 7b, complete with detailed steps and code snippets. The full code is available in Colab.
Step 0: Loading Data. We will first load the dataset.csv file and preprocess it.
import pandas as pd

df = pd.read_csv('dataset.csv', encoding='utf-8', usecols=['structured', 'unstructured'])
df = df.drop(df.index[0])                               # drop the first row
df = df.dropna(subset=['structured', 'unstructured'])   # remove incomplete examples
df.to_csv('dataset_clean.csv', index=False)
Here's a glimpse of what the sample data from this dataset looks like:
Unstructured: So, I'm looking for a winter backpacking adventure in Tokyo, and I'm not afraid to go alone! I'm hoping to find a hostel to stay in, and I'm going to book it through an online platform. I'm looking for a private tour guide who speaks English, and I'm sure I'll have a great time! After all, what could possibly go wrong with a budget of $5000?
Structured: Travelers: 1 persons, Accommodation: hostel, Booking Mode: online platform, Travel Type: backpacking, Duration: 2 weeks, Season: winter, Language Preference: English, Guide Preference: private tour, Destinations: Tokyo, Budget: $5000.
Unstructured: Ah, what a grand adventure I have planned! Seven of us are setting off to explore the wonders of Paris, Tokyo, and beyond in the glorious springtime! We shall be traveling for 23 days and have a budget of $15,000, so there will be plenty of room for sightseeing, folk performances, and other exciting activities. We shall be sure to take in the sights, sounds, and cultures of these two amazing cities! Who knows what wonders await us?
Structured: Cultural Interest: folk performances, Budget: $15000, Destinations: Paris, Tokyo, Duration: 23 days, Season: spring, Activities: sightseeing, Travelers: 7 persons.
Step 1: Setup
First, let's install the required modules: bitsandbytes, transformers, and accelerate. Please note that you'll need a GPU with at least 16GB of memory for this to function correctly.
!pip install -Uqqq pip
!pip install -qqq bitsandbytes==0.39.0
!pip install -qqq torch==2.0.1
!pip install -qqq -U git+https://github.com/huggingface/transformers.git@e03a9cc
!pip install -qqq -U git+https://github.com/huggingface/peft.git@42a184f
!pip install -qqq -U git+https://github.com/huggingface/accelerate.git@c9fbb71
!pip install -qqq datasets==2.12.0
!pip install -qqq loralib==0.1.1
!pip install -qqq einops==0.6.1
Let us import the necessary packages:
import json
import os
from pprint import pprint

import bitsandbytes as bnb
import pandas as pd
import torch
import torch.nn as nn
import transformers
from datasets import load_dataset
from huggingface_hub import notebook_login
from peft import (
    LoraConfig,
    PeftConfig,
    PeftModel,
    get_peft_model,
    prepare_model_for_kbit_training,
)
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)

os.environ["CUDA_VISIBLE_DEVICES"] = "0"
We will log in to Hugging Face to save the fine-tuned model. Note that only the adapter config and adaptation matrices will be saved.
notebook_login()
Step 2: Model Configuration
We'll be working with the falcon-7b-instruct model. We will define three configurations: one for QLoRA, one for LoRA, and finally a generation config for text generation.
QLoRA config: Neural network weights commonly adhere to a normal distribution. QLoRA simplifies weight storage: instead of using precise 32-bit or 16-bit floating-point formats, weights are categorized into "bins" by their values. These bins can then be represented in more compact formats such as int/float4 or int/float8, a technique known as quantization, often dubbed normal float 4 or 8. An added layer, double quantization, also quantizes the quantization constants themselves to conserve further memory. Although the weights are stored quantized, during the forward and backward passes they are de-quantized to a 16-bit format for computation, and the de-quantized copies are discarded afterward. Dive deeper into quantization with this Hugging Face guide.
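To make the idea of binning concrete, here is a minimal illustrative sketch (a toy uniform absmax scheme; bitsandbytes' NF4 uses non-uniform bins matched to a normal distribution): a weight vector is mapped to 4-bit integer bins and then de-quantized back to floats for computation.

import torch

weights = torch.randn(8)                 # pretend layer weights, roughly Normal(0, 1)
absmax = weights.abs().max()             # per-block scale factor
half_levels = 2 ** 4 // 2 - 1            # 4 bits, symmetric bins in -7..7
quantized = torch.round(weights / absmax * half_levels).to(torch.int8)  # 4-bit bin indices
dequantized = quantized.float() / half_levels * absmax                  # restored for computation

print(weights)
print(dequantized)  # close to the original values, but stored with only 4 bits per weight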
With the assistance of bitsandbytes, we'll load our model in 4-bit format using normal float 4 and employ double quantization.
model = "tiiuae/falcon-7b-instruct bnb_config = BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16 ) model = AutoModelForCausalLM.from_pretrained( MODEL_NAME, device_map="auto", trust_remote_code=True, quantization_config=bnb_config ) tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME) tokenizer.pad_token = tokenizer.eos_token"
The tokenizer's padding token is set to its end-of-sequence (EOS) token. Gradient checkpointing is a technique that trades computation time for memory. Instead of storing all intermediate activations from the forward pass (as is typically done for back-propagation), it stores only a subset of them and recomputes the rest on the fly during the backward pass. This significantly reduces memory consumption at the expense of additional computation.
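Before enabling it on the model below, here is a small standalone sketch of the trade-off using torch.utils.checkpoint (an illustration only, with a toy block, not the Falcon model): the checkpointed block does not keep its intermediate activations and recomputes them during the backward pass.

import torch
from torch.utils.checkpoint import checkpoint

# A toy block whose intermediate activations we choose not to store.
block = torch.nn.Sequential(
    torch.nn.Linear(128, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 128),
)
x = torch.randn(4, 128, requires_grad=True)

out = checkpoint(block, x, use_reentrant=False)  # forward pass; activations inside the block are discarded
out.sum().backward()                             # backward pass recomputes them to obtain gradients

On our model, enabling checkpointing and preparing it for k-bit training takes two calls: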
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)
This step preprocesses the model to ready it for training.
LoRA config: Next, we will set up the configuration for LoRA. For an in-depth understanding, refer to the relevant paper and the blog by Hugging Face. When using LoRA, it's essential to pinpoint where the adaptation matrices should be placed; in our scenario, they are attached to the query, key, and value projections. We must designate the rank parameter, r, which determines the rank of the adaptation matrices; for our purposes, we'll set r to 16. Additionally, there's a scaling parameter known as alpha, which modulates how strongly the learned update is applied: the update is scaled by alpha/r, so larger alpha values give the adaptation matrices more influence. We've opted for an alpha value of 32. We also set the dropout rate at 5% and designate the task as causal language modeling.
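To make the scaling concrete, here is a minimal sketch of the LoRA update (an illustration of the math with a hypothetical hidden size, not the PEFT internals): the frozen weight W is augmented by a rank-r update B @ A scaled by alpha / r, and only A and B are trained.

import torch

d, r, alpha = 512, 16, 32        # hypothetical hidden size plus the r and alpha used above
W = torch.randn(d, d)            # frozen pretrained weight (never updated)
A = torch.randn(r, d) * 0.01     # trainable low-rank factor, shape (r, d)
B = torch.zeros(d, r)            # trainable low-rank factor, shape (d, r), initialized to zero

delta_W = (alpha / r) * (B @ A)  # rank-r update, scaled by alpha / r
W_adapted = W + delta_W          # effective weight used in the forward pass

# Only A and B (2 * d * r parameters) are trained instead of all d * d entries of W.

The actual PEFT configuration for our model is below.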
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # report how many parameters are actually trainable
Generation Config: Next, we will set up the generation configuration for the LLM. This configuration guides the model when generating new text sequences. We've limited the maximum number of tokens to generate to 400 and adjusted the temperature to 0.5, making the generation more deterministic since values closer to 0 reduce randomness. Utilizing "nucleus sampling", the model won't sample from the entire token distribution. Instead, it will sample from a narrowed set of tokens whose combined probability surpasses a threshold, set here at 0.7. This approach adds a controlled element of randomness, effectively filtering out extremely improbable tokens. Moreover, we've stipulated that the model should produce only one output sequence by setting num_return_sequences to 1. Lastly, both the padding token ID and the end-of-sequence (EOS) token ID are set to the tokenizer's EOS token ID, ensuring that any sequence padding uses the EOS token and signifying sequence completion once the EOS token is generated.
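To see what the top_p threshold does in isolation, here is a toy illustration (assuming an already-sorted next-token distribution; the actual generation config is set just below): we keep only the smallest set of tokens whose cumulative probability reaches 0.7 and renormalize before sampling.

import torch

probs = torch.tensor([0.50, 0.25, 0.15, 0.07, 0.03])  # toy next-token probabilities, sorted descending
top_p = 0.7

cumulative = torch.cumsum(probs, dim=0)
keep = (cumulative - probs) < top_p          # keep tokens until the cumulative mass reaches top_p
nucleus = probs[keep] / probs[keep].sum()    # renormalize over the kept tokens

print(nucleus)  # sampling happens only within this "nucleus"; very improbable tokens are filtered out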
generation_config = model.generation_config
generation_config.max_new_tokens = 400
generation_config.temperature = 0.5
generation_config.top_p = 0.7
generation_config.num_return_sequences = 1
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id
Step 3: Preprocessing and Loading the Dataset
df = pd.read_csv('dataset.csv', encoding='utf-8', usecols=['structured', 'unstructured'])
df = df.drop(df.index[0])                               # drop the first row
df = df.dropna(subset=['structured', 'unstructured'])   # remove incomplete examples
df.to_csv('dataset_clean.csv', index=False)

data = load_dataset("csv", data_files="dataset_clean.csv")
We're next focusing on the creation of a textual template, which makes use of unstructured text labeled as 'user_input' and structured text labeled as 'user_plan'.
def generate_prompt(data_point):
    return f"""
<user_input>: {data_point["unstructured"]}
<user_plan>: {data_point["structured"]}
""".strip()


def generate_and_tokenize_prompt(data_point):
    full_prompt = generate_prompt(data_point)
    tokenized_full_prompt = tokenizer(full_prompt, padding=True, truncation=True)
    return tokenized_full_prompt


data = data["train"].shuffle().map(generate_and_tokenize_prompt)
Step 4: Initiating the Training Process
First and foremost, we need to configure the training parameters: batch size, learning rate, optimizer, learning-rate schedule, and so on. With a per-device batch size of 1 and 4 gradient-accumulation steps, the effective batch size is 4. Here's how it's done:
training_args = transformers.TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,
    fp16=True,
    save_total_limit=3,
    logging_steps=1,
    output_dir="experiments",
    optim="paged_adamw_8bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
)
With everything set up, we can dive into the actual training. The Transformers library simplifies this process for us, abstracting away most of the complexity.
trainer = transformers.Trainer(
    model=model,
    train_dataset=data,
    args=training_args,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

model.config.use_cache = False  # caching is incompatible with gradient checkpointing during training
trainer.train()
Step 5: Model Saving, Loading, and Inference
Firstly, to save your model locally, use:
model.save_pretrained("trained-model")
If you wish to upload the model to the Hugging Face Hub, utilize:
PEFT_MODEL = "asokraju/finetuned-falcon-7b" model.push_to_hub( PEFT_MODEL, use_auth_token=True )
To load the fine-tuned adapter back from the Hugging Face Hub, use:
PEFT_MODEL = "asokraju/finetuned-falcon-7b config = PeftConfig.from_pretrained(PEFT_MODEL) model = AutoModelForCausalLM.from_pretrained( config.base_model_name_or_path, return_dict=True, quantization_config=bnb_config, device_map="auto", trust_remote_code=True ) tokenizer=AutoTokenizer.from_pretrained(config.base_model_name_or_path) tokenizer.pad_token = tokenizer.eos_token model = PeftModel.from_pretrained(model, PEFT_MODEL)"
After loading the model, set up the generation configuration and establish a prompt template for user input:
import numpy as np

i = np.random.randint(5000)  # pick a random example to test

generation_config = model.generation_config
generation_config.max_new_tokens = 200
generation_config.temperature = 0.7
generation_config.top_p = 0.7
generation_config.num_return_sequences = 1
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id

# The prompt must follow the same template used during fine-tuning.
prompt = f"""
<user_input>: {data["unstructured"][i]}
<user_plan>:
""".strip()
Finally, to query the fine-tuned LLM and produce a response, leverage the torch inference mode:
device = "cuda:0"

encoding = tokenizer(prompt, return_tensors="pt").to(device)
with torch.inference_mode():
    outputs = model.generate(
        input_ids=encoding.input_ids,
        attention_mask=encoding.attention_mask,
        generation_config=generation_config,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Conclusion
In today's data-driven landscape, the ability to harness and transform unstructured data into structured insights is paramount. Through the process outlined in the article, we can achieve this transformation with the Falcon 7b Large Language Model. By systematically breaking down the steps, from data loading to inference, the tutorial offers a pragmatic approach to navigate the complexities of fine-tuning and deploying LLMs. Leveraging tools like Hugging Face and the power of the Transformers library, we can push the boundaries of what's achievable in the realm of natural language processing. Whether you're aiming to draw out structured insights from unstructured text or further your understanding of LLMs, the methods elucidated here serve as a valuable blueprint. As we continue to innovate, the fusion of techniques and technologies highlighted in this article promises to play an integral role in shaping the future of data processing and analysis.