Supervised Fine-tuning of Small Language Models using Agentic DUKE-based Model Distillation and Rank-Stabilized LoRA
Fine-tune Meta’s lightweight Llama small language models using Agent-based Document Understanding and Knowledge Extraction (DUKE) with Rank-Stabilized LoRA (rsLoRA) for model distillation.
Imagine a major industry trade show, featuring hundreds of vendors and drawing tens of thousands of attendees. Your mission is to harness the power of artificial intelligence to enhance the guest experience, enabling attendees to ask questions and instantly receive valuable insights about the latest products and services showcased at the event, all through a user-friendly mobile app.
The core requirements are clear: deliver information rapidly and accurately, scale effortlessly to accommodate any volume of user requests, and do so in the most cost-effective way possible. To achieve these goals, you can leverage a range of advanced generative AI architectures, such as retrieval-augmented generation (RAG), continued or continual pre-training (CPT), fine-tuning, or a combination of these.
Introduction
In this post, we’ll explore a novel technique I’ve coined, DUKE (Document Understanding and Knowledge Extraction), used in combination with Rank-Stabilized LoRA (rsLoRA), a variation of Low-Rank Adaptation (LoRA), applied here as a supervised fine-tuning method within the Parameter-Efficient Fine-Tuning (PEFT) framework. We’ll apply these approaches to fine-tune Meta Llama 3.2 3B Instruct, a lightweight, instruction-tuned three-billion-parameter generative model, adapting it to an entirely new domain, similar to the trade show example discussed earlier. AI agents will be used to automate the processes of data collection, document understanding, and knowledge extraction.
TL;DR
If you’re interested in previewing the LoRA adapter created in this post for free, head over to Kaggle. A Kaggle Notebook (https://www.kaggle.com/code/garystafford/fine-tuned-llama-3-2-3b-instruct-rslora) demonstrates how to perform inference with the LoRA adapter. Kaggle gives users free access to GPUs in their notebooks: up to 30 hours/week! 😍
DUKE
According to IBM, in real-world applications, a model’s accuracy and capacity alone are insufficient to ensure its usefulness; the model must also operate within constraints related to time, memory, budget, and computational resources. DUKE (Document Understanding and Knowledge Extraction) addresses this challenge through knowledge distillation or model distillation. This technique transfers expertise from a larger, more capable model to a smaller model better suited for real-time deployment.
A model’s accuracy and capacity are not enough to make the model useful; it must also fit within the available budget of time, memory, money, and computational resources. — IBM
DUKE leverages advanced, state-of-the-art generative AI models such as Anthropic Claude 3.7 Sonnet, OpenAI GPT-4.1, xAI Grok 3 Beta, Google Gemini 2.5 Pro, and Amazon Nova Premier. These large “teacher” models perform sophisticated document understanding and complex instruction following. Through this process, they distill and transfer knowledge of a new domain to a smaller “student” model by generating a fine-tuning dataset. In DUKE’s workflow, the teacher model extracts relevant knowledge from the new domain and generates an accurate, well-formatted fine-tuning dataset, as illustrated in Phase 2 of the diagram below. This approach enables the smaller model to achieve strong domain performance while remaining efficient and cost-effective for deployment.
Small Language Models (SLMs)
Meta describes its Llama 3.2 1B and 3B models as lightweight models, commonly known as Small Language Models (SLMs). These models have far fewer parameters than Large Language Models (LLMs), making them more efficient, faster to train, and better suited for constrained environments, such as mobile devices or edge computing. Although SLMs may not achieve the same level of performance as larger models on complex tasks, their smaller size allows for quick, cost-effective deployment where memory and computational power are at a premium. The GPU memory footprint of the Llama 3.2 1B and 3B models can be further reduced through 4- and 8-bit quantization, making them even more practical for real-time and embedded applications.
Meta’s lightweight models are ideal for on-device inference capabilities across mobile and edge devices, including wearables, embedded devices, and microcontrollers, using the PyTorch Edge ecosystem.
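As a rough illustration of the quantization point above, here is a minimal sketch, assuming the Hugging Face transformers, accelerate, and bitsandbytes packages and a CUDA GPU, of loading Llama 3.2 3B Instruct in 4-bit NF4 precision. Exact memory savings will vary.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization configuration (requires bitsandbytes)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "meta-llama/Llama-3.2-3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Rough check of the quantized model's GPU memory footprint
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")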
Open Source Code
The open-source code used for this post can be found on GitHub.
Domain Knowledge Collection
I selected NVIDIA’s Blackwell architecture, announced on March 18, 2024, at NVIDIA’s GPU Technology Conference (GTC), as the target domain for fine-tuning Meta’s Llama 3.2 3B Instruct model. Since Llama 3.2’s pre-training data cuts off in December 2023, the base model lacks any inherent knowledge of NVIDIA’s new Blackwell architecture.
The training corpus includes NVIDIA’s Blackwell Architecture Technical Brief, Blackwell Datasheet (PDFs), and relevant technical blog posts. This material represents the specialized domain knowledge we want the model to acquire; imagine enabling conference attendees at GTC to ask detailed questions about Blackwell’s capabilities.
To efficiently gather training data, I created a Python-based web scraper using natural language code generation with Perplexity and Anthropic Claude 3.7 Sonnet. This approach, also referred to as vibe coding, leverages AI’s ability to translate high-level intent into functional code, bypassing manual scripting while maintaining precise control over data collection parameters.
The web scraper script, included in the GitHub repository, gathers the content of Blackwell-related blog posts and saves them as individual text files. Developing a fully functional script (Phase 1 in the diagram above) required several iterations and optimizations with Perplexity.
import requests
from bs4 import BeautifulSoup
import os
import re

# Headers to mimic a browser visit
BASE_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
    "Upgrade-Insecure-Requests": "1",
}


def ensure_trailing_slash(url):
    if not url.endswith("/"):
        return url + "/"
    return url


def extract_blog_text_and_title(session, url):
    print(f" Attempting to fetch: {url}")
    try:
        response = session.get(url, timeout=15)
        response.raise_for_status()
    except requests.exceptions.HTTPError as e:
        print(f" HTTP Error for {url}: {e.response.status_code} - {e.response.reason}")
        # For debugging, you might want to see what the server returned for errors like 403 (Forbidden)
        # if e.response.status_code == 403:
        #     print(f" Response text snippet for 403: {response.text[:500]}")
        return (
            "Error - Page Not Found"
            if e.response.status_code == 404
            else "Error - HTTP Request Failed"
        ), ""
    except requests.exceptions.RequestException as e:
        print(f" Request Exception for {url}: {e}")
        return "Error - Request Failed", ""

    soup = BeautifulSoup(response.text, "html.parser")

    # Extract title
    title = "Untitled Article"
    # Try more specific H1 first
    h1_title_element = soup.find("h1", class_=["display-3", "entry-title"])
    if h1_title_element:
        title = h1_title_element.get_text(strip=True)
    else:  # Fallback to HTML <title> tag
        title_tag_element = soup.find("title")
        if title_tag_element:
            title = title_tag_element.get_text(strip=True)
    print(f" Extracted title: '{title}'")

    # Locate the main content area
    content_area = None
    selector_tried = ""
    # Primary selector for NVIDIA blogs
    primary_content_selector_tag = "div"
    primary_content_selector_class = "entry-content"
    content_area = soup.find(
        primary_content_selector_tag, class_=primary_content_selector_class
    )
    selector_tried = f"'{primary_content_selector_tag}' with class '{primary_content_selector_class}'"
    if content_area:
        print(f" Found content area using primary selector: {selector_tried}")
    else:
        print(f" Primary content selector {selector_tried} NOT found.")
        # Bail out here so we never call find_all() on a missing content area
        return title, ""

    # Extract text from relevant tags within the content area
    text_parts = []
    content_tags_to_extract = [
        "p", "h1", "h2", "h3", "h4", "h5", "h6", "li", "pre", "code",
    ]
    elements_found_in_content_area = content_area.find_all(content_tags_to_extract)
    if not elements_found_in_content_area:
        print(f" Content area found, BUT NO text elements (e.g., <p>, <h2>, <li>) found within it using tags: {content_tags_to_extract}.")
    else:
        print(f" Found {len(elements_found_in_content_area)} potential text elements within content area. Extracting text...")

    for i, element in enumerate(elements_found_in_content_area):
        element_text = ""
        if element.name in ["pre", "code"]:  # Preserve whitespace for code blocks
            element_text = element.get_text(strip=False)
        else:  # For other text, strip extra whitespace and use space as separator
            element_text = element.get_text(separator=" ", strip=True)
        if element_text:  # Only add if there's actual text extracted
            text_parts.append(element_text)
        # else:
        #     print(f" Element {i+1} ({element.name}) yielded no text.")

    final_body_text = "\n\n".join(text_parts)
    if not final_body_text.strip() and elements_found_in_content_area:
        print(f" WARNING: Text elements were found in content area, but extracted body text is empty. This might indicate the elements found did not contain direct text or were filtered out.")
    elif not final_body_text.strip():
        print(f" Extracted body text is empty.")
    else:
        print(f" Successfully extracted body text (length: {len(final_body_text)} chars).")

    return title, final_body_text


def sanitize_filename(name):
    if not name:
        return "untitled_article"
    name = str(name)
    name = re.sub(r"[^\w\s-]", "", name)
    name = re.sub(r"[-\s]+", "-", name).strip("-_")
    return name[:100] if name else "untitled_article"


if __name__ == "__main__":
    blog_post_urls_input = [
        "https://developer.nvidia.com/blog/nvidia-blackwell-delivers-massive-performance-leaps-in-mlperf-inference-v5-0/",
        # Add any other URLs you are testing
    ]
    blog_post_urls = [ensure_trailing_slash(url) for url in blog_post_urls_input]

    output_dir = "nvidia_blog_posts_from_list"
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
        print(f"Created directory: {output_dir}")

    if not blog_post_urls:
        print("The list of blog post URLs is empty.")
    else:
        print(f"Processing {len(blog_post_urls)} blog posts...")
        with requests.Session() as session:
            session.headers.update(BASE_HEADERS)
            for i, post_url in enumerate(blog_post_urls, start=1):
                print(f"\n--- Processing article {i} of {len(blog_post_urls)} ---")
                post_title, text_content = extract_blog_text_and_title(
                    session, post_url
                )
                if post_title.startswith("Error -"):
                    print(f" Skipping file creation for {post_url} due to error: {post_title}")
                    continue
                if not text_content.strip() and (
                    not post_title or post_title == "Untitled Article"
                ):
                    print(f" No significant title or content extracted for {post_url}. Skipping file creation.")
                    continue

                safe_title = sanitize_filename(post_title)
                if not safe_title or safe_title == "untitled-article":
                    url_path_part = post_url.strip("/").split("/")[-1]
                    safe_title = (
                        sanitize_filename(url_path_part)
                        if url_path_part
                        else f"article_{i}"
                    )

                filename = f"{safe_title}.txt"
                filepath = os.path.join(output_dir, filename)
                try:
                    with open(filepath, "w", encoding="utf-8") as f:
                        f.write(f"Title: {post_title}\n")
                        f.write(f"URL: {post_url}\n\n")
                        f.write(text_content)
                    print(f" Successfully saved: {filepath}")
                except IOError as e:
                    print(f" Error writing file {filepath}: {e}")
                except Exception as e:  # Catch any other unexpected errors during file write
                    print(f" An unexpected error occurred while writing file for {post_url}: {e}")

    print("\n--- Done extracting blog posts ---")
Document Understanding and Knowledge Extraction
Next, we use DUKE (Document Understanding and Knowledge Extraction) to create the fine-tuning dataset. DUKE starts by utilizing a larger state-of-the-art multimodal model for document understanding (Phase 2 in the diagram above). In this case, the model processes the content of Blackwell-related PDF files and blog posts, extracted as individual text files. With a carefully designed prompt, the model generated over 300 high-quality labeled training examples that merge technical specifications from NVIDIA’s documentation with insights from blog content. This method equips the smaller Llama model with extensive domain knowledge while ensuring efficient deployment.
## Objective ##
Based only on the supplied documents, generate at least 200 instruction (user) and answer (assistant) pairs in the JSONL (JSON Lines) format shown in the Examples section below. Follow all instructions in the Instructions section below.
## Examples ##
{"messages": [{"role": "user", "content": "What is the capital of New York State?"}, {"role": "assistant", "content": "The capital of of New York State is Albany."}]}
{"messages": [{"role": "user", "content": "How many hours are in a day?"}, {"role": "assistant", "content": "There are 24 hours in a day."}]}
{"messages": [{"role": "user", "content": "Provide an alphabetical list of planets in our solar system."}, {"role": "assistant", "content": "The planets, in alphabetical order, are Earth, Jupiter, Mars, Mercury, Neptune, Saturn, Uranus, and Venus."}]}
## Instructions ##
1. The instructions (user) and answers (assistant) must be accurate and based completely on only the supplied documents.
2. The user instructions can be formed as a statement (e.g. 'Provide an...', 'Explain the...'), or a question (e.g., 'What are...', 'When is...').
3. The instructions and answers should be written as complete, yet succinct sentences.
4. Answers should be presented directly and factually, without expressing feelings or opinions.
5. Do not refer to the documents or the document's text itself in the instructions or the answers.
6. Important, only return the JSONL output. Do not include additional explanation, preamble, or citations.
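To make Phase 2 concrete, here is a minimal, hedged sketch of how a teacher model could be called with this prompt plus the scraped source text. The Anthropic client usage and model ID are assumptions, the prompt file path is hypothetical, and this is not the exact agentic workflow used in the post.

import pathlib

import anthropic  # assumes the Anthropic Python SDK and an ANTHROPIC_API_KEY environment variable

# Combine the distillation prompt (above) with the scraped domain documents
distillation_prompt = pathlib.Path("prompts/duke_prompt.txt").read_text()  # hypothetical path
documents = "\n\n".join(
    p.read_text(encoding="utf-8")
    for p in pathlib.Path("nvidia_blog_posts_from_list").glob("*.txt")
)

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-3-7-sonnet-latest",  # model ID may differ
    max_tokens=8192,
    messages=[
        {"role": "user", "content": f"{distillation_prompt}\n\n## Documents ##\n{documents}"}
    ],
)

# Each line of the response should be one JSONL training example
with open("data/blackwell_architecture.jsonl", "w", encoding="utf-8") as f:
    f.write(response.content[0].text.strip() + "\n")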
Each prompt (user)/response (assistant) pair (labeled training example) is a separate line in the JSON Lines (JSONL) file, which will become our fine-tuning dataset, also included in the GitHub repository.
{"messages": [{"role": "user", "content": "List the major architectural innovations introduced by the NVIDIA Blackwell architecture."}, {"role": "assistant", "content": "The major architectural innovations of the NVIDIA Blackwell architecture include the second-generation Transformer Engine, faster and wider NVLink interconnects, enhanced Confidential Computing technology, a new Decompression Engine, and a dedicated RAS Engine for reliability, availability, and serviceability."}]}
{"messages": [{"role": "user", "content": "What is the primary purpose of the NVIDIA Blackwell GPU architecture?"}, {"role": "assistant", "content": "The primary purpose of the NVIDIA Blackwell GPU architecture is to handle data center-scale generative AI workflows, enabling real-time performance and scalability for large language models with improved energy efficiency."}]}
{"messages": [{"role": "user", "content": "How many transistors are in the NVIDIA Blackwell GPU?"}, {"role": "assistant", "content": "The NVIDIA Blackwell GPU contains 208 billion transistors."}]}
For example, here is an excerpt from the NVIDIA Blackwell Architecture Technical Brief white paper regarding the new second-generation Transformer Engine:
Blackwell introduces the new second-generation Transformer Engine. The second-generation Transformer Engine uses custom Blackwell Tensor Core technology combined with TensorRT-LLM and Nemo Framework innovations to accelerate inference and training for LLMs and Mixture-of-Experts (MoE) models.
Here is an example of a prompt/response pair distilled by the LLM from the documents regarding NVIDIA’s second-generation Transformer Engine:
{
    "messages": [
        {
            "role": "user",
            "content": "What is the second-generation Transformer Engine in Blackwell used for?"
        },
        {
            "role": "assistant",
            "content": "The second-generation Transformer Engine in Blackwell accelerates inference and training for large language models and Mixture-of-Experts models by using custom Tensor Core technology and supporting advanced quantization formats."
        }
    ]
}
Without fine-tuning, the Meta Llama 3.2 3B model lacks knowledge of NVIDIA’s second-generation Transformer Engine, as shown by its response to the prompt, “Describe NVIDIA’s second-generation Transformer Engine.”:
I couldn’t find any information on NVIDIA’s second-generation Transformer Engine. It’s possible that NVIDIA has not released a second-generation Transformer Engine, or it may be a product that is not well-known.
After fine-tuning, the model now possesses information about NVIDIA’s second-generation Transformer Engine and responds to prompts regarding it:
NVIDIA’s second-generation Transformer Engine combines custom Blackwell Tensor Core technology with NVIDIA TensorRT-LLM and NeMo Framework innovations to accelerate LLM inference and training.
The DUKE-generated fine-tuning dataset, used in this post, is available on Hugging Face.
Jupyter Notebook
The GitHub repository contains a comprehensive Jupyter notebook that handles three critical tasks: preparing the labeled dataset, fine-tuning the base model with rsLoRA, and testing the fine-tuned model.
The notebook supports both local and cloud execution. On local hardware, I successfully ran the workflow on a Windows 11 workstation equipped with an NVIDIA GeForce RTX 4080 SUPER (16 GB VRAM).
The same process worked seamlessly in the cloud using Amazon SageMaker Studio (part of Amazon SageMaker AI) on an Amazon EC2 G6 instance featuring an NVIDIA L4 GPU (24 GB VRAM), demonstrating the solution’s flexibility across different hardware configurations.
Preparing the Labeled Data
Although we have already created the dataset of prompt/response pairs (labeled data) in JSONL format, we must prepare the data specifically for fine-tuning the Llama 3.2 model. The notebook provides methods to convert the generated prompt/response pairs into the chat template format the model expects.
def formatting_prompts_func(examples):
    messages = examples["messages"]
    texts = [
        tokenizer.apply_chat_template(
            message, tokenize=False, add_generation_prompt=False
        )
        for message in messages
    ]
    return {
        "text": texts,
    }


from datasets import load_dataset

dataset_raw = load_dataset(
    "json", data_files="data/blackwell_architecture.jsonl", split="train"
)

dataset_llama_format = dataset_raw.map(
    formatting_prompts_func,
    batched=True,
)
Using the notebook, we start with the prompt/response pairs from the JSONL file:
{
    "messages": [
        {
            "role": "user",
            "content": "What is the maximum number of GPUs supported in a single Blackwell NVLink domain?"
        },
        {
            "role": "assistant",
            "content": "A single Blackwell NVLink domain supports up to 72 GPUs within the NVL72 configuration."
        }
    ]
}
The lightweight Llama 3.2 models share many characteristics with the Llama 3.1 text-only models. Therefore, we can use the regular Llama 3.1 multi-turn conversation format for conversation-style fine-tuning. Llama 3 renders multi-turn conversations as follows:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Cutting Knowledge Date: December 2023
Today Date: 14 May 2025
<|eot_id|><|start_header_id|>user<|end_header_id|>
What is the maximum number of GPUs supported in a single Blackwell NVLink domain?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
A single Blackwell NVLink domain supports up to 72 GPUs within the NVL72 configuration.<|eot_id|>
The final labeled data combines the newly formatted data with the original source data:
{
    "messages": [
        {
            "role": "user",
            "content": "What is the maximum number of GPUs supported in a single Blackwell NVLink domain?"
        },
        {
            "role": "assistant",
            "content": "A single Blackwell NVLink domain supports up to 72 GPUs within the NVL72 configuration."
        }
    ],
    "text": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is the maximum number of GPUs supported in a single Blackwell NVLink domain?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nA single Blackwell NVLink domain supports up to 72 GPUs within the NVL72 configuration.<|eot_id|>"
}
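To confirm the chat template was applied as expected, you can print one formatted record from the mapped dataset. A minimal check:

# Inspect the first formatted training example
sample = dataset_llama_format[0]
print(sample["text"])

# The rendered text should start with the special tokens shown above
assert sample["text"].startswith("<|begin_of_text|>")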
PEFT and LoRA
According to Hugging Face, supervised fine-tuning (SFT) is the most common step in post-training foundation models and also one of the most effective. For this post, we will perform supervised fine-tuning using the Supervised Fine-tuning Trainer API, part of Hugging Face’s Transformer Reinforcement Learning (TRL) full-stack library.
Parameter-Efficient Fine-Tuning (PEFT) is a library for efficiently adapting large pre-trained models to various downstream applications without fine-tuning all of a model’s parameters because it is prohibitively costly. PEFT methods only fine-tune a few (extra) model parameters, significantly decreasing computational and storage costs, while yielding performance comparable to a fully fine-tuned model. This makes it more accessible to train and store LLMs on consumer hardware.
We will use Low-Rank Adaptation (LoRA), specifically Rank-Stabilized LoRA (rsLoRA), to fine-tune the base Llama model. Adapter-based methods, like LoRA, add extra trainable parameters after a frozen pre-trained model’s attention and fully-connected layers to reduce memory usage and speed up training. According to Hugging Face, LoRA is a PEFT method that decomposes a large weight matrix into two smaller low-rank matrices in the attention layers, drastically reducing the number of parameters that need to be fine-tuned. It is one of the most popular PEFT methods and a good starting point. Although initially developed for LLMs, LoRA has also become a tremendously popular training method for diffusion models because of its efficiency and effectiveness.
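To see what rsLoRA changes, here is a minimal sketch of the LoRA forward pass in plain PyTorch. Vanilla LoRA scales the low-rank update by alpha/r, while rsLoRA scales it by alpha/sqrt(r), which stabilizes training at higher ranks. The variable names and dimensions are illustrative, not the PEFT library internals.

import math
import torch

d, k = 3072, 3072   # example layer dimensions
r, alpha = 16, 32   # rank and alpha, matching the LoraConfig used later

W = torch.randn(d, k)          # frozen pre-trained weight
A = torch.randn(r, k) * 0.01   # trainable low-rank matrix A
B = torch.zeros(d, r)          # trainable low-rank matrix B (initialized to zero)

x = torch.randn(1, k)          # a single input activation

scaling_lora = alpha / r                # standard LoRA scaling
scaling_rslora = alpha / math.sqrt(r)   # rank-stabilized LoRA scaling

# Frozen path plus the scaled low-rank update
h_rslora = x @ W.T + (x @ A.T @ B.T) * scaling_rslora
print(h_rslora.shape)  # torch.Size([1, 3072])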
We will generate unmerged LoRA adapter files, which are LoRA adapters that have not been combined with the base model. The LoRA weights are separate from the original model’s weights, requiring them to be loaded and used separately during inference. They are much smaller than the base model’s files.
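For inference, the unmerged adapter is attached to a freshly loaded base model at runtime. A minimal sketch using the PEFT library; the adapter repo ID is a placeholder:

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_id = "meta-llama/Llama-3.2-3B-Instruct"
adapter_id = "<your_hf_username>/<your_lora_adapter_repo>"  # placeholder for the adapter on Hugging Face

base = AutoModelForCausalLM.from_pretrained(
    base_model_id, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# Attach the unmerged LoRA adapter; its weights stay separate from the base model
model = PeftModel.from_pretrained(base, adapter_id)

# Optionally merge the adapter into the base weights for faster inference
# model = model.merge_and_unload()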
Fine-tuning
An epoch in machine learning represents one complete pass through the entire training dataset. I began fine-tuning with three epochs. Generally, going beyond three epochs isn’t ideal unless you are aiming for a model with reduced hallucinations and less creativity. Depending on your use case, dataset, and base model, it’s a good idea to experiment with different hyperparameters, including the number of epochs.
from huggingface_hub import snapshot_download

HF_TOKEN = "<your_hf_token_goes_here>"

# download the base model
base_model = "meta-llama/Llama-3.2-3B-Instruct"
snapshot_download(
    repo_id=base_model, repo_type="model", token=HF_TOKEN
)

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    return_dict=True,
    low_cpu_mem_usage=True,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
).to(device)

tokenizer = AutoTokenizer.from_pretrained(base_model)

if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id
if model.config.pad_token_id is None:
    model.config.pad_token_id = model.config.eos_token_id

from trl import SFTConfig, SFTTrainer
from peft import LoraConfig

training_args = SFTConfig(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_steps=5,
    num_train_epochs=3,
    # max_steps=60,
    learning_rate=2e-5,
    bf16=True,
    logging_steps=1,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="linear",
    seed=3407,
    output_dir="outputs",
    hub_model_id=hf_base_meta_lora,  # Hugging Face repo ID for the LoRA adapter, set earlier in the notebook
    label_names=["labels"],
)

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules="all-linear",
    bias="none",
    use_rslora=True,
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset_llama_format,
    args=training_args,
    peft_config=peft_config,
)
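With the trainer configured, launching the run is a single call; the returned TrainOutput matches the statistics shown below:

# Kick off supervised fine-tuning; returns a TrainOutput with runtime metrics
trainer_stats = trainer.train()
print(trainer_stats)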
The total training time for three epochs (114 steps) was 65 seconds for 307 labeled prompt/response pairs, using an NVIDIA GeForce RTX 4080 SUPER with 16 GB of GPU memory (VRAM). Of the model’s 3.2B parameters, 24.3M, or 0.75%, were trainable using rsLoRA.
========== FINE-TUNING STATS ==========
GPU = NVIDIA GeForce RTX 4080 SUPER.
Max memory = 15.992 GB.
6.811 GB of memory reserved.
Trainable params: 24,313,856
All parameters: 3,237,063,680
Trainable %: 0.75
TrainOutput(
    global_step=114,
    training_loss=1.7906429725780821,
    metrics={
        'train_runtime': 65.4552,
        'train_samples_per_second': 14.071,
        'train_steps_per_second': 1.742,
        'total_flos': 1273168315834368.0,
        'train_loss': 1.7906429725780821
    }
)
Training steps = 114
Training time (seconds): 65.4552
Training time (minutes): 1.09
Peak reserved memory = 8.004 GB
Peak reserved memory for training = 1.193 GB
Peak reserved memory % of max memory = 50.05 %
Peak reserved memory for training % of max memory = 7.46 %
The SFTTrainer class’s train() method outputs the training loss at each step:
The chart’s x-axis represents the training steps (1 to 114), while the y-axis shows the training loss at each step. The plot demonstrates a clear downward trend in training loss, indicating effective learning and convergence as training progresses. Some fluctuations are typical in model optimization.
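A chart like the one described can be produced from the trainer’s logged history. A minimal sketch, assuming matplotlib is installed:

import matplotlib.pyplot as plt

# Collect per-step training loss from the trainer's log history
history = [entry for entry in trainer.state.log_history if "loss" in entry]
steps = [entry["step"] for entry in history]
losses = [entry["loss"] for entry in history]

plt.plot(steps, losses)
plt.xlabel("Training Step")
plt.ylabel("Training Loss")
plt.title("Training Loss per Step")
plt.show()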
TensorBoard
If you use Amazon SageMaker Studio to perform your fine-tuning, you will be able to view the fine-tuning results in TensorBoard. The data from each training run will be uploaded, along with the LoRA adapter files, to Hugging Face, where it can be easily viewed. TensorBoard provides the visualization and tooling needed for machine learning experimentation.
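A hedged sketch of how TensorBoard logging can be enabled through the training arguments; these are standard transformers TrainingArguments parameters, and the exact setup in the notebook may differ:

from trl import SFTConfig

training_args = SFTConfig(
    # ...same hyperparameters as above...
    output_dir="outputs",
    logging_dir="outputs/runs",   # where TensorBoard event files are written
    report_to=["tensorboard"],    # enable the TensorBoard callback
    logging_steps=1,
)

# In a notebook, the results can then be viewed inline:
# %load_ext tensorboard
# %tensorboard --logdir outputs/runs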
Testing the Fine-tuned Model
To evaluate the fine-tuned model, I designed five test prompts focused on the NVIDIA Blackwell architecture, incorporating details provided in the source documents and the fine-tuning dataset.
test_prompts = [
    [{
        "role": "user",
        "content": "Describe the NVIDIA Blackwell architecture.",
    }],
    [{
        "role": "user",
        "content": "What challenges is the NVIDIA Blackwell architecture designed to address?",
    }],
    [{
        "role": "user",
        "content": "Please describe NVIDIA's second-generation Transformer Engine.",
    }],
    [{
        "role": "user",
        "content": "How many Blackwell GPUs does the NVIDIA GB200 NVL72 have?",
    }],
    [{
        "role": "user",
        "content": "How does the NVIDIA Blackwell architecture support generative AI?",
    }],
]
We cycle through each of the five prompts, passing them to the model for real-time inference.
%%time

for i in range(0, 5):
    inputs = tokenizer.apply_chat_template(
        test_prompts[i],
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(device)

    output = model.generate(
        input_ids=inputs,
        max_new_tokens=128,
        use_cache=True,
        temperature=0.1,
        pad_token_id=tokenizer.pad_token_id,
    ).to(device)

    output = tokenizer.decode(output[0], skip_special_tokens=True)
    output = output.split('assistant\n\n')[1].strip()
    print(f"Response {i+1}: {output}\n")
I compared the base model to the fine-tuned model (base model with an unmerged LoRA adapter). Without fine-tuning, the base model failed to respond accurately to any of the five test prompts. Its responses were also quite verbose, likely contributing to its slow total response time of 15.2 seconds for the five prompts.
Response 1: I couldn't find any information on the "NVIDIA Blackwell architecture." It's possible that it's a fictional or non-existent architecture, or it may be a very recent or obscure development that I'm not aware of.
However, I can suggest some possible alternatives:
1. NVIDIA Ampere architecture: This is a real architecture developed by NVIDIA, which was released in 2020. It's a high-performance computing architecture that features a new GPU design, improved power efficiency, and enhanced features such as ray tracing and AI acceleration.
2. NVIDIA Lovelace architecture: This is another real architecture developed by NVIDIA, which was released in 2022. It's a successor to the Ampere architecture and features improved performance, power efficiency, and new features such as enhanced ray tracing and AI acceleration.
3. NVIDIA Ada Lovelace architecture: This is a real architecture developed by NVIDIA, which was released in 2022. It's a high-performance computing architecture that features a new GPU design, improved power efficiency, and enhanced features such as ray tracing and AI acceleration.
If you could provide more context or information about the "NVIDIA Blackwell architecture," I may be able to help you better.
Response 2: I couldn't find any information on the NVIDIA Blackwell architecture. It's possible that it's a new or upcoming architecture, or it may not be a well-known or widely discussed topic.
However, I can suggest some possible sources where you may be able to find more information on the NVIDIA Blackwell architecture:
1. NVIDIA's official website or press releases: You can check NVIDIA's official website or press releases for any information on new or upcoming architectures.
2. Technical documentation: NVIDIA may have published technical documentation or whitepapers on the Blackwell architecture, which can be found on their website or through online archives.
3. Industry publications: Industry publications such as Tom's Hardware, AnandTech, or other tech news sites may have reported on the Blackwell architecture or have information on upcoming architectures.
4. Research papers: Researchers and academics may have published papers on the Blackwell architecture or related topics, which can be found through online academic databases such as arXiv or Google Scholar.
If you have any more information on the Blackwell architecture, such as its intended use or release date, I may be able to help you better.
Response 3: I am unable to verify information about NVIDIA's second-generation Transformer Engine.
Response 4: I couldn't find any information on the NVIDIA GB200 NVL72 GPU. It's possible that it's a fictional or non-existent GPU. If you could provide more context or clarify the name of the GPU, I'll do my best to provide the correct information.
Response 5: I couldn't find any information on the NVIDIA Blackwell architecture. It's possible that it's a new or upcoming architecture, or it may not exist at all.
However, I can tell you about some of the architectures that NVIDIA has developed to support generative AI:
1. Tensor Core: NVIDIA's Tensor Core is a specialized processor designed for deep learning and AI workloads. It's used in NVIDIA's GPUs, including the Tesla V100 and A100, to accelerate matrix operations and other compute-intensive tasks.
2. Deep Learning Super Sampling (DLSS): DLSS is a technology developed by NVIDIA to accelerate deep learning-based image processing. It uses a combination of tensor cores and other hardware accelerators to improve the performance of deep learning models.
3. NVIDIA Ampere Architecture: The Ampere architecture is a new generation of NVIDIA GPUs that includes a number of features to support generative AI, such as increased tensor core count, improved memory bandwidth, and enhanced support for deep learning frameworks like TensorFlow and PyTorch.
4. NVIDIA Hopper Architecture: The Hopper architecture is a new generation of NVIDIA GPUs that includes a number of features to support generative AI, such as increased tensor core count, improved memory bandwidth, and enhanced support
CPU times: total: 15 s
Wall time: 15.2 s
On the other hand, the fine-tuned model correctly responded to all five prompts. It also responded in complete yet succinct sentences, as it was trained to do. The model’s small size and the shorter average response length contributed to a significantly faster total response time of 4.86 seconds, more than 3x faster than the base model.
Response 1: NVIDIA Blackwell is a next-generation datacenter GPU architecture designed for high-performance computing, AI, and HPC.
Response 2: The NVIDIA Blackwell architecture is designed to address the challenges of large language models, high-performance computing, and AI inference.
Response 3: NVIDIA's second-generation Transformer Engine is designed to accelerate large language models and other AI workloads.
Response 4: The NVIDIA GB200 NVL72 has 72 Blackwell GPUs.
Response 5: The NVIDIA Blackwell architecture supports generative AI through its advanced tensor cores, which accelerate complex neural network computations.
CPU times: total: 3.95 s
Wall time: 4.86 s
We use Perplexity once more to evaluate and verify the accuracy of the fine-tuned model’s responses: “Confirm the accuracy of the following statements…”.
The following analysis confirms the accuracy of each statement based on NVIDIA's official documentation and third-party technical assessments of the Blackwell architecture:
---
### **Response 1**:
**Accurate**. The NVIDIA Blackwell architecture is explicitly designed for high-performance computing (HPC), AI training, and inference. It powers AI factories and data centers, with advancements like 208 billion transistors and a 10 TB/s chip-to-chip interconnect to handle demanding workloads.
---
### **Response 2**:
**Accurate**. The architecture directly addresses challenges in trillion-parameter LLMs and HPC through innovations like the fifth-generation NVLink (scaling to 576 GPUs) and the second-generation Transformer Engine. These technologies enable real-time inference and training for multi-trillion-parameter models while reducing costs and energy consumption.
---
### **Response 3**:
**Accurate**. The second-generation Transformer Engine uses micro-tensor scaling and FP4 precision to accelerate LLMs and Mixture-of-Experts (MoE) models. Integrated with TensorRT-LLM and NeMo frameworks, it doubles inference performance and supports larger model sizes.
---
### **Response 4**:
**Accurate**. The GB200 NVL72 configuration combines 36 Grace CPUs and 72 Blackwell GPUs in a liquid-cooled rack-scale design. This creates a unified 72-GPU NVLink domain capable of 30x faster LLM inference than previous generations.
---
### **Response 5**:
**Accurate**. Blackwell’s Tensor Cores include FP4, FP6, FP8, and BF16 precisions, optimized for generative AI. These cores, combined with the Transformer Engine, accelerate neural network computations for LLMs, reducing inference costs by up to 25x compared to earlier architectures.
---
**Summary**: All statements are accurate, supported by NVIDIA’s technical specifications and performance benchmarks. Blackwell’s advancements in tensor cores, NVLink scaling, and Transformer Engine optimizations position it as a transformative architecture for AI and HPC workloads.
This doesn’t mean the fine-tuned model is flawless. I encountered instances where the responses weren’t entirely accurate. The reduced accuracy is a trade-off of employing a lightweight model and working within the constraints of the fine-tuning dataset.
Conclusion
By integrating DUKE-based model distillation with LoRA fine-tuning, we establish a streamlined methodology for deploying specialized small language models such as Meta Llama 3.2 3B Instruct. This hybrid approach enables resource-efficient models to provide accurate, real-time, domain-specific insights while maintaining low computational overhead. The technique effectively addresses latency-sensitive, high-throughput deployment scenarios by leveraging large teacher models’ semantic understanding to curate targeted domain knowledge. The result is a reproducible framework for developing cost-optimized AI solutions capable of operating in constrained environments across diverse industrial applications.
This blog represents my viewpoints and not those of my employer, Amazon Web Services (AWS). All product names, images, logos, and brands are the property of their respective owners.