Building and working
with small language models
Julien Simon, Chief Evangelist
julien@arcee.ai
linkedin.com/in/juliensimon
youtube.com/juliensimonfr
Gen AI from the trenches
• LLMs are trained on public internet data, making them a mile wide and an inch deep.
• Knowledge may be lacking, imprecise, or infringing on someone’s IP
• LLMs can't be fine-tuned efficiently.
« How well do commercial fine-tuning APIs infuse knowledge into LLMs? » https://arxiv.org/abs/2411.05059 (11/2024)
• Prompt « engineering » and in-context « learning » are anything but that.
« Is In-Context Learning Sufficient for Instruction Following in LLMs? » https://arxiv.org/html/2405.19874v2 (10/2024)
• Huge contexts don’t work well.
« Lost in the Middle: How Language Models Use Long Contexts » https://arxiv.org/abs/2307.03172 (06/2023)
• Retrieval-Augmented Generation is useful for fresh data, not for domain adaptation.
• Many business processes require not one but several model steps that collaborate with
external IT tools (such as documents, APIs, SQL, etc.).
• Most of the time, LLMs just make everything more difficult: latency, cost, scaling, etc.
• Small language models (SLMs): a new hope!
Building a world-class 32B reasoning model
https://www.arcee.ai/blog/how-arcee-ai-helped-madeline-co-build-a-world-class-reasoning-model-from-first-principles
A typical SLM workflow
Base model → Continuous pre-training (CPT) on an unlabeled domain dataset → Domain-adapted model
Domain-adapted model → Instruction fine-tuning (IFT) on a Q&A dataset → Instruction-tuned model
Instruction-tuned model → Alignment on a preference dataset → Aligned model
Alternative: Instruction pre-training on the unlabeled domain dataset combined with a Q&A dataset
« Fine-tuned Language Models Are Zero-Shot Learners » https://arxiv.org/abs/2109.01652 (09/2021)
« Efficient Continual Pre-training for Building Domain Specific Large Language Models » https://arxiv.org/abs/2311.08545 (11/2023)
« Instruction Pre-Training: Language Models are Supervised Multitask Learners » https://arxiv.org/abs/2406.14491v1 (06/2024)
« How Do Large Language Models Acquire Factual Knowledge During Pretraining? » https://arxiv.org/abs/2406.11813v1 (06/2024)
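To make the CPT step concrete, here is a minimal sketch using the Hugging Face transformers Trainer with a plain next-token-prediction objective. The model name, the domain_corpus.jsonl file, and the hyperparameters are illustrative assumptions, not a production recipe.

# Minimal CPT sketch (assumptions: example base model, hypothetical domain_corpus.jsonl)
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "meta-llama/Llama-3.1-8B"  # example base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Unlabeled domain corpus: one raw text document per row (hypothetical local file)
raw = load_dataset("json", data_files="domain_corpus.jsonl")["train"]
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cpt-out", per_device_train_batch_size=1,
                           gradient_accumulation_steps=16, num_train_epochs=1,
                           learning_rate=1e-5, bf16=True, logging_steps=50),
    train_dataset=tokenized,
    # mlm=False selects the causal LM objective, i.e. plain next-token pre-training
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()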
Post-training challenges
• Building datasets is hard work
• Continuous Pre-Training requires a large corpus (at least 1 billion tokens)
• Instruction Fine-Tuning and Alignment require high-quality, diverse Q&A pairs
• Training / fine-tuning models: accuracy or cost, pick one
• Full fine-tuning (FFT): training the full model in original precision (say, BF16)
• Compute-heavy and expensive… assuming you can procure the amount of compute you need!
• Parameter Efficient Fine Tuning (PEFT), e.g. LoRA or QLoRA
• More memory-efficient than FFT, enabling smaller GPUs and shorter training times
• Effective for Instruction Fine-Tuning (IFT) and alignment
• Significant accuracy degradation for CPT
• https://blog.arcee.ai/why-methods-like-qlora-fall-short-in-domain-knowledge-injection-2/
• « LoRA vs. Full Fine-Tuning: An Illusion of Equivalence » https://arxiv.org/abs/2410.21228 (10/2024)
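For the PEFT path, here is a minimal LoRA setup sketch with the Hugging Face peft library; the base model, rank, and target modules are illustrative assumptions rather than a recommended recipe.

# Minimal LoRA sketch with the peft library (illustrative hyperparameters)
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # example base model

lora_config = LoraConfig(
    r=16,                     # adapter rank
    lora_alpha=32,            # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the full model
# The wrapped model can then be passed to a standard IFT / alignment training loop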
Technique #1 - Better PEFT with Spectrum
https://arxiv.org/abs/2406.06623 (06/2024) + https://github.com/cognitivecomputations/spectrum
• Intuition: not all layers contribute equally to the output.
• Some layers have a higher signal-to-noise ratio (SNR) than others.
• Spectrum identifies these high-SNR layers (linear algebra FTW).
• You can then run full fine-tuning on these layers and keep the other layers frozen.
python spectrum.py --model-name <insert local or HF repo here>
--top-percent <top % of SNR ratios to target>
unfrozen_parameters:
- ^lm_head.weight$
- ^model.embed_tokens.weight$
# input_layernorm layers
- model.layers.0.input_layernorm
- model.layers.1.input_layernorm
- model.layers.2.input_layernorm
- model.layers.3.input_layernorm
- model.layers.4.input_layernorm
- model.layers.5.input_layernorm
# lm_head layers
# mlp.down_proj layers
- model.layers.21.mlp.down_proj
- model.layers.20.mlp.down_proj
- model.layers.22.mlp.down_proj
- model.layers.19.mlp.down_proj
- model.layers.23.mlp.down_proj
- model.layers.24.mlp.down_proj
. . .
"model.layers.10.self_attn.o_proj": {
"snr": 0.25031203031539917,
"type": "self_attn.o_proj"
},
"model.layers.11.self_attn.o_proj": {
"snr": 0.2547757625579834,
"type": "self_attn.o_proj"
},
"model.layers.12.self_attn.o_proj": {
"snr": 0.2616233825683594,
"type": "self_attn.o_proj"
},
"model.layers.13.self_attn.o_proj": {
"snr": 0.2736438810825348,
"type": "self_attn.o_proj"
},
. . .
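A minimal sketch of how a Spectrum unfrozen_parameters list like the one above could be applied before full fine-tuning: freeze every parameter, then unfreeze the ones matching the selected patterns. The snr_results.yaml file name is a placeholder for whatever file holds the tool's output.

# Sketch: freeze everything except the Spectrum-selected high-SNR layers
import re
import yaml
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # example model

with open("snr_results.yaml") as f:  # placeholder name for the Spectrum output file
    patterns = yaml.safe_load(f)["unfrozen_parameters"]

for name, param in model.named_parameters():
    # Trainable only if the parameter name matches one of the Spectrum patterns
    param.requires_grad = any(re.search(p, name) for p in patterns)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")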
Technique #2 - Model distillation with DistillKit
https://blog.arcee.ai/distillkit-v0-1-by-arcee-ai/ (08/2024) + https://github.com/arcee-ai/DistillKit
Intuition: if we have excellent large models available, can we teach smaller models to mimic their output, at a fraction of the time and cost it would take to pretrain them?
Workflow: run inference with the teacher model over a labeled dataset to collect the teacher logits, then train the student model (SFT) on the same labeled dataset against those logits, producing the distilled model.
Loss function: minimize the difference between the teacher and student token distributions at each output position (Kullback-Leibler divergence).
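Below is a minimal sketch of the loss just described: a temperature-scaled Kullback-Leibler divergence between the teacher and student token distributions. This is the generic logit-distillation formulation, not necessarily DistillKit's exact implementation.

# Sketch: KL-divergence distillation loss over logits of shape (batch, seq_len, vocab)
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    # Soften both distributions, then compare them position by position
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    return kl * temperature ** 2  # standard scaling to keep gradient magnitudes comparable

# In practice this term is usually mixed with the regular cross-entropy (SFT) loss,
# e.g. loss = alpha * distillation_loss(...) + (1 - alpha) * ce_loss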
10B is the new 72B (01/2025)
https://www.arcee.ai/blog/virtuoso-lite-virtuoso-medium-v2-distilling-deepseek-v3-into-10b-32b-small-language-models-slms
https://huggingface.co/arcee-ai/Virtuoso-Lite + https://huggingface.co/arcee-ai/Virtuoso-Medium-v2
Technique #3 - Model merging with MergeKit
https://arxiv.org/abs/2403.13257 (03/2024) + https://github.com/arcee-ai/mergekit
Intuition: if we have plenty of open-source models available, can we build a
model by merging several models that already have the qualities we need?
• Combine multiple task-specific models into a single multitask
model without any additional training
• Not an ensembling technique: only one model,
no inference penalty
• No training involved, only lightweight compute required
(Figure: Domain A, Domain B, and Domain C embeddings merged into a single model)
« We find that model merging is not merely a process of
aggregation, but a transformative method that can drive substantial
advancements in model capabilities characterized by highly
nonlinear interactions between model parameters, resulting in new
functionalities that neither parent model could achieve alone,
leading to improved performance in domain-specific assessments. »
Fine-tuning large language models for domain adaptation:
exploration of training strategies, scaling, model merging and
synergistic capabilities - MIT report in Nature
Merging models with MergeKit
HuggingFace Space: https://huggingface.co/spaces/arcee-ai/mergekit-gui
Enterprise edition + support: https://www.arcee.ai/product/mergekit
$ git clone https://github.com/arcee-ai/mergekit.git
$ cd mergekit
$ pip install -e .
Merging three 8B models with TIES
1 minute on my Mac :)
models:
- model: defog/llama-3-sqlcoder-8b
parameters:
density: 1.0
weight: 0.2
- model: MathGenie/MathCoder2-Llama-3-8B
parameters:
density: 1.0
weight: 0.6
- model: ajibawa-2023/Code-Llama-3-8B
parameters:
density: 1.0
weight: 0.2
merge_method: ties
base_model: meta-llama/Llama-3.1-8B
dtype: float16
GSM8K scores (lm_eval, 12/16/2024): 72.48%, 65.35%, 64.37%, 73.62%
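Assuming the TIES configuration above is saved as ties-config.yml (a file name chosen here), the merge itself runs with mergekit's command-line entry point:
$ mergekit-yaml ties-config.yml ./merged-model --cuda
The output directory then contains a regular Hugging Face checkpoint that can be evaluated with lm_eval like any other model.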
Arcee Fusion
models:
- model: defog/llama-3-sqlcoder-8b
- model: meta-llama/Llama-3.1-8B
merge_method: arcee_fusion
base_model: meta-llama/Llama-3.1-8B
dtype: float16
============================
WEIGHT DIFFERENCE ANALYSIS
Model 1: meta-llama/Llama-3.1-8B
Model 2: defog/llama-3-sqlcoder-8b
============================
SUMMARY STATISTICS:
Total layers analyzed: 291
Average difference: 98.92%
Median difference: 99.32%
Min difference: 81.91%
Max difference: 100.00%
KL DIVERGENCE STATISTICS:
Average KL divergence: 0.145548
Median KL divergence: 0.012739
Min KL divergence: 0.000142
Max KL divergence: 4.010872
===========================
WEIGHT DIFFERENCE ANALYSIS
Model 1: meta-llama/Llama-3.1-8B
Model 2: ./fusion-demo
===========================
SUMMARY STATISTICS:
Total layers analyzed: 291
Average difference: 10.88%
Median difference: 10.62%
Min difference: 0.00%
Max difference: 99.98%
KL DIVERGENCE STATISTICS:
Average KL divergence: 0.010137
Median KL divergence: 0.001984
Min KL divergence: 0.000000
Max KL divergence: 0.674373
Selected Papers on Model Merging and MergeKit
• Pre-training
• Model Merging in Pre-training of Large Language Models - ByteDance
• Post-training
• Enhancing Jailbreak Resistance in Large Language Models Using Model Merge - NTT
• Unlocking the Potential of Model Merging for Low-Resource Languages
• Model Merging for Knowledge Editing
• Training-Free Mitigation of Language Reasoning Degradation After Multimodal Instruction Tuning
• Use cases
• Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging
• PatientDx: Merging Large Language Models for Protecting Data-Privacy in Healthcare
• AstroMLab 3: Achieving GPT-4o Level Performance in Astronomy with a Specialized 8B-Parameter Large
Language Model
A modern SLM workflow
Libraries available at https://github.com/arcee-ai/
Base model → Continuous pre-training (CPT, with Spectrum) on an unlabeled domain dataset → Domain-adapted model → MergeKit knowledge injection
→ Instruction fine-tuning (IFT, with Spectrum/LoRA, EvolKit, DistillKit) on a Q&A dataset → Instruction-tuned model → MergeKit Q&A fine-tuning
→ Alignment (DPO / GRPO) on a preference dataset → Aligned model → MergeKit alignment
« Arcee's MergeKit: A Toolkit for Merging Large Language Models » https://arxiv.org/abs/2403.13257 (03/2024)
« Spectrum: Targeted Training on Signal to Noise Ratio » https://arxiv.org/abs/2406.06623 (06/2024)
« Merging in a Bottle: Differentiable Adaptive Merging (DAM) and the Path from Averaging to Automation » https://arxiv.org/abs/2410.08371 (10/2024)
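For reference, here is a minimal sketch of the DPO objective used in the alignment step, written directly from the published formulation (Rafailov et al., 2023); libraries such as TRL wrap this in a full trainer.

# Sketch: DPO loss given summed log-probabilities of the chosen and rejected answers
# under the policy being trained and under a frozen reference model
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit reward margins: how much more the policy prefers "chosen" over "rejected",
    # relative to the reference model; beta controls the strength of the KL anchor
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()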
Arcee AI post-trained models
https://huggingface.co/arcee-ai
Best-in-class models based on open-weights architectures
• Qwen2 1.5B 🥇 Best 1.5B model
• Llama 3.1 8B 🥇 Best 8B model
• Qwen2.5 14B 🥇 Best 14B model
• Llama 3.1 70B 🥇 Best 70B model
• Qwen2 72B 🥇 Best Arabic model
Hugging Face OpenLLM Leaderboard benchmarks at the time of model release
https://www.together.ai/models/arcee-ai-arcee-spotlight
https://gitlab.com/juliensimon/arcee-demos/-/blob/main/together-ai/spotlight-together.ipynb
https://www.together.ai/models/afm-4-5b-preview
https://www.arcee.ai/blog/announcing-the-arcee-foundation-model-family
https://www.arcee.ai/blog/deep-dive-afm-4-5b-the-first-arcee-foundational-model
https://www.arcee.ai/blog/extending-afm-4-5b-to-64k-context-length
Arcee Foundation Models (AFM)
AFM-4.5B-Preview
https://api.together.ai/models/arcee-ai/AFM-4.5B-Preview
AFM-4.5B-Preview vs. Qwen-3-4B
8/10. Tie on Industrials, loss on Communication Services.
200 questions generated by Claude Sonnet 3.7
20 questions for each one of the top 10 industries in the S&P 500
Judge: DeepSeek-R1 (670B)
AFM-4.5B-Preview vs. Google Gemma-3n-E4B-it
8/10, tie on Healthcare, loss on IT
200 questions generated by Claude Sonnet 3.7
20 questions for each one of the top 10 industries in the S&P 500
Judge: DeepSeek-R1 (670B)
AFM-4.5B-Preview vs. Llama-3.2-8B
10/10 😃
200 questions generated by Claude Sonnet 3.7
20 questions for each one of the top 10 industries in the S&P 500
Judge: DeepSeek-R1 (670B)
AFM-4.5B-Preview vs. Mixtral-8x7B-Instruct
Almost tied (4/10) with 8% of Mixtral’s size
200 questions generated by Claude Sonnet 3.7
20 questions for each one of the top 10 industries in the S&P 500
Judge: DeepSeek-R1 (670B)
Julien Simon, Chief Evangelist
julien@arcee.ai
linkedin.com/in/juliensimon
youtube.com/juliensimonfr
https://huggingface.co/arcee-ai
Models on OpenRouter and Together AI
Chat with AFM
AFM blog post