Building and working
with small language models
Julien Simon, Chief Evangelist
julien@arcee.ai
linkedin.com/in/juliensimon
youtube.com/juliensimonfr
Gen AI from the trenches
• LLMs are trained on public internet data, making them a mile wide and an inch deep.
• Knowledge may be lacking, imprecise, or infringing on someone’s IP
• LLMs can't be fine-tuned efficiently.
« How well do commercial fine-tuning APIs infuse knowledge into LLMs? » https://arxiv.org/abs/2411.05059 (11/2024)
• Prompt « engineering » and in-context « learning » are anything but that.
« Is In-Context Learning Sufficient for Instruction Following in LLMs? » https://arxiv.org/html/2405.19874v2 (10/2024)
• Huge contexts don’t work well.
« Lost in the Middle: How Language Models Use Long Contexts » https://arxiv.org/abs/2307.03172 (06/2023)
• Retrieval-Augmented Generation is useful for fresh data, not for domain adaptation.
• Many business processes require not one but several model steps that collaborate with
external IT tools (such as documents, APIs, SQL, etc.).
• Most of the time, LLMs just make everything more difficult: latency, cost, scaling, etc.
• Small language models (SLMs): a new hope!
Building a world-class 32B reasoning model
https://www.arcee.ai/blog/how-arcee-ai-helped-madeline-co-build-a-world-class-reasoning-model-from-first-principles
A typical SLM workflow
Base model → Continuous pre-training (CPT) on an unlabeled domain dataset → Domain-adapted model
Domain-adapted model → Instruction fine-tuning (IFT) on a Q&A dataset → Instruction-tuned model
Instruction-tuned model → Alignment on a preference dataset → Aligned model
Alternative: Instruction pre-training on the unlabeled domain dataset combined with a Q&A dataset
« Fine-tuned Language Models Are Zero-Shot Learners » https://arxiv.org/abs/2109.01652 (09/2021)
« Efficient Continual Pre-training for Building Domain Specific Large Language Models » https://arxiv.org/abs/2311.08545 (11/2023)
« Instruction Pre-Training: Language Models are Supervised Multitask Learners » https://arxiv.org/abs/2406.14491v1 (06/2024)
« How Do Large Language Models Acquire Factual Knowledge During Pretraining? » https://arxiv.org/abs/2406.11813v1 (06/2024)
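To make the CPT step concrete, here is a minimal sketch using the Hugging Face transformers Trainer with a plain next-token-prediction objective. The model name, the domain_corpus.jsonl file, and the hyperparameters are illustrative assumptions, not a production recipe.

# Minimal CPT sketch (assumptions: example base model, hypothetical domain_corpus.jsonl)
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "meta-llama/Llama-3.1-8B"  # example base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Unlabeled domain corpus: one raw text document per row (hypothetical local file)
raw = load_dataset("json", data_files="domain_corpus.jsonl")["train"]
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cpt-out", per_device_train_batch_size=1,
                           gradient_accumulation_steps=16, num_train_epochs=1,
                           learning_rate=1e-5, bf16=True, logging_steps=50),
    train_dataset=tokenized,
    # mlm=False selects the causal LM objective, i.e. plain next-token pre-training
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()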
Post-training challenges
• Building datasets is hard work
• Continuous Pre-Training requires a large corpus (at least 1 billion tokens)
• Instruction Fine-Tuning and Alignment require high-quality, diverse Q&A pairs
• Training / fine-tuning models: accuracy or cost, pick one
• Full fine-tuning (FFT): training the full model in original precision (say, BF16)
• Compute-heavy and expensive… assuming you can procure the amount of compute you need!
• Parameter Efficient Fine Tuning (PEFT), e.g. LoRA or QLoRA
• More memory-efficient than FFT, enabling smaller GPUs and shorter training times
• Effective for Instruction Fine-Tuning (IFT) and alignment
• Significant accuracy degradation for CPT
• https://blog.arcee.ai/why-methods-like-qlora-fall-short-in-domain-knowledge-injection-2/
• « LoRA vs. Full Fine-Tuning: An Illusion of Equivalence » https://arxiv.org/abs/2410.21228 (10/2024)
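For the PEFT path, here is a minimal LoRA setup sketch with the Hugging Face peft library; the base model, rank, and target modules are illustrative assumptions rather than a recommended recipe.

# Minimal LoRA sketch with the peft library (illustrative hyperparameters)
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # example base model

lora_config = LoraConfig(
    r=16,                     # adapter rank
    lora_alpha=32,            # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the full model
# The wrapped model can then be passed to a standard IFT / alignment training loop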
Technique #1 - Better PEFT with Spectrum
https://arxiv.org/abs/2406.06623 (06/2024) + https://github.com/cognitivecomputations/spectrum
• Intuition: not all layers contribute equally to the output.
• Some layers have a higher signal-to-noise ratio (SNR) than others.
• Spectrum identifies these high-SNR layers (linear algebra FTW).
• You can then run full fine-tuning on these layers and keep the other layers frozen.
python spectrum.py --model-name <insert local or HF repo here>
--top-percent <top % of SNR ratios to target>
unfrozen_parameters:
- ^lm_head.weight$
- ^model.embed_tokens.weight$
# input_layernorm layers
- model.layers.0.input_layernorm
- model.layers.1.input_layernorm
- model.layers.2.input_layernorm
- model.layers.3.input_layernorm
- model.layers.4.input_layernorm
- model.layers.5.input_layernorm
# lm_head layers
# mlp.down_proj layers
- model.layers.21.mlp.down_proj
- model.layers.20.mlp.down_proj
- model.layers.22.mlp.down_proj
- model.layers.19.mlp.down_proj
- model.layers.23.mlp.down_proj
- model.layers.24.mlp.down_proj
. . .
"model.layers.10.self_attn.o_proj": {
"snr": 0.25031203031539917,
"type": "self_attn.o_proj"
},
"model.layers.11.self_attn.o_proj": {
"snr": 0.2547757625579834,
"type": "self_attn.o_proj"
},
"model.layers.12.self_attn.o_proj": {
"snr": 0.2616233825683594,
"type": "self_attn.o_proj"
},
"model.layers.13.self_attn.o_proj": {
"snr": 0.2736438810825348,
"type": "self_attn.o_proj"
},
. . .
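A minimal sketch of how a Spectrum unfrozen_parameters list like the one above could be applied before full fine-tuning: freeze every parameter, then unfreeze the ones matching the selected patterns. The snr_results.yaml file name is a placeholder for whatever file holds the tool's output.

# Sketch: freeze everything except the Spectrum-selected high-SNR layers
import re
import yaml
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # example model

with open("snr_results.yaml") as f:  # placeholder name for the Spectrum output file
    patterns = yaml.safe_load(f)["unfrozen_parameters"]

for name, param in model.named_parameters():
    # Trainable only if the parameter name matches one of the Spectrum patterns
    param.requires_grad = any(re.search(p, name) for p in patterns)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")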
Technique #2 - Model distillation with DistillKit
https://blog.arcee.ai/distillkit-v0-1-by-arcee-ai/ (08/2024) + https://github.com/arcee-ai/DistillKit
Intuition: if we have excellent large models available, can we teach smaller models to mimic their output, at a fraction of the time and cost it would take to pretrain them?
Workflow: run inference with the teacher model over a labeled dataset to collect the teacher logits, then train the student model (SFT) on the same labeled dataset against those logits, producing the distilled model.
Loss function: minimize the difference between the teacher and student token distributions at each output position (Kullback-Leibler divergence).
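Below is a minimal sketch of the loss just described: a temperature-scaled Kullback-Leibler divergence between the teacher and student token distributions. This is the generic logit-distillation formulation, not necessarily DistillKit's exact implementation.

# Sketch: KL-divergence distillation loss over logits of shape (batch, seq_len, vocab)
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    # Soften both distributions, then compare them position by position
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    return kl * temperature ** 2  # standard scaling to keep gradient magnitudes comparable

# In practice this term is usually mixed with the regular cross-entropy (SFT) loss,
# e.g. loss = alpha * distillation_loss(...) + (1 - alpha) * ce_loss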
10B is the new 72B (01/2025)
https://www.arcee.ai/blog/virtuoso-lite-virtuoso-medium-v2-distilling-deepseek-v3-into-10b-32b-small-language-models-slms
https://huggingface.co/arcee-ai/Virtuoso-Lite + https://huggingface.co/arcee-ai/Virtuoso-Medium-v2
Technique #3 - Model merging with MergeKit
https://arxiv.org/abs/2403.13257 (03/2024) + https://github.com/arcee-ai/mergekit
Intuition: if we have plenty of open-source models available, can we build a
model by merging several models that already have the qualities we need?
• Combine multiple task-specific models into a single multitask
model without any additional training
• Not an ensembling technique: only one model,
no inference penalty
• No training involved, only lightweight compute required
(Figure: Domain A, Domain B, and Domain C embeddings merged into a single model)
« We find that model merging is not merely a process of
aggregation, but a transformative method that can drive substantial
advancements in model capabilities characterized by highly
nonlinear interactions between model parameters, resulting in new
functionalities that neither parent model could achieve alone,
leading to improved performance in domain-specific assessments. »
Fine-tuning large language models for domain adaptation:
exploration of training strategies, scaling, model merging and
synergistic capabilities - MIT report in Nature
Merging models with MergeKit
HuggingFace Space: https://huggingface.co/spaces/arcee-ai/mergekit-gui
Enterprise edition + support: https://www.arcee.ai/product/mergekit
$ git clone https://github.com/arcee-ai/mergekit.git
$ cd mergekit
$ pip install -e .
Merging three 8B models with TIES
1 minute on my Mac :)
models:
- model: defog/llama-3-sqlcoder-8b
parameters:
density: 1.0
weight: 0.2
- model: MathGenie/MathCoder2-Llama-3-8B
parameters:
density: 1.0
weight: 0.6
- model: ajibawa-2023/Code-Llama-3-8B
parameters:
density: 1.0
weight: 0.2
merge_method: ties
base_model: meta-llama/Llama-3.1-8B
dtype: float16
GSM8K scores (lm_eval, 12/16/2024): 72.48%, 65.35%, 64.37%, 73.62%
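Assuming the TIES configuration above is saved as ties-config.yml (a file name chosen here), the merge itself runs with mergekit's command-line entry point:
$ mergekit-yaml ties-config.yml ./merged-model --cuda
The output directory then contains a regular Hugging Face checkpoint that can be evaluated with lm_eval like any other model.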
Arcee Fusion
models:
- model: defog/llama-3-sqlcoder-8b
- model: meta-llama/Llama-3.1-8B
merge_method: arcee_fusion
base_model: meta-llama/Llama-3.1-8B
dtype: float16
============================
WEIGHT DIFFERENCE ANALYSIS
Model 1: meta-llama/Llama-3.1-8B
Model 2: defog/llama-3-sqlcoder-8b
============================
SUMMARY STATISTICS:
Total layers analyzed: 291
Average difference: 98.92%
Median difference: 99.32%
Min difference: 81.91%
Max difference: 100.00%
KL DIVERGENCE STATISTICS:
Average KL divergence: 0.145548
Median KL divergence: 0.012739
Min KL divergence: 0.000142
Max KL divergence: 4.010872
===========================
WEIGHT DIFFERENCE ANALYSIS
Model 1: meta-llama/Llama-3.1-8B
Model 2: ./fusion-demo
===========================
SUMMARY STATISTICS:
Total layers analyzed: 291
Average difference: 10.88%
Median difference: 10.62%
Min difference: 0.00%
Max difference: 99.98%
KL DIVERGENCE STATISTICS:
Average KL divergence: 0.010137
Median KL divergence: 0.001984
Min KL divergence: 0.000000
Max KL divergence: 0.674373
Selected Papers on Model Merging and MergeKit
• Pre-training
• Model Merging in Pre-training of Large Language Models - ByteDance
• Post-training
• Enhancing Jailbreak Resistance in Large Language Models Using Model Merge - NTT
• Unlocking the Potential of Model Merging for Low-Resource Languages
• Model Merging for Knowledge Editing
• Training-Free Mitigation of Language Reasoning Degradation After Multimodal Instruction Tuning
• Use cases
• Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging
• PatientDx: Merging Large Language Models for Protecting Data-Privacy in Healthcare
• AstroMLab 3: Achieving GPT-4o Level Performance in Astronomy with a Specialized 8B-Parameter Large
Language Model
A modern SLM workflow
Libraries available at https://github.com/arcee-ai/
Base model → Continuous pre-training (CPT, with Spectrum) on an unlabeled domain dataset → Domain-adapted model → MergeKit knowledge injection
→ Instruction fine-tuning (IFT, with Spectrum/LoRA, EvolKit, DistillKit) on a Q&A dataset → Instruction-tuned model → MergeKit Q&A fine-tuning
→ Alignment (DPO / GRPO) on a preference dataset → Aligned model → MergeKit alignment
« Arcee's MergeKit: A Toolkit for Merging Large Language Models » https://arxiv.org/abs/2403.13257 (03/2024)
« Spectrum: Targeted Training on Signal to Noise Ratio » https://arxiv.org/abs/2406.06623 (06/2024)
« Merging in a Bottle: Differentiable Adaptive Merging (DAM) and the Path from Averaging to Automation » https://arxiv.org/abs/2410.08371 (10/2024)
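For reference, here is a minimal sketch of the DPO objective used in the alignment step, written directly from the published formulation (Rafailov et al., 2023); libraries such as TRL wrap this in a full trainer.

# Sketch: DPO loss given summed log-probabilities of the chosen and rejected answers
# under the policy being trained and under a frozen reference model
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit reward margins: how much more the policy prefers "chosen" over "rejected",
    # relative to the reference model; beta controls the strength of the KL anchor
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()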
Arcee AI post-trained models
https://huggingface.co/arcee-ai
Best-in-class models based on open-weights architectures
• Qwen2 1.5B 🥇 Best 1.5B model
• Llama 3.1 8B 🥇 Best 8B model
• Qwen2.5 14B 🥇 Best 14B model
• Llama 3.1 70B 🥇 Best 70B model
• Qwen2 72B 🥇 Best Arabic model
Hugging Face OpenLLM Leaderboard benchmarks at the time of model release
https://www.together.ai/models/arcee-ai-arcee-spotlight
https://gitlab.com/juliensimon/arcee-demos/-/blob/main/together-ai/spotlight-together.ipynb
https://www.together.ai/models/afm-4-5b-preview
https://www.arcee.ai/blog/announcing-the-arcee-foundation-model-family
https://www.arcee.ai/blog/deep-dive-afm-4-5b-the-first-arcee-foundational-model
https://www.arcee.ai/blog/extending-afm-4-5b-to-64k-context-length
Arcee Foundation Models (AFM)
AFM-4.5B-Preview
https://api.together.ai/models/arcee-ai/AFM-4.5B-Preview
AFM-4.5B-Preview vs. Qwen-3-4B
8/10. Tie on Industrials, loss on Communication Services.
200 questions generated by Claude Sonnet 3.7
20 questions for each one of the top 10 industries in the S&P 500
Judge: DeepSeek-R1 (670B)
AFM-4.5B-Preview vs. Google Gemma-3n-E4B-it
8/10, tie on Healthcare, loss on IT
200 questions generated by Claude Sonnet 3.7
20 questions for each one of the top 10 industries in the S&P 500
Judge: DeepSeek-R1 (670B)
AFM-4.5B-Preview vs. Llama-3.2-8B
10/10 😃
200 questions generated by Claude Sonnet 3.7
20 questions for each one of the top 10 industries in the S&P 500
Judge: DeepSeek-R1 (670B)
AFM-4.5B-Preview vs. Mixtral-8x7B-Instruct
Almost tied (4/10) with 8% of Mixtral’s size
200 questions generated by Claude Sonnet 3.7
20 questions for each one of the top 10 industries in the S&P 500
Judge: DeepSeek-R1 (670B)
Julien Simon, Chief Evangelist
julien@arcee.ai
linkedin.com/in/juliensimon
youtube.com/juliensimonfr
https://huggingface.co/arcee-ai
Models on OpenRouter and Together AI
Chat with AFM
AFM blog post