Deep Dive: Optimizing LLM Inference
Companion video: https://youtu.be/hMs8VNRy5Ys
Julien Simon
https://www.linkedin.com/in/juliensimon
https://www.youtube.com/juliensimonfr
The author of this material is Julien Simon https://www.linkedin.com/in/juliensimon unless explicitly mentioned.
This material is shared under the CC BY-NC 4.0 license https://creativecommons.org/licenses/by-nc/4.0/
You are free to share and adapt this material, provided that you give appropriate credit, provide a link to the license, and indicate if changes were made.
You may not use the material for commercial purposes. You may not apply any restriction on what the license permits.
Decoder-only inference
• GPT-like models are decoder-only models
• No encoder, no encoder-decoder multi-head attention
• Input processing (aka prefill): highly parallel
• Inputs (the tokenized prompt) are embedded and encoded
• Multi-head attention computes the keys and values (KV)
• Large matrix multiplication, high usage of the hardware accelerator
• Output generation: sequential
• The answer is generated one token at a time
• Each generated token is appended to the previous input
• The process repeats until a stopping criterion is met (max length or EOS)
• Low usage of the hardware accelerator (illustrated in the sketch below)
"Attention is all you need"
https://arxiv.org/abs/1706.03762
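To make the two phases concrete, here is a minimal greedy decoding loop (a simplified sketch; the checkpoint and generation length are arbitrary choices, not from the original deck). The prompt is processed in a single parallel pass, then each output token requires another forward pass; note that this naive loop recomputes attention over the entire sequence at every step, which is exactly what the KV cache on the next slide avoids:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # any decoder-only checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(16):                                  # stopping criterion: max length
        logits = model(input_ids).logits                 # full forward pass over the whole sequence
        next_token = logits[:, -1, :].argmax(dim=-1)     # greedy pick of the next token
        input_ids = torch.cat([input_ids, next_token[:, None]], dim=-1)
        if next_token.item() == tokenizer.eos_token_id:  # stopping criterion: EOS
            break
print(tokenizer.decode(input_ids[0]))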
The KV cache
• Can we avoid recomputing the KV values for the input tokens again and again?
• We only really need to compute them for the token we just generated and appended to the input!
• The KV cache stores the KV values for all previous tokens in the accelerator RAM
• Of course, the cache is empty for the first generated token, which is why it takes longer to generate 😃
• Cache size (FP16) = 2 * 2 * batch_size * seq_length * num_layers * embeddings_length ➡ gigabytes
(2 for keys and values, times 2 bytes per FP16 value)
"Generative LLM inference" https://awsdocs-neuron.readthedocs-hosted.com
https://youtu.be/2TT384U4vQg
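To make the formula concrete, here is a back-of-the-envelope calculation (the dimensions below are typical of a Llama-2-7B-class model and are assumptions, not measurements):

# 2 (keys and values) * 2 bytes (FP16) * batch * sequence * layers * hidden size
batch_size, seq_length = 8, 4096
num_layers, embeddings_length = 32, 4096
cache_bytes = 2 * 2 * batch_size * seq_length * num_layers * embeddings_length
print(f"KV cache: {cache_bytes / 1e9:.1f} GB")  # ~17.2 GB for this configuration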
Continuous batching
https://www.usenix.org/conference/osdi22/presentation/yu (09/2022)
• Decoder-only inference requests are harder to batch than those of traditional Transformers
• Input and output lengths can vary greatly, leading to very different generation times
Traditional batching waits for all requests to complete ➡ low hardware usage
Continuous batching evicts completed requests and runs new requests ➡ high hardware usage
Token generation must pause regularly to run prefill for new requests (waiting_served_ratio parameter in TGI)
Available in Hugging Face TGI
https://www.anyscale.com/blog/continuous-batching-llm-inference
Speculative decoding
https://arxiv.org/abs/2211.17192 (11/2022) + https://huggingface.co/blog/assisted-generation
• The vanilla generation process outputs only one token at a time
• It's also memory-bound: idle compute resources are available
• With a small model (Mq), we predict several potential completions in parallel (aka speculative sampling)
• With the large model (Mp), we evaluate the completions and pick the best one (or correct it)
• Each iteration generates at least one valid token, and potentially more
⍺: acceptance rate of completions (how well Mq approximates Mp)
ɣ: number of predicted tokens
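The paper derives the expected number of tokens produced per large-model pass as (1 − ⍺^(ɣ+1)) / (1 − ⍺). A quick way to play with these values (the numbers below are illustrative, not from the deck):

alpha, gamma = 0.8, 5  # illustrative acceptance rate and number of predicted tokens
expected_tokens = (1 - alpha ** (gamma + 1)) / (1 - alpha)
print(f"{expected_tokens:.2f} tokens per large-model pass")  # ~3.69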
[Figure: results on T5-XXL]
• How can we build Mq?
Speculative decoding: building the small model
1. Select a small off-the-shelf version of the large model
2. Use a basic n-gram model to generate tokens found in the prompt
3. Fine-tune a small model on top of a frozen large model
4. Fine-tune the small and large model together
Speculative decoding: small off-the-shelf model
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

prompt = "Alice and Bob"
checkpoint = "EleutherAI/pythia-1.4b-deduped"            # large target model (Mp)
assistant_checkpoint = "EleutherAI/pythia-160m-deduped"  # small draft model (Mq)
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer(prompt, return_tensors="pt").to(device)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint).to(device)

# Passing assistant_model enables assisted generation (speculative decoding)
outputs = model.generate(**inputs, assistant_model=assistant_model)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
# ['Alice and Bob are sitting in a bar. Alice is drinking a beer and Bob is drinking a']
Note: the two models must share the same tokenizer
Speculative decoding: n-grams
https://github.com/apoorvumang/prompt-lookup-decoding + https://twitter.com/joao_gante/status/1747322418425741550
• Input-grounded tasks (summarization, document QA, multi-turn chat, code editing): high n-gram overlap between the input (prompt) and the generated output
• We can use strings present in the prompt to generate candidate token sequences
• Significant speedups (2x-4x), without model modification and with no effect on output quality
• Implemented in the transformers library
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path, device_map="auto")
. . .  # tokenization and streamer setup elided
generation_output = model.generate(
    **input_ids, do_sample=False,
    max_new_tokens=512, streamer=streamer,
    prompt_lookup_num_tokens=10)  # enables prompt lookup decoding
Available in Hugging Face TGI
Speculative decoding: Medusa
https://arxiv.org/abs/2401.10774 + https://github.com/FasterDecoding/Medusa
• Add decoding heads to the LLM to predict multiple outputs in parallel
• Verify them simultaneously in each decoding step
• Two techniques to fine-tune the new heads
• MEDUSA-1: fine-tune the heads on top of the frozen LLM
  • Parameter-efficient fine-tuning with QLoRA (Vicuna 7B, 60k samples: 5 hours on an A100)
• MEDUSA-2: fine-tune the heads together with the LLM
Available in Hugging Face TGI
What is model merging?
• Building a "great" model is challenging, time-consuming and compute-intensive
• Instead, can we build one by merging several models based on the same architecture?
• Combine multiple task-specific models into a single multitask model without any additional training
• Not an ensembling technique: there's only one model at the end
• Merging only requires lightweight compute
• Fast process, no extra cost for training and inference, no extra inference latency
• Mergekit: https://github.com/arcee-ai/mergekit
Merging techniques
• Model Soups https://arxiv.org/abs/2203.05482 (03/2022)
• Spherical Linear Interpolation (SLERP) https://dl.acm.org/doi/10.1145/325334.325242 (07/1985)
• Task Arithmetic https://arxiv.org/abs/2212.04089 (12/2022)
• Trim, Elect Sign, and Merge (TIES) https://arxiv.org/abs/2306.01708 (06/2023)
• Drop and Rescale (DARE) https://arxiv.org/abs/2311.03099 (11/2023)
• Franken-merging
Merging techniques, continued
• Model Breadcrumbs https://arxiv.org/abs/2312.06795 (12/2023)
• Model Stock https://arxiv.org/abs/2403.19522 (03/2024)
• DELLA https://arxiv.org/abs/2406.11617 (06/2024)
Model soups
https://arxiv.org/abs/2203.05482 (03/2022) + https://github.com/mlfoundations/model-soups
• Average many variants of the same model, trained on the same dataset with different hyper-parameters, aka linear interpolation
• Optionally: weighted average, normalization
• Uniform soup: average all models
• Greedy soup: average models one by one, keeping only the ones that gradually improve test accuracy
• Generally, model soups perform a little worse than ensembles, but are more resilient to out-of-distribution data
[Figures: fine-tuning a CLIP ViT-B/32 model on ImageNet; BERT and T5 on four text classification datasets from the GLUE benchmark]
# Weighted average of the model tensors, with optional normalization
res = (weights * tensors).sum(dim=0)
if self.normalize:
    res /= weights.sum(dim=0)
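As a minimal sketch of the uniform soup in PyTorch (assuming all checkpoint state dicts share the same keys and shapes):

import torch

def uniform_soup(state_dicts):
    # Uniform soup: plain average of all fine-tuned checkpoints, key by key
    return {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }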
Spherical Linear IntERPolation (SLERP)
https://dl.acm.org/doi/10.1145/325334.325242 (07/1985)
• Originally designed for 3D CGI to find smooth transitions between two (camera) rotations
• Only works with two models (one of them can be favored)
• Helps preserve the magnitude of weights
• Helps preserve the "curvature" of the embeddings space
Linear (red) vs. spherical (blue) interpolation
Images: https://www.researchgate.net/figure/Slerp-and-linear-interpolation-comparison_fig2_318221910, https://allenchou.net/2018/05/game-math-deriving-the-slerp-formula/
import numpy as np

def normalize(v, eps):
    # Scale a vector to unit norm (guard against division by zero)
    norm = np.linalg.norm(v)
    return v if norm < eps else v / norm

def slerp(t, v0, v1, eps=1e-10):
    # Copy the vectors to reuse them later
    v0_copy = np.copy(v0)
    v1_copy = np.copy(v1)
    # Normalize the vectors to get the directions and angles
    v0 = normalize(v0, eps)
    v1 = normalize(v1, eps)
    # Dot product of the normalized vectors, clamped for numerical safety
    dot = np.clip(np.sum(v0 * v1), -1.0, 1.0)
    # Calculate initial angle between v0 and v1
    theta_0 = np.arccos(dot)
    sin_theta_0 = np.sin(theta_0)
    if sin_theta_0 < eps:
        return (1 - t) * v0_copy + t * v1_copy  # nearly colinear: fall back to lerp
    # Angle at timestep t
    theta_t = theta_0 * t
    sin_theta_t = np.sin(theta_t)
    # Finish the slerp algorithm
    s0 = np.sin(theta_0 - theta_t) / sin_theta_0
    s1 = sin_theta_t / sin_theta_0
    return s0 * v0_copy + s1 * v1_copy
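A hedged usage example, with random vectors standing in for two models' flattened weight tensors:

rng = np.random.default_rng(0)
v0, v1 = rng.standard_normal(4096), rng.standard_normal(4096)
merged = slerp(0.5, v0, v1)  # halfway between the two models; move t to favor one of them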
[Diagram: slerp between vectors v0 and v1, splitting the angle 𝛳 into t𝛳 and (1-t)𝛳]
Task Arithmetic
https://arxiv.org/abs/2212.04089 (12/2022)
• Pre-trained models can be fine-tuned for a particular task: text classification, summarization, etc.
• Task vector: the tensor updates applied to the pre-trained model during fine-tuning
• A pre-trained model can be fine-tuned for many tasks on many datasets, leading to many task vectors
• Task vectors can be added to or subtracted from the base model to add or remove capabilities
• Many single-task models can be merged to build a multiple-task model (see the sketch below)
• However, as the number of tasks grows, finding the right mix becomes a computationally expensive problem (hyperparameter tuning on a validation set)
[Figures: adding pairs of task vectors to T5-Base; adding task vectors to CLIP]
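A minimal sketch of both operations on PyTorch state dicts (the helper names are mine, not the paper's):

import torch

def task_vector(base_sd, finetuned_sd):
    # Task vector: fine-tuned weights minus base weights
    return {k: finetuned_sd[k] - base_sd[k] for k in base_sd}

def apply_task_vectors(base_sd, task_vectors, scale=1.0):
    # Add task vectors to the base model; a negative scale removes a capability
    merged = {k: v.clone() for k, v in base_sd.items()}
    for tv in task_vectors:
        for k in merged:
            merged[k] += scale * tv[k]
    return merged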
TrIm, Elect Sign and Merge (TIES)
https://arxiv.org/abs/2306.01708 (06/2023)
• Parameter interference can degrade merged performance
  • Influential vs. redundant parameters
  • Sign conflicts
• Trim, Elect Sign & Merge (sketched below)
  • Trim each task vector to retain only the influential parameter values (top-k% largest values)
  • Resolve the sign conflicts between different values
  • Average parameters whose sign agrees with the direction of the largest movement
  • Add the averaged parameters to the original model (with a scale factor)
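A compact per-tensor sketch of the three steps (the top-k thresholding and helper names below are simplifications of mine; see the paper for the exact procedure):

import torch

def ties_merge(base_sd, task_vectors, k=0.2, scale=1.0):
    merged = {key: v.clone() for key, v in base_sd.items()}
    for key in base_sd:
        stacked = torch.stack([tv[key].float() for tv in task_vectors])
        # Trim: keep only the top-k% largest-magnitude values of each task vector
        flat = stacked.abs().flatten(start_dim=1)
        thresh = flat.quantile(1 - k, dim=1)
        mask = stacked.abs() >= thresh.view(-1, *([1] * (stacked.dim() - 1)))
        trimmed = stacked * mask
        # Elect sign: direction of the largest total movement, per parameter
        elected = torch.sign(trimmed.sum(dim=0))
        # Merge: average only the values whose sign agrees with the elected sign
        agree = (torch.sign(trimmed) == elected) & mask
        num = (trimmed * agree).sum(dim=0)
        den = agree.sum(dim=0).clamp(min=1)
        merged[key] += scale * (num / den).to(merged[key].dtype)
    return merged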
Drop And REscale (DARE)
https://arxiv.org/abs/2311.03099 (11/2023) + https://github.com/yule-BUAA/MergeLM
• During fine-tuning, many LLM parameter updates are redundant
• DARE randomly eliminates up to 99% of the updates, with minimal impact on model performance:
  drop p% of the parameters and rescale the rest by 1 / (1 − p)
• The larger the model, the more limited the impact
• This drastically reduces the size of task vectors
• Task vectors can then be applied to a pre-trained model using previous methods like TIES
[Figure: WizardLM-13B (LM), WizardMath-13B (Math), and llama-2-13b-code-alpaca]
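The drop-and-rescale step itself fits in a few lines (a sketch; p is the drop rate, delta is a task vector tensor):

import torch

def dare(delta, p=0.9):
    # Randomly drop a fraction p of the delta parameters (Bernoulli mask)
    mask = torch.bernoulli(torch.full_like(delta, 1 - p))
    # Rescale the survivors by 1 / (1 - p) so the expected update is unchanged
    return delta * mask / (1 - p)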
Franken-merging
• All previous methods require models that share a common architecture (T5, Llama, etc.)
• We can build a new model by stitching layers from different models, possibly with different architectures
• Weights are left untouched (aka passthrough merging)
• Very experimental, but some interesting models are popping up
• https://huggingface.co/models?sort=downloads&search=franken
Goliath-120B layer recipe (passthrough merge of two Llama-2-70b models):
  range 0-16: Xwin
  range 8-24: Euryale
  range 17-32: Xwin
  range 25-40: Euryale
  range 33-48: Xwin
  range 41-56: Euryale
  range 49-64: Xwin
  range 57-72: Euryale
  range 65-80: Xwin
https://huggingface.co/alpindale/goliath-120b
https://huggingface.co/Xwin-LM/Xwin-LM-70B-V0.1
https://huggingface.co/Sao10K/Euryale-1.3-L2-70B
Model breadcrumbs
https://arxiv.org/abs/2312.06795 (12/2023) + https://github.com/rezazzr/breadcrumbs
• Improvement of the Task Arithmetic technique
• Compute a task direction for each fine-tuned model (fine-tuned weights minus base weights)
• Mask (set to zero) a percentage of tiny and large outliers in each task direction, resp. β and ɣ
  (this is equivalent to using the parameter from the base model)
• Sum the base model and the masked task directions, with a task weight (⍺)
• This method outperforms Task Vectors. It also scales much better to a large number of tasks: « After finding optimal hyperparameters for both Model Breadcrumbs and Task Vectors using 10 tasks, we kept these hyperparameters and incrementally merged all 200 tasks to create a multi-task model. […] the observed trend remains, with Model Breadcrumbs (85% sparsity) consistently outperforming Task Vectors by a significant margin as the number of tasks increases. »
• Merging also improves the performance of fine-tuned models
• Optionally, add sign election before merging (TIES method)
[Figure: T5-base fine-tuned on 4 GLUE tasks, then merged with six other T5 variants (IMDB, etc.)]
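A sketch of the masking step on a single task direction (the default β and ɣ values are illustrative assumptions, not the paper's):

import torch

def breadcrumb_mask(delta, beta=0.85, gamma=0.01):
    # Mask the beta fraction of tiniest updates and the gamma fraction of
    # largest outliers (by magnitude); masked entries fall back to the base model
    magnitudes = delta.abs().flatten()
    lo = magnitudes.quantile(beta)       # everything below this is "tiny"
    hi = magnitudes.quantile(1 - gamma)  # everything above this is an outlier
    keep = (delta.abs() >= lo) & (delta.abs() <= hi)
    return delta * keep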
Model stock
https://arxiv.org/abs/2403.19522 (03/2024) + https://github.com/naver-ai/model-stock
• Observation 1: « Fine-tuned weights with different random seeds reside on a very thin shell layer-wise »
  3D intuition: for each layer, the weight vectors of all fine-tuned models live on a sphere
• Observation 2: « Going closer to the center by averaging the weights leads to improving both performances. » [in-distribution and out-of-distribution]
• Model soups work because averaging many single-task models gets us closer to the center.
• However, building a large collection of single-task models can be computationally costly.
• By understanding the geometry of fine-tuned weights, model stock allows us to approximate the center coordinates from just a few models.
• Periodic merging during fine-tuning yields even better results.
Drop and rEscaLe via sampLing with mAgnitude (DELLA)
https://arxiv.org/abs/2406.11617 (06/2024) + https://github.com/declare-lab/della
1. Compute delta parameters for each fine-tuned model (fine-tuned weights minus base weights)
2. Drop a fraction of the deltas:
  • The drop probability of each parameter is inversely proportional to its magnitude
  • For each parameter, we flip a coin with a Bernoulli distribution to decide whether to keep it or drop it
  • If a parameter is dropped, we set it to zero
3. Elect weights for merging, according to the dominant direction (same as TIES; optional in mergekit)
4. Fuse (average) the elected weights
[Figure: WizardLM-13B, WizardMath-13B, llama-2-13b-code-alpaca]
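A sketch of the magnitude-aware drop step (the rank-based probability assignment below is a simplified rendition of the paper's scheme, with illustrative defaults):

import torch

def della_drop(delta, p=0.7, eps=0.1):
    # Keep probability grows with magnitude: rank parameters by |delta|,
    # then spread keep probabilities of width eps around the base rate 1 - p
    flat = delta.flatten()
    ranks = flat.abs().argsort().argsort().float() / max(flat.numel() - 1, 1)
    keep_prob = ((1 - p) - eps / 2 + eps * ranks).clamp(1e-8, 1.0)
    mask = torch.bernoulli(keep_prob)
    # Rescale survivors by 1 / keep_prob so the expected update is unchanged
    return (flat * mask / keep_prob).reshape(delta.shape)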