PR-411
Wortsman, Mitchell, et al. "Model soups: averaging weights of multiple fine-tuned models improves
accuracy without increasing inference time." International Conference on Machine Learning. PMLR, 2022.
Sunghoon Joo, VUNO Inc.
2022. 11. 13.
1. Research Background
1. Research Background 3
Pre-training, fine-tuning, selecting a single model and discarding the rest
•Limitations:
•The selected model may not achieve the best possible performance.
•For one, ensembling the outputs of many models can outperform the best single model, albeit at a high computational cost at inference time.
•For another, fine-tuning a model on downstream tasks can sometimes reduce out-of-distribution performance.
https://ai.googleblog.com/2021/05/align-scaling-up-visual-and-vision.html
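For context on the inference-cost point above, output ensembling needs one forward pass per ensemble member at test time, whereas a single (weight-averaged) model needs only one. Below is a minimal sketch of standard output ensembling, not code from the paper; the model list and input tensor are hypothetical placeholders.

```python
import torch

def ensemble_predict(models, x):
    """Average the softmax outputs of k fine-tuned models.

    Inference cost grows linearly with k (one forward pass per member),
    which is the extra cost a single weight-averaged model avoids.
    """
    with torch.no_grad():
        probs = torch.stack([m(x).softmax(dim=-1) for m in models])
    return probs.mean(dim=0)
```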
1. Research Background 4
Approach - average the weights of models fine-tuned independently
•Averaging several of these models to form a model soup requires no additional training and adds no
cost at inference time.
1. Research Background 5
Previous works
•Averaging model weights (interpolated models)
• Stochastic Weight Averaging (SWA) (Izmailov et al., 2018) averages weights along a single optimization trajectory.
• Recent work (Neyshabur et al., 2021) observes that fine-tuned models optimized independently from the same initialization lie in the same basin of the error landscape, inspiring our method.
• Wortsman et al. (2022) average zero-shot and fine-tuned models, finding improvements both in- and out-of-distribution.
Izmailov, Pavel, et al. "Averaging weights leads to wider optima and better generalization." arXiv preprint arXiv:1803.05407 (2018).
Wortsman, Mitchell, et al. "Robust fine-tuning of zero-shot models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
1. Research Background 6
Error landscape visualizations
• θ0 ∈ ℝ^d: shared pre-trained initialization.
[Figure: error-landscape contours for two models fine-tuned from θ0; left panel: two runs with different random seeds; right panel: two runs with different learning rates.]
•These results suggest that interpolating the weights of two fine-tuned solutions can improve accuracy compared to either individual model.
1. Research Background 7
Error landscape visualizations
• θ0 ∈ ℝ^d: shared pre-trained initialization.
[Figure: same error-landscape visualizations as the previous slide (two fine-tuned models with different seeds; two with different learning rates), annotated with the accuracy gain from interpolation.]
•These results suggest that when the two fine-tuned solutions θ1 and θ2 form an angle closer to 90 degrees (measured at the initialization θ0), the linear interpolation path between them tends to reach higher accuracy.
• The interpolation gain is measured as Acc((θ1 + θ2)/2) − (Acc(θ1) + Acc(θ2))/2, i.e., the accuracy of the averaged weights minus the average accuracy of the two endpoints.
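To make these two quantities concrete, the sketch below measures the angle between the fine-tuning displacements θ1 − θ0 and θ2 − θ0 and builds the interpolated model (1 − α)·θ1 + α·θ2. It is an illustration, not the paper's released code; `evaluate`, the model object, and the state dicts are assumed placeholders.

```python
import copy
import torch

def flatten(state_dict):
    # Concatenate all parameters of a state dict into one vector.
    return torch.cat([v.float().flatten() for v in state_dict.values()])

def angle_at_init(theta0, theta1, theta2):
    # Angle (degrees) between the two fine-tuning displacements theta_i - theta_0.
    d1 = flatten(theta1) - flatten(theta0)
    d2 = flatten(theta2) - flatten(theta0)
    cos = torch.dot(d1, d2) / (d1.norm() * d2.norm())
    return torch.rad2deg(torch.acos(cos.clamp(-1.0, 1.0))).item()

def interpolate(model, theta1, theta2, alpha):
    # Load (1 - alpha) * theta1 + alpha * theta2 into a copy of `model`.
    merged = copy.deepcopy(model)
    merged.load_state_dict({k: (1 - alpha) * theta1[k].float() + alpha * theta2[k].float()
                            for k in theta1})
    return merged

# Accuracy gain of the midpoint over the mean of the endpoints (evaluate() is hypothetical):
# gain = evaluate(interpolate(model, theta1, theta2, 0.5)) \
#        - 0.5 * (evaluate(model_1) + evaluate(model_2))
```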
1. Research Background 8
Previous works
•Pre-training and fine-tuning (weight aggregation)
• Shu et al. (PMLR, 2021) attempt to improve transfer learning by using multiple pretrained models with data-dependent gating.
• Shu, Yang, et al. "Zoo-tuning: Adaptive transfer from a zoo of models." International Conference on Machine Learning. PMLR, 2021.
•Ensembles
• Ovadia et al. (NeurIPS 2019) show that ensembles exhibit high accuracy under distribution shift.
• Gontijo-Lopes et al. conduct a large-scale study of ensembles, finding that higher divergence in training methodology leads to
uncorrelated errors and better ensemble accuracy.
Ovadia, Yaniv, et al. "Can you trust your model's uncertainty? evaluating predictive uncertainty under dataset shift." Advances in neural information processing systems 32 (2019).
Gontijo-Lopes, Raphael, Yann Dauphin, and Ekin D. Cubuk. "No one representation to rule them all: Overlapping features of training methods." arXiv preprint arXiv:2110.12899 (2021).
2. Methods
2. Methods 10
Approach to making a model soup
• θ = FineTune(θ0, h): the weights obtained by fine-tuning the pre-trained initialization θ0 with hyperparameter configuration h; the sweep yields θi = FineTune(θ0, hi) for i ∈ S = {1, ..., n}.
•Uniform soup: average all fine-tuned models, θ_soup = (θ1 + ... + θn) / n.
•Because every run is averaged in, low-accuracy models produced by poorly performing hyperparameter configurations can be included in the soup (see the sketch below).
•Learned soup
•Optimizes the model interpolation weights by gradient-based minibatch optimization.
•This procedure requires simultaneously loading all models in memory, which currently hinders its use with large networks.
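The greedy soup used throughout the experiments is the paper's recipe of sorting the fine-tuned models by held-out validation accuracy and adding each one to the soup only if it does not hurt the soup's validation accuracy. Below is a minimal sketch of the uniform and greedy soups; it assumes PyTorch state dicts from a shared initialization and a hypothetical `evaluate(state_dict) -> accuracy` helper for the held-out set.

```python
import torch

def average(state_dicts):
    # Element-wise average of state dicts with identical keys and shapes.
    return {k: torch.stack([sd[k].float() for sd in state_dicts]).mean(dim=0)
            for k in state_dicts[0]}

def uniform_soup(state_dicts):
    # Average every fine-tuned model, including poorly performing runs.
    return average(state_dicts)

def greedy_soup(state_dicts, val_accs, evaluate):
    # Visit models in order of decreasing individual held-out validation accuracy.
    order = sorted(range(len(state_dicts)), key=lambda i: val_accs[i], reverse=True)
    ingredients = [state_dicts[order[0]]]
    soup_acc = evaluate(average(ingredients))
    for i in order[1:]:
        candidate = average(ingredients + [state_dicts[i]])
        acc = evaluate(candidate)
        if acc >= soup_acc:  # keep the model only if the soup's val accuracy does not drop
            ingredients.append(state_dicts[i])
            soup_acc = acc
    return average(ingredients)
```

Here `val_accs` are the individual models' held-out validation accuracies, used only to fix the visiting order.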
3. Experimental Results
3. Experimental Results 12
•The greedy soup improves over the best model in the hyperparameter sweep by 0.7 percentage points.
•Pretraining: CLIP1) ViT-B/32
•Fine-tuning: hyperparameter sweep for fine-tuning each model on ImageNet.
1) Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International Conference on Machine Learning. PMLR, 2021.
• 5 distribution shifts: evaluation on ImageNet-V2, ImageNet-R, ImageNet-Sketch, ObjectNet, and ImageNet-A.
Model soups improve accuracy over the best individual fine-tuned model
3. Experimental Results 13
Performance of the ‘Greedy soup’ for CLIP
•The greedy soup outperforms the best individual model—with no extra training and no extra compute
during inference, we were able to produce a better model.
3. Experimental Results 14
Performance of the ‘Greedy soup’ for CLIP
•The greedy soup needs fewer fine-tuned models to reach the accuracy obtained by selecting the best individual model on the held-out validation set.
3. Experimental Results 15
•The greedy soup improves over the best model in the hyperparameter sweep by 0.5 percentage points.
•Pretraining: ALIGN1) EfficientNet-L2
•Fine-tuning: hyperparameter sweep for fine-tuning each model on ImageNet.
• AdamW with weight decay of 0.1 at a resolution of 289 × 289 for 25 epochs
• Linear probe initialization
• Grid search over learning rate (10⁻⁶, 2 × 10⁻⁶, 5 × 10⁻⁶, 1 × 10⁻⁵, 2 × 10⁻⁵), data augmentation, and mixup, obtaining 12 fine-tuned models
• The greedy soup selects 5 of the 12 models
1) Jia, Chao, et al. "Scaling up visual and vision-language representation learning with noisy text supervision." International Conference on Machine Learning. PMLR, 2021.
Model soups improve accuracy over the best individual fine-tuned model
3. Experimental Results 16
•Pretraining: JFT-3B pre-trained ViT-G/14
•Fine-tuning: hyperparameter sweep for fine-tuning each model on ImageNet.
Model soups improve accuracy over the best individual fine-tuned model
A model soup surpasses the previous state of the art of 90.88% attained by the CoAtNet model (Dai et al., 2021), while requiring 25% fewer FLOPs at inference time.
/ 21
2. Methods
3. Experimental Results 17
ViT-G/14 model pre-trained on JFT-3B -> ImageNet fine-tuning
•58 models fine-tuned: We vary the learning rate, decay schedule, loss function, and minimum crop size in the data
augmentation, and optionally apply RandAugment (Cubuk et al., 2020), mixup (Zhang et al., 2017), or CutMix (Yun et
al., 2019). We also train four models with sharpness-aware minimization (SAM) (Foret et al., 2021).
•Our greedy soup procedure selects 14 of the 58 fine-tuned models.
Model selection using test set
• 5 distribution shifts: evaluation on ImageNet-V2, ImageNet-R, ImageNet-Sketch, ObjectNet, and ImageNet-A.
3. Experimental Results 18
Fine-tuning on text classification tasks (BERT, T51)
•Datasets and tasks used:
•MRPC
 • Label: paraphrase or not
•RTE (https://huggingface.co/datasets/SetFit/rte)
 • Label: entailment or not
•CoLA
 • Label: grammatically acceptable or not
•SST-2 (Stanford Sentiment Treebank)
 • Label: negative or positive (movie reviews)
We fine-tune 32 models for each dataset with a random hyperparameter search over learning rate, batch size, number of epochs, and random seed.
1) Raffel, Colin, et al. "Exploring the limits of transfer learning with a unified text-to-text transformer." J. Mach. Learn. Res. 21.140 (2020): 1-67.
3. Experimental Results 19
Fine-tuning on text classification tasks
•Although the gains are not as pronounced as in image classification, the greedy soup can still outperform the best individual model.
We fine-tune 32 models for each dataset with a random hyperparameter search over learning rate, batch size, number of epochs, and random seed.
4. Conclusion
4. Conclusions 21
• Main contribution
• Our results challenge the conventional procedure of selecting the best model on the
held-out validation set when fine-tuning.
• With no extra compute during inference, we are often able to produce a better
model by averaging the weights of multiple fine-tuned solutions.
• Limitations
• (1) Experiments only cover models pre-trained on large, heterogeneous datasets. Results for ImageNet-22K -> ImageNet fine-tuning are included, but the improvement is weaker than for CLIP or ALIGN -> ImageNet.
• (2) Ensembling has been shown to improve model calibration, but model soups did not.
Thank you.