Enhancing Vision Models for Fine-Grained Classification
2
INTRODUCTION
• Description:
• This work explores the capabilities and performance of several vision models in classifying images, focusing specifically on dog breeds and food categories.
• It provides a comparative analysis of the effectiveness of EfficientNet, the Vision Transformer, and MobileNet on these classification tasks, modifies the Vision Transformer's positional encoding, and evaluates the performance of the modified model.
• Objectives:
• Performance Evaluation: Compare vision models on Dog Vision and Food Vision datasets.
• Impact Analysis: Assess the effectiveness of rotary positional embeddings in the Vision Transformer.
• Insights: Identify strengths and weaknesses of each model to guide future research and applications.
• Gaps in Current Research:
Fine-Grained Classification
Small-Scale Datasets
3
Models Used:
Vision Transformer
EfficientNet
MobileNet
4
Dog Breed Dataset (150 classes):
We created a dataset of 20,000 images spanning 150 dog breeds to provide a robust test for fine-grained classification.
Food Vision Dataset (3 classes):
A smaller, focused dataset of 100 images across 3 classes was used to evaluate model performance in limited-data scenarios.
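As an illustration, folder-organized datasets like these can be loaded with PyTorch's torchvision; the sketch below assumes class-named subdirectories, and the paths (e.g. dog_vision/train) are hypothetical placeholders, not our actual directory layout:

```python
import torch
from torchvision import datasets, transforms

# Standard ImageNet-style preprocessing so images match what the
# pretrained backbones expect (224x224, ImageNet mean/std).
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# ImageFolder infers class labels from subdirectory names,
# e.g. dog_vision/train/beagle/xxx.jpg -> label "beagle".
train_data = datasets.ImageFolder("dog_vision/train", transform=preprocess)
test_data = datasets.ImageFolder("dog_vision/test", transform=preprocess)

train_loader = torch.utils.data.DataLoader(train_data, batch_size=32, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=32, shuffle=False)
```

The same pattern applies to the Food Vision dataset by pointing ImageFolder at its directory.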
5
Approach:
Transfer Learning:
Models pretrained on ImageNet were fine-tuned on our custom datasets to adapt the learned features to each task.
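A minimal sketch of this setup, assuming torchvision's pretrained-weights API; the choice of EfficientNet-B0 and the hyperparameters are illustrative, not our exact configuration:

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an EfficientNet-B0 pretrained on ImageNet.
model = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)

# Freeze the pretrained feature extractor.
for param in model.features.parameters():
    param.requires_grad = False

# Replace the classification head with one sized for our task,
# e.g. 150 dog breeds.
num_classes = 150
model.classifier[1] = nn.Linear(model.classifier[1].in_features, num_classes)

# Only the new head's parameters are optimized.
optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
```

Freezing the backbone keeps training cheap on small datasets; unfreezing the last few stages for a second, lower-learning-rate pass is a common variant.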
Rotary Positional Embeddings:
We replaced the absolute positional embeddings in the Vision Transformer with rotary positional embeddings to improve the model's ability to capture fine-grained detail.
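The sketch below shows one self-contained way to apply the standard rotary rotation to query/key vectors inside attention; it illustrates the technique rather than reproducing our exact implementation:

```python
import torch

def rotary_embed(x: torch.Tensor) -> torch.Tensor:
    """Apply rotary position embedding to x of shape
    (batch, seq_len, dim), where dim is even.

    Each channel pair (2i, 2i+1) at position m is rotated by the
    angle m * theta_i, with theta_i = 10000 ** (-2i / dim).
    """
    _, seq_len, dim = x.shape
    theta = 10000.0 ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    pos = torch.arange(seq_len, dtype=torch.float32)
    angles = torch.outer(pos, theta)        # (seq_len, dim/2)
    cos, sin = angles.cos(), angles.sin()

    x1, x2 = x[..., 0::2], x[..., 1::2]     # even/odd channel pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Inside attention, queries and keys (not values) are rotated before
# computing dot-product scores:
# q, k = rotary_embed(q), rotary_embed(k)
```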
Evaluation Metrics:
Train/Test Loss
Train/Test Accuracy
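These metrics can be tracked with a generic evaluation loop like the following (a sketch, not our exact training code):

```python
import torch

def evaluate(model, loader, loss_fn, device="cpu"):
    """Return mean loss and accuracy of `model` over `loader`."""
    model.eval()
    total_loss, correct, seen = 0.0, 0, 0
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            logits = model(images)
            total_loss += loss_fn(logits, labels).item() * labels.size(0)
            correct += (logits.argmax(dim=1) == labels).sum().item()
            seen += labels.size(0)
    return total_loss / seen, correct / seen
```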
6
DOG VISION RESULTS
Performance Overview:
This slide presents the performance of each model on the Dog Vision dataset.
Conclusion:
• The Vision Transformer with rotary embeddings outperforms the other models in both training and test accuracy.
• EfficientNet models offer a practical trade-off between
performance and resource usage, suitable for
deployment on devices with limited computational
power.
• MobileNet, while efficient, shows lower accuracy,
indicating a need for further optimization or use in less
complex tasks.
7
FOOD VISION RESULTS
Performance Overview:
This slide presents the performance of each model on the Food Vision dataset.
Conclusion:
• The Vision Transformer excels in both training and test
accuracy, making it a strong candidate for food image
classification tasks.
• EfficientNet models provide a balanced approach, suitable
for applications where computational resources are a
concern.
• MobileNet, despite its efficiency, may require further
optimization for tasks requiring higher accuracy.
8
PREDICTIONS FROM THE BEST MODEL
Vision Transformer on the Dog Vision dataset
Vision Transformer on the Food Vision dataset
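A single-image prediction with the fine-tuned model can be sketched as follows; it reuses the preprocess transform and class list from the earlier snippets, and the sample path is a hypothetical placeholder:

```python
import torch
from PIL import Image

def predict(model, image_path, class_names, device="cpu"):
    """Return the top predicted class name and its softmax probability."""
    image = Image.open(image_path).convert("RGB")
    batch = preprocess(image).unsqueeze(0).to(device)  # add batch dimension
    model.eval()
    with torch.no_grad():
        probs = model(batch).softmax(dim=1).squeeze(0)
    idx = probs.argmax().item()
    return class_names[idx], probs[idx].item()

breed, confidence = predict(model, "samples/beagle.jpg", train_data.classes)
print(f"{breed}: {confidence:.1%}")
```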
9
Explanation of Rotary Embeddings:
Definition: Rotary embeddings are a type of positional encoding
that enhances the self-attention mechanism in Vision
Transformers.
Mechanism: They encode relative position information, allowing
the model to better capture spatial relationships in images.
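Concretely, in the standard rotary formulation (which we assume here matches our implementation), each channel pair $(x_{2i}, x_{2i+1})$ of a query or key at patch position $m$ is rotated by a position-dependent angle:

$$
\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix}
=
\begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix}
\begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix},
\qquad
\theta_i = 10000^{-2i/d}
$$

Because a query rotated by $m\theta_i$ and a key rotated by $n\theta_i$ yield a dot product that depends only on $m - n$, the attention score becomes a function of relative position, which is what lets the model reason about spatial offsets between patches.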
Effect on Vision Transformer’s Performance:
Improved Accuracy:
Dog Vision dataset: test accuracy improved from ~84% to ~89%.
Reduced Test Loss: significantly lower test loss, indicating better generalization.
Benefits:
Enhanced Spatial Understanding: Rotary embeddings improve
the model's ability to understand and process fine-grained spatial
details.
Consistency: Consistently high performance across different
datasets and tasks.
IMPACT OF ROTARY EMBEDDINGS
[Charts: train/test accuracy and loss, with and without rotary embeddings]
10
Expand Dataset Scope:
Additional Classes: Incorporate more classes into both datasets to test model scalability and robustness.
Diverse Images: Include more diverse and challenging images to further evaluate model performance.
Model Enhancements:
Advanced Architectures: Experiment with other state-of-the-art architectures and hybrid models.
Optimized Training: Explore techniques for reducing training time and computational costs without compromising
accuracy.
Real-World Applications:
Deployment: Implement models in real-world applications such as mobile apps and automated systems.
User Interaction: Test model performance with real-time user interaction and feedback to refine models further.
Additional Research:
Fine-Grained Tasks: Investigate the performance of Vision Transformers with rotary embeddings on other fine-grained
classification tasks.
Efficiency Optimization: Focus on reducing the computational requirements of Vision Transformers for more efficient
deployment.
FUTURE WORK
11
REFERENCES
1. Dosovitskiy, A., et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv
preprint arXiv:2010.11929.
2. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. Proceedings of the IEEE
conference on computer vision and pattern recognition (CVPR), 770-778.
3. Tan, M., & Le, Q. (2019). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. International
Conference on Machine Learning (ICML), 6105-6114.
4. Howard, A. G., et al. (2017). MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv
preprint arXiv:1704.04861.
5. Vaswani, A., et al. (2017). Attention is All You Need. Advances in Neural Information Processing Systems (NeurIPS),
5998-6008.
6. Liu, Z., et al. (2021). Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv preprint arXiv:2103.14030.
7. Touvron, H., et al. (2021). Training data-efficient image transformers & distillation through attention. Proceedings of
the 38th International Conference on Machine Learning (ICML), 10347-10357.
8. Ramachandran, P., et al. (2019). Stand-Alone Self-Attention in Vision Models. Advances in Neural Information
Processing Systems (NeurIPS), 68-80.
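9. Su, J., et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv preprint arXiv:2104.09864.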
12