Enhancing Vision Models for Fine-Grained Classification
2
INTRODUCTION
• Description:
• This work explores the capabilities and performance of several vision models in classifying images, focusing specifically on dog breeds and food categories.
• It provides a comparative analysis of the effectiveness of EfficientNet, the Vision Transformer, and MobileNet on these classification tasks, modifies the Vision Transformer's positional encoding, and evaluates the performance of the modified model.
• Objectives:
• Performance Evaluation: Compare vision models on Dog Vision and Food Vision datasets.
• Impact Analysis: Assess the effectiveness of rotary positional embeddings in the Vision Transformer.
• Insights: Identify strengths and weaknesses of each model to guide future research and applications.
• Gaps in Current Research:
Fine-Grained Classification
Small-Scale Datasets
3
Models Used:
Vision Transformer
EfficientNet
MobileNet
4
Dog Breed Dataset (150 classes):
We created a dataset of 20,000 images spanning 150 dog breeds to provide a robust test for fine-grained classification.
Food Vision Dataset (3 classes):
A smaller, focused dataset of 100 images across 3 classes was used to evaluate model performance in limited-data scenarios.
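As an illustration, folder-organized datasets like these can be loaded with PyTorch's torchvision; the sketch below assumes class-named subdirectories, and the paths (e.g. dog_vision/train) are hypothetical placeholders, not our actual directory layout:

```python
import torch
from torchvision import datasets, transforms

# Standard ImageNet-style preprocessing so images match what the
# pretrained backbones expect (224x224, ImageNet mean/std).
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# ImageFolder infers class labels from subdirectory names,
# e.g. dog_vision/train/beagle/xxx.jpg -> label "beagle".
train_data = datasets.ImageFolder("dog_vision/train", transform=preprocess)
test_data = datasets.ImageFolder("dog_vision/test", transform=preprocess)

train_loader = torch.utils.data.DataLoader(train_data, batch_size=32, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=32, shuffle=False)
```

The same pattern applies to the Food Vision dataset by pointing ImageFolder at its directory.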
5
Approach:
Transfer Learning:
Models pretrained on ImageNet were fine-tuned on our custom datasets to adapt the learned features to each task.
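A minimal sketch of this setup, assuming torchvision's pretrained-weights API; the choice of EfficientNet-B0 and the hyperparameters are illustrative, not our exact configuration:

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an EfficientNet-B0 pretrained on ImageNet.
model = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)

# Freeze the pretrained feature extractor.
for param in model.features.parameters():
    param.requires_grad = False

# Replace the classification head with one sized for our task,
# e.g. 150 dog breeds.
num_classes = 150
model.classifier[1] = nn.Linear(model.classifier[1].in_features, num_classes)

# Only the new head's parameters are optimized.
optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
```

Freezing the backbone keeps training cheap on small datasets; unfreezing the last few stages for a second, lower-learning-rate pass is a common variant.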
Rotary Positional Embeddings:
We replaced the absolute positional embeddings in the Vision Transformer with rotary positional embeddings to improve the model's ability to capture fine-grained detail.
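The sketch below shows one self-contained way to apply the standard rotary rotation to query/key vectors inside attention; it illustrates the technique rather than reproducing our exact implementation:

```python
import torch

def rotary_embed(x: torch.Tensor) -> torch.Tensor:
    """Apply rotary position embedding to x of shape
    (batch, seq_len, dim), where dim is even.

    Each channel pair (2i, 2i+1) at position m is rotated by the
    angle m * theta_i, with theta_i = 10000 ** (-2i / dim).
    """
    _, seq_len, dim = x.shape
    theta = 10000.0 ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    pos = torch.arange(seq_len, dtype=torch.float32)
    angles = torch.outer(pos, theta)        # (seq_len, dim/2)
    cos, sin = angles.cos(), angles.sin()

    x1, x2 = x[..., 0::2], x[..., 1::2]     # even/odd channel pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Inside attention, queries and keys (not values) are rotated before
# computing dot-product scores:
# q, k = rotary_embed(q), rotary_embed(k)
```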
Evaluation Metrics:
Train/Test Loss
Train/Test Accuracy
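These metrics can be tracked with a generic evaluation loop like the following (a sketch, not our exact training code):

```python
import torch

def evaluate(model, loader, loss_fn, device="cpu"):
    """Return mean loss and accuracy of `model` over `loader`."""
    model.eval()
    total_loss, correct, seen = 0.0, 0, 0
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            logits = model(images)
            total_loss += loss_fn(logits, labels).item() * labels.size(0)
            correct += (logits.argmax(dim=1) == labels).sum().item()
            seen += labels.size(0)
    return total_loss / seen, correct / seen
```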
6
DOG VISION RESULTS
Performance Overview:
This slide presents the performance of each model on the Dog Vision dataset.
Conclusion:
• The Vision Transformer with rotary embeddings outperforms the other models in both training and test accuracy.
• EfficientNet models offer a practical trade-off between
performance and resource usage, suitable for
deployment on devices with limited computational
power.
• MobileNet, while efficient, shows lower accuracy,
indicating a need for further optimization or use in less
complex tasks.
7
FOOD VISION RESULTS
Performance Overview:
This slide presents the performance of each model on the Food Vision dataset.
Conclusion:
• The Vision Transformer excels in both training and test
accuracy, making it a strong candidate for food image
classification tasks.
• EfficientNet models provide a balanced approach, suitable
for applications where computational resources are a
concern.
• MobileNet, despite its efficiency, may require further
optimization for tasks requiring higher accuracy.
8
PREDICTIONS FROM THE BEST MODEL
Vision Transformer on the Dog Vision dataset
Vision Transformer on the Food Vision dataset
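A single-image prediction with the fine-tuned model can be sketched as follows; it reuses the preprocess transform and class list from the earlier snippets, and the sample path is a hypothetical placeholder:

```python
import torch
from PIL import Image

def predict(model, image_path, class_names, device="cpu"):
    """Return the top predicted class name and its softmax probability."""
    image = Image.open(image_path).convert("RGB")
    batch = preprocess(image).unsqueeze(0).to(device)  # add batch dimension
    model.eval()
    with torch.no_grad():
        probs = model(batch).softmax(dim=1).squeeze(0)
    idx = probs.argmax().item()
    return class_names[idx], probs[idx].item()

breed, confidence = predict(model, "samples/beagle.jpg", train_data.classes)
print(f"{breed}: {confidence:.1%}")
```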
9
Explanation of Rotary Embeddings:
Definition: Rotary embeddings are a type of positional encoding
that enhances the self-attention mechanism in Vision
Transformers.
Mechanism: They encode relative position information, allowing
the model to better capture spatial relationships in images.
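Concretely, in the standard rotary formulation (which we assume here matches our implementation), each channel pair $(x_{2i}, x_{2i+1})$ of a query or key at patch position $m$ is rotated by a position-dependent angle:

$$
\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix}
=
\begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix}
\begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix},
\qquad
\theta_i = 10000^{-2i/d}
$$

Because a query rotated by $m\theta_i$ and a key rotated by $n\theta_i$ yield a dot product that depends only on $m - n$, the attention score becomes a function of relative position, which is what lets the model reason about spatial offsets between patches.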
Effect on Vision Transformer’s Performance:
Improved Accuracy:
Dog Vision dataset: test accuracy improved from ~84% to ~89%.
Reduced Test Loss: significantly lower test loss, indicating better generalization.
Benefits:
Enhanced Spatial Understanding: Rotary embeddings improve
the model's ability to understand and process fine-grained spatial
details.
Consistency: Consistently high performance across different
datasets and tasks.
IMPACT OF ROTARY EMBEDDINGS
[Charts: train/test accuracy and loss, with and without rotary embeddings]
10
Expand Dataset Scope:
Additional Classes: Incorporate more classes into both datasets to test model scalability and robustness.
Diverse Images: Include more diverse and challenging images to further evaluate model performance.
Model Enhancements:
Advanced Architectures: Experiment with other state-of-the-art architectures and hybrid models.
Optimized Training: Explore techniques for reducing training time and computational costs without compromising
accuracy.
Real-World Applications:
Deployment: Implement models in real-world applications such as mobile apps and automated systems.
User Interaction: Test model performance with real-time user interaction and feedback to refine models further.
Additional Research:
Fine-Grained Tasks: Investigate the performance of Vision Transformers with rotary embeddings on other fine-grained
classification tasks.
Efficiency Optimization: Focus on reducing the computational requirements of Vision Transformers for more efficient
deployment.
FUTURE WORK
11
REFERENCES
1. Dosovitskiy, A., et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv
preprint arXiv:2010.11929.
2. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. Proceedings of the IEEE
conference on computer vision and pattern recognition (CVPR), 770-778.
3. Tan, M., & Le, Q. (2019). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. International
Conference on Machine Learning (ICML), 6105-6114.
4. Howard, A. G., et al. (2017). MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv
preprint arXiv:1704.04861.
5. Vaswani, A., et al. (2017). Attention is All You Need. Advances in Neural Information Processing Systems (NeurIPS),
5998-6008.
6. Liu, Z., et al. (2021). Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv preprint arXiv:2103.14030.
7. Touvron, H., et al. (2021). Training data-efficient image transformers & distillation through attention. Proceedings of
the 38th International Conference on Machine Learning (ICML), 10347-10357.
8. Ramachandran, P., et al. (2019). Stand-Alone Self-Attention in Vision Models. Advances in Neural Information
Processing Systems (NeurIPS), 68-80.
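9. Su, J., et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv preprint arXiv:2104.09864.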
12