Presenting... FoodVision Extended
Over the past few weeks, I’ve worked on an exciting computer vision project: a food classification model that recognizes 20 food items using deep learning. From handling the Food101 dataset to deploying the model on Hugging Face, I’ve gained valuable insights into building, training, and fine-tuning machine-learning models. In this article, I’ll walk you through the entire process, including the challenges I faced, like overfitting, and how I overcame them with data augmentation and advanced techniques like early stopping and scheduling.
Looking Back: The Original FoodVision Mini
Before diving into my food classification model, I want to give a shout-out to the project that inspired me: FoodVision Mini. This project, initially developed to classify just three food categories—pizza 🍕, steak 🥩, and sushi 🍣—was built using the Vision Transformer (ViT) B16 model.
The original FoodVision Mini was impressive for its simplicity and performance, focusing on demonstrating how Vision Transformers could be applied to image classification tasks. By leveraging ViT's attention mechanism, which breaks an image into small patches and treats each patch like a "word," it allowed the model to achieve high accuracy while capturing both local and global patterns in the images.
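To make the "patches as words" idea concrete, here is the quick arithmetic behind ViT-B/16: a 224×224 input is cut into 16×16 patches, so the model processes the image as a sequence of 196 patch tokens.

```python
# ViT-B/16 patch arithmetic: a 224x224 image cut into 16x16 patches.
image_size = 224
patch_size = 16

patches_per_side = image_size // patch_size   # 14 patches along each axis
num_patches = patches_per_side ** 2           # 196 patches total

# Each RGB patch is flattened into 16 * 16 * 3 = 768 values before being
# linearly projected into the transformer's embedding space.
patch_dim = patch_size * patch_size * 3

print(patches_per_side, num_patches, patch_dim)  # 14 196 768
```

The attention layers then let every one of those 196 patches attend to every other, which is how ViT captures both local and global patterns.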
However, as exciting as FoodVision Mini was, I wanted to take the idea a step further: rather than classifying just three food items, I aimed to scale the model up to 20 food categories.
This new project builds on the foundation laid by FoodVision Mini but with more complexity and a broader range of foods, making it more useful for real-world applications like restaurant menu scanning or food/calorie-tracking apps.
Why the Upgrade?
The motivation behind expanding beyond the original FoodVision Mini was simple: to test the boundaries of what could be achieved with more diverse data. While the original project was an excellent starting point, I knew that scaling up the model to classify more food categories would present unique challenges, such as overfitting and longer training times.
Through these upgrades, my goal was not just to recreate what FoodVision Mini had done, but to enhance it, making the model more robust and scalable to real-world scenarios.
Dataset: Food101 with 20 Classes
For this extended model, I worked with the Food101 dataset, a large collection of images spanning 101 types of food. To keep the project manageable and the training time reasonable, I selected 20 diverse food categories that still provide ample variety for training. These categories offered challenging examples that pushed the model to capture fine details, making the project a great exercise in balancing data diversity with model performance.
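Narrowing Food101 down to a subset boils down to filtering the samples and re-indexing the labels. The article doesn't list the exact 20 classes used, so the names below are illustrative Food101 categories; this is a minimal sketch of the filtering step, not the project's actual data-loading code.

```python
# Sketch: keep only a chosen subset of Food101 classes and re-index labels.
# The class names below are illustrative; the project's actual 20 classes
# are not listed in the article.
KEEP_CLASSES = ["pizza", "steak", "sushi", "hamburger", "ramen"]

def filter_samples(samples, keep_classes):
    """samples: list of (image_path, class_name) pairs.
    Returns (filtered_samples, class_to_idx) restricted to keep_classes."""
    class_to_idx = {name: i for i, name in enumerate(sorted(keep_classes))}
    filtered = [(path, class_to_idx[name])
                for path, name in samples if name in class_to_idx]
    return filtered, class_to_idx

samples = [("pizza/001.jpg", "pizza"), ("tacos/002.jpg", "tacos"),
           ("sushi/003.jpg", "sushi")]
subset, mapping = filter_samples(samples, KEEP_CLASSES)
print(subset)  # only the pizza and sushi entries survive the filter
```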
Dataset Details: Image Distribution
For this project, I organized my dataset into three distinct subsets (training, validation, and test) to ensure a well-rounded evaluation of the model's performance.
This structured approach to dataset distribution ensures that the model is adequately trained, validated, and tested, leading to a more reliable assessment of its performance in classifying the 20 food categories.
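A deterministic three-way split is easy to sketch in plain Python. The article doesn't state the exact ratios used, so the 70/15/15 fractions below are an assumption for illustration.

```python
import random

def three_way_split(items, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle items deterministically, then cut them into
    train / validation / test subsets. The 70/15/15 fractions are an
    assumption; the article does not state the exact ratios used."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed -> reproducible split
    n_train = int(len(items) * train_frac)
    n_val = int(len(items) * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

train, val, test = three_way_split(range(1000))
print(len(train), len(val), len(test))  # 700 150 150
```

Keeping the test subset untouched until the very end is what makes the final accuracy number a trustworthy estimate of real-world performance.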
Training the Model
I trained the model for 20 epochs, which took about 4 hours in total. During this training phase, the model improved its ability to recognize the food categories. However, I encountered some challenges, particularly with overfitting. This meant that while the model excelled at classifying training data, it struggled when presented with new, unseen images.
For the architecture, I used a Vision Transformer (ViT), which has shown remarkable performance in image classification tasks. ViT uses an attention mechanism that enables the model to focus on important features within images rather than relying solely on local patterns like traditional convolutional neural networks (CNNs). This capability allows ViT to capture global dependencies in the data, making it a suitable choice for the complexity of food images.
Despite the initial overfitting concerns, the model has made significant strides in recognizing and classifying the various food items effectively.
Handling Overfitting: Data Augmentation and Early Stopping
To reduce overfitting and help the model generalize better, I used data augmentation, early stopping, and learning-rate scheduling.
Performance Metrics
After 20 epochs of training, the model reached about 86% overall accuracy. While that isn't perfect, it predicts the majority of food items accurately and quickly. This is a significant improvement over my original model, and I plan to keep refining it to push these metrics further.
How Can the Model Be Improved?
While the model is performing well, there are several ways to enhance it further, such as expanding the set of food categories and exploring architectures beyond the Vision Transformer.
Deployment on Hugging Face
Once the model was trained, I deployed it on Hugging Face to make it accessible to everyone. Hugging Face Spaces offers a user-friendly interface where anyone can test the model in real time. You can try it out here: [FoodVision Extended]
What's Next?
In the future, I plan to expand the model to classify even more food items, making it even more versatile and useful. Additionally, I will explore other model architectures beyond Vision Transformer to see if they can improve accuracy and performance. This exploration could lead to discovering new techniques and strategies in food classification, ultimately enhancing the user experience and practical applications of the model.
Conclusion
Building and deploying this food classification model was a rewarding experience that taught me how to tackle common machine learning challenges, such as overfitting and tuning hyperparameters for better performance.
If you’re working on similar projects, want to collaborate, or just want to learn how I did it, feel free to reach out. I’d love to connect and hear your thoughts!