Presenting... Foodvision Extended

Over the past few weeks, I’ve worked on an exciting computer vision project: a food classification model that recognizes 20 food items using deep learning. From handling the Food101 dataset to deploying the model on Hugging Face, I’ve gained valuable insights into building, training, and fine-tuning machine-learning models. In this article, I’ll walk you through the entire process, including the challenges I faced, like overfitting, and how I overcame them with data augmentation and advanced techniques like early stopping and scheduling.


Looking Back: The Original FoodVision Mini

Before diving into my food classification model, I want to give a shout-out to the project that inspired me: FoodVision Mini. This project, initially developed to classify just three food categories—pizza 🍕, steak 🥩, and sushi 🍣—was built using the Vision Transformer (ViT) B16 model.

The original FoodVision Mini was impressive for its simplicity and performance, demonstrating how Vision Transformers could be applied to image classification. ViT's attention mechanism breaks an image into small patches and treats each patch like a "word," allowing the model to capture both local and global patterns in the images and achieve high accuracy.
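The "patch as word" idea comes down to simple arithmetic. Here is a sketch, assuming ViT-B/16's standard 224×224 input and 16×16 patches:

```python
# How ViT-B/16 "patchifies" an image: a 224x224 RGB image is split into
# 16x16 patches, and each patch is flattened into a vector that the
# transformer treats as one token ("word") in a sequence.

IMAGE_SIZE = 224
PATCH_SIZE = 16
CHANNELS = 3

def patchify_shapes(image_size: int, patch_size: int, channels: int):
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2             # sequence length for the transformer
    patch_dim = channels * patch_size * patch_size  # length of each flattened patch vector
    return num_patches, patch_dim

num_patches, patch_dim = patchify_shapes(IMAGE_SIZE, PATCH_SIZE, CHANNELS)
print(num_patches, patch_dim)  # 196 patches, each a 768-dim vector
```

So a single image becomes a sequence of 196 tokens, and self-attention lets every patch attend to every other patch — which is exactly where the global context comes from.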

However, as exciting as FoodVision Mini was, I wanted to take this idea a step further. Rather than just classifying three food items, I aimed to:

  1. Expand the number of food categories: My model now classifies 20 different food items instead of just three.
  2. Overcome the limitations of training time and resource usage: By experimenting with techniques like data augmentation, learning rate schedulers, and early stopping, I optimized the model for better performance over longer training runs.

This new project builds on the foundation laid by FoodVision Mini but with more complexity and a broader range of foods, making it more useful for real-world applications like restaurant menu scanning or food/calorie-tracking apps.


Why the Upgrade?

The motivation behind expanding beyond the original FoodVision Mini was simple: to test the boundaries of what could be achieved with more diverse data. While the original project was an excellent starting point, I knew that scaling up the model to classify more food categories would present unique challenges, such as:

  • Handling a broader range of visual diversity: More categories mean more variance in textures, shapes, and colors.
  • Managing overfitting and generalization: With more food items, the model needs to generalize well across different types of images, something I struggled with early on.
  • Efficient training: With 20 categories, training the model took longer, and it became crucial to optimize the training process using techniques like early stopping and learning rate scheduling.

Through these upgrades, my goal was not just to recreate what FoodVision Mini had done, but to enhance it, making the model more robust and scalable to real-world scenarios.


[Image: The model correctly predicting a low-resolution image]

Key Upgrades

  • More Categories: Instead of 3, the model now classifies 20 food items, making it more versatile for food recognition tasks.
  • Enhanced Model Training: I implemented advanced techniques to combat overfitting and optimize training, like data augmentation, schedulers, and early stopping, which helped stabilize the model.
  • Better Deployment: While FoodVision Mini was a great demo, deploying my version on Hugging Face Spaces ensures that anyone can interact with the model live, testing its ability to classify a wider range of food.


Dataset: Food101 with 20 Classes

For this extended model, I worked with the Food101 dataset, a large collection of images featuring 101 different types of food. However, to keep my focus narrow and manageable, I selected 20 diverse food categories that would still provide ample variety for the training process while ensuring the training time remained reasonable. The Food101 dataset provided a solid foundation for training, with diverse and challenging examples across all 20 selected categories. This diversity pushed the model to capture fine details, making it a great learning experience in balancing data diversity and model performance.
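As a rough sketch of how such a subset can be carved out of the full dataset (the class names and the `(path, class_name)` sample format here are illustrative assumptions, not the project's actual code):

```python
# Narrowing Food101's 101 classes down to a chosen subset of 20.
# KEEP_CLASSES is a hypothetical selection; in practice it would hold
# all 20 chosen category names.

KEEP_CLASSES = {"pizza", "steak", "sushi", "hamburger", "ramen"}

def filter_to_subset(samples, keep):
    """Keep only samples whose class is in `keep`, re-indexing labels to 0..k-1.

    samples: iterable of (image_path, class_name) pairs.
    """
    label_map = {name: idx for idx, name in enumerate(sorted(keep))}
    return [(path, label_map[name]) for path, name in samples if name in keep]
```

Re-indexing the labels to a contiguous 0..19 range matters because the classification head will have exactly 20 outputs.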


Dataset Details: Image Distribution

For this project, I organized my dataset into three distinct subsets to ensure a well-rounded evaluation of the model's performance:

  • Training Set: 1,000 images – This set is used to train the model, allowing it to learn the features and characteristics of each food category.
  • Validation Set: 500 images – This set is used during training to tune the model's hyperparameters and make adjustments. It helps assess how well the model generalizes to unseen data.
  • Test Set: 150 images – This set is reserved for final evaluation after the model has been trained. It provides a clear measure of how the model performs on completely new images.

This structured approach to dataset distribution ensures that the model is adequately trained, validated, and tested, leading to a more reliable assessment of its performance in classifying the 20 food categories.
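A minimal sketch of producing such a split (the article doesn't say whether the counts are per class or in total, so this version simply splits a flat list of samples, with a fixed seed for reproducibility):

```python
import random

# Sketch of the 1,000 / 500 / 150 train/val/test split described above.

def split_dataset(samples, n_train=1000, n_val=500, n_test=150, seed=42):
    rng = random.Random(seed)  # fixed seed => the same split on every run
    shuffled = list(samples)
    rng.shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:n_train + n_val + n_test]
    return train, val, test
```

Shuffling before slicing avoids accidentally putting all images of one class into a single subset when the source list is ordered by category.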


Training the Model

I trained the model for 20 epochs, which took about 4 hours in total. During this training phase, the model improved its ability to recognize the food categories. However, I encountered some challenges, particularly with overfitting. This meant that while the model excelled at classifying training data, it struggled when presented with new, unseen images.

For the model architecture, I chose a Vision Transformer (ViT), which has shown remarkable performance in image classification tasks. ViT uses an attention mechanism that enables the model to focus on important features within images rather than relying solely on local patterns like traditional convolutional neural networks (CNNs). This capability allows ViT to capture global dependencies in the data, making it a suitable choice for the complexity of food images.

Despite the initial overfitting concerns, the model has made significant strides in recognizing and classifying the various food items effectively.


Handling Overfitting: Data Augmentation and Early Stopping

To reduce overfitting and make the model generalize better, I used:

  • Data Augmentation: Flipping, rotating, zooming, and shifting images created more diversity in the dataset, forcing the model to learn generalized patterns.
  • Learning Rate Scheduler: This helped by gradually reducing the learning rate, ensuring the model didn’t make drastic updates to the weights in later stages of training.
  • Early Stopping: By stopping the training process when the validation loss stopped improving, I avoided over-training the model.


Performance Metrics

Here’s how the model performed after 20 epochs of training:

  • Accuracy: 86.32%
  • Precision: 82.97%
  • Recall: 80.32%
  • F1 Score: 80.23%

Although the model isn’t perfect, at about 86% accuracy overall, it classifies the majority of food items correctly and quickly. This is a significant improvement over my original model, and I plan to keep refining it to improve these metrics further.
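For reference, metrics like these can be computed with scikit-learn (macro averaging over the 20 classes is an assumption here; the article doesn't state which averaging was used):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def summarize(y_true, y_pred):
    """Compute the four reported metrics from true and predicted class labels."""
    # average="macro" weights all classes equally, so rare foods count
    # as much as common ones.
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
        "f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
    }
```

The gap between accuracy (86%) and macro F1 (80%) is typical when some classes are harder than others: a few frequently confused foods drag the per-class averages down.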


[Image: The model confidently predicting a hot dog]

How Can the Model Be Improved?

While the model is performing well, there are several ways to enhance it further:

  1. Expand the Dataset: The current training set of 1,000 images is limited. Adding more images or using data augmentation (rotation, zoom, flips) can introduce more diversity, helping the model generalize better.
  2. Hyperparameter Tuning: The training parameters could be optimized further. Using techniques like grid search or learning rate scheduling can find the most efficient settings for faster and better learning.


Deployment on Hugging Face

Once the model was trained, I deployed it on Hugging Face to make it accessible to everyone. Hugging Face Spaces offers a user-friendly interface where anyone can test the model in real-time. You can try it out here: [Foodvision Extended]


What's Next?

In the future, I plan to expand the model to classify even more food items, making it even more versatile and useful. Additionally, I will explore other model architectures beyond Vision Transformer to see if they can improve accuracy and performance. This exploration could lead to discovering new techniques and strategies in food classification, ultimately enhancing the user experience and practical applications of the model.


Conclusion

Building and deploying this food classification model was a rewarding experience. It taught me how to tackle common machine learning challenges, such as overfitting, and how to tune hyperparameters for better performance.

If you’re working on similar projects, want to collaborate, or just want to learn how I did all of it, feel free to reach out! I’d love to connect and hear your thoughts!



