How to Fix Overfitting, Underfitting, and Imbalanced Data in Machine Learning

Machine learning models are designed to learn patterns from data and make accurate predictions on unseen inputs. In practice, however, models often fall short of their potential due to challenges such as underfitting, overfitting, high variance, poor hyperparameter choices, and imbalanced data distributions. This article offers a deep dive into these topics, explaining each challenge and presenting techniques to address it effectively.


Underfitting vs. Overfitting

[Image: underfitting vs. overfitting (source: GeeksforGeeks)]

Underfitting

Underfitting occurs when a model is too simplistic to capture the underlying structure of the data, leading to high error on both the training and test sets.

Symptoms:

  • Low training accuracy
  • Poor test performance
  • Model cannot learn even simple patterns

Causes:

  • Model is too simple
  • Too much regularization
  • Inadequate features or training time
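
To make this concrete, here is a minimal scikit-learn sketch (synthetic data; the model choices are illustrative assumptions, not a prescription). A plain linear model underfits a quadratic relationship and scores poorly on both splits; adding polynomial features restores the missing capacity:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Quadratic ground truth that a straight line cannot capture
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.5, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Underfit: low R^2 on BOTH the training and test sets
linear = LinearRegression().fit(X_train, y_train)
print(linear.score(X_train, y_train), linear.score(X_test, y_test))

# More capacity (degree-2 features) lifts both scores
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly.fit(X_train, y_train)
print(poly.score(X_train, y_train), poly.score(X_test, y_test))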

Overfitting

Overfitting arises when a model learns not only the signal but also the noise in the training data. While it performs well on the training set, it fails to generalize.

Symptoms:

  • Very high training accuracy
  • Low test accuracy
  • Large gap between training and test accuracy, typically because the model is overly complex relative to the dataset size
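
A short sketch of how these symptoms surface in practice (synthetic data; tree settings are illustrative): an unconstrained decision tree memorizes a small, noisy training set, and the train-test gap exposes the overfitting:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small dataset with 10% label noise baked in
X, y = make_classification(n_samples=300, n_features=20, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unconstrained tree memorizes the training set, noise included
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(tree.score(X_train, y_train), tree.score(X_test, y_test))  # ~1.0 vs. much lower

# Constraining depth narrows the gap
pruned = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(pruned.score(X_train, y_train), pruned.score(X_test, y_test))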


Bias vs. Variance Trade-off

Understanding underfitting and overfitting requires an exploration of the bias-variance trade-off.

[Image: bias-variance trade-off (source: Towards Data Science)]

Bias

Bias is the error due to oversimplifying the model. It prevents the model from capturing the data's true patterns.

  • High Bias → Underfitting
  • Example: Predicting house prices with just the number of rooms

Variance

Variance is the model's sensitivity to fluctuations in training data. High variance indicates the model is too reactive to the specifics of the training set.

  • High Variance → Overfitting
  • Example: A deep neural network trained on small data with no regularization

The Bias-Variance Decomposition

For mean squared error in regression problems:

E[(y - f̂(x))²] = (Bias[f̂(x)])² + Var[f̂(x)] + σ²

where Bias[f̂(x)] = E[f̂(x)] - f(x) measures systematic error, Var[f̂(x)] measures sensitivity to the particular training sample, and σ² is the irreducible noise in the data.
Note: This decomposition is specific to squared error loss. Other loss functions (e.g., log loss in classification) have different error decompositions.

The goal is to find the optimal complexity where both bias and variance are minimized.


Nuanced View of Model Complexity

While simple models often underfit and complex models overfit, the relationship isn't always straightforward. Even simple models can overfit if the dataset is noisy or if the features are irrelevant. Likewise, complex models can generalize well when regularized appropriately or trained with sufficient data.

Controlling Complexity

  • Regularization: Penalize large weights (L1, L2)
  • Pruning: Remove unnecessary branches in decision trees
  • Early Stopping: Halt training when performance on validation set worsens
  • Feature Selection: Reduce noise and redundancy
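
As a rough illustration of the first and third items, here is a minimal scikit-learn sketch (parameter values are assumptions chosen for demonstration):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# L2 regularization: smaller C means a stronger penalty on large weights
l2_model = LogisticRegression(penalty="l2", C=0.1).fit(X, y)

# Early stopping: hold out 10% of the training data and halt once the
# validation score stops improving for 5 consecutive epochs
sgd = SGDClassifier(early_stopping=True, validation_fraction=0.1,
                    n_iter_no_change=5, random_state=0).fit(X, y)
print(sgd.n_iter_)  # epochs actually run before stopping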


Cross-Validation

Cross-validation is a technique used to assess a model's generalization ability and guide decisions like hyperparameter tuning.

[Image: k-fold cross-validation (source: Towards Data Science)]

k-Fold Cross-Validation

  • The dataset is split into k equal parts (folds)
  • The model trains on k-1 folds and validates on the remaining fold
  • The process repeats k times and the average score is taken

Stratified k-Fold

Preserves class distribution across folds — particularly useful for classification tasks, especially with imbalanced datasets.
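
A minimal sketch contrasting the two splitters on an imbalanced toy problem (the 90/10 split and the model are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

# Roughly 90/10 class split
X, y = make_classification(n_samples=500, weights=[0.9], random_state=0)
model = LogisticRegression(max_iter=1000)

# Plain k-fold: individual folds may drift from the 90/10 ratio
print(cross_val_score(model, X, y,
                      cv=KFold(n_splits=5, shuffle=True, random_state=0)).mean())

# Stratified k-fold: every fold keeps the class distribution
print(cross_val_score(model, X, y,
                      cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0)).mean())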

Other Methods

  • Leave-One-Out (LOO): Extreme version of k-fold where k = number of samples
  • Time Series Split: Preserves temporal order for time-dependent data

Cross-validation helps reduce overfitting by validating model performance on different slices of the data.


Hyperparameter Tuning

Hyperparameters are external configurations that control the model's learning process, such as tree depth, learning rate, or number of clusters. Unlike model parameters, they are not learned during training and must be tuned manually or algorithmically.

1. Grid Search Cross-Validation

Grid Search evaluates all possible combinations from a specified parameter grid using cross-validation.

Pros:

  • Exhaustive
  • Simple to implement

Cons:

  • Computationally expensive
  • Inefficient when the grid is large
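
A minimal sketch with scikit-learn's GridSearchCV (the grid itself is an arbitrary example):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

# 3 x 3 = 9 combinations, each evaluated with 5-fold CV: 45 model fits
param_grid = {"max_depth": [3, 5, 10], "n_estimators": [50, 100, 200]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)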

2. Random Search

Randomly samples parameter combinations and evaluates them.

Pros:

  • Often more efficient than Grid Search
  • Can find good combinations faster

Cons:

  • May miss optimal points if not enough iterations
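
The same search expressed with RandomizedSearchCV, drawing a fixed budget of samples from distributions instead of enumerating a grid (the distributions below are illustrative):

from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)

# 20 random draws from the distributions, 5-fold CV each: 100 fits total
param_dist = {"max_depth": randint(2, 20), "n_estimators": randint(50, 500)}
search = RandomizedSearchCV(RandomForestClassifier(random_state=0), param_dist,
                            n_iter=20, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)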

3. Bayesian Optimization

Builds a probabilistic model of the objective function and uses it to choose the most promising hyperparameters.

Pros:

  • Smart exploration
  • Fewer evaluations needed

Cons:

  • More complex to implement

4. Modern Tools

  • Optuna: Dynamic, pruning-based optimization with user-friendly APIs
  • Hyperopt: Tree-structured Parzen Estimators
  • Ray Tune: Distributed hyperparameter tuning
  • Scikit-Optimize: Simple integration with Scikit-learn models

These frameworks improve search efficiency and integrate well with existing ML pipelines.
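
As one example, here is a hedged sketch of the Optuna workflow (its default TPE sampler is a Bayesian-style method; the search space below is an arbitrary choice):

import optuna
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

def objective(trial):
    # Optuna proposes values informed by the results of past trials
    params = {
        "max_depth": trial.suggest_int("max_depth", 2, 20),
        "n_estimators": trial.suggest_int("n_estimators", 50, 500),
    }
    model = RandomForestClassifier(**params, random_state=0)
    return cross_val_score(model, X, y, cv=5).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_params)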


Imbalanced Data

What is Imbalanced Data?

In binary or multi-class classification, data is imbalanced when some classes are underrepresented.

Example: In fraud detection, 99.9% of transactions are legitimate and only 0.1% are fraud.

Why Accuracy Fails

In imbalanced settings, a model can achieve 99.9% accuracy on the fraud example above simply by always predicting the majority class, yet provide no insight into minority-class performance.
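
A short sketch of this accuracy paradox, with numbers chosen to mirror the fraud example:

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 1,000 transactions, 10 fraudulent; the "model" always predicts legitimate (0)
y_true = np.array([0] * 990 + [1] * 10)
y_pred = np.zeros(1000, dtype=int)

print(accuracy_score(y_true, y_pred))  # 0.99: looks excellent
print(recall_score(y_true, y_pred))    # 0.0: catches zero fraud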


Evaluation Metrics for Imbalanced Data

  • Precision: How many predicted positives are true?
  • Recall: How many actual positives are captured?
  • F1-Score: Harmonic mean of precision and recall
  • AUC-ROC: Area under the Receiver Operating Characteristic curve
  • AUC-PR: Area under Precision-Recall curve (better for rare events)

Confusion matrices help visualize class-wise performance.
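
A minimal sketch computing these metrics with scikit-learn (the dataset and model are placeholders):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, classification_report,
                             confusion_matrix, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred))    # precision, recall, F1 per class
print(confusion_matrix(y_test, y_pred))         # class-wise breakdown
print(roc_auc_score(y_test, y_prob))            # AUC-ROC
print(average_precision_score(y_test, y_prob))  # AUC-PR, stricter for rare events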


Handling Imbalanced Data

1. Resampling Techniques

a. Undersampling: Remove instances from the majority class

  • Risk: May discard valuable information

b. Oversampling: Duplicate instances from the minority class

  • Risk: Can lead to overfitting
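
Both techniques are available in the imbalanced-learn library; a minimal sketch, assuming that package is installed:

from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
print(Counter(y))  # roughly 900 majority vs. 100 minority

# Undersampling: drop majority examples until the classes match
X_u, y_u = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(Counter(y_u))

# Oversampling: duplicate minority examples until the classes match
X_o, y_o = RandomOverSampler(random_state=0).fit_resample(X, y)
print(Counter(y_o))

Note that resampling should be applied to the training split only, never to the test set, so evaluation still reflects the real class distribution.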

2. SMOTE: Synthetic Minority Over-sampling Technique

SMOTE creates synthetic examples of the minority class by interpolating between existing instances.

[Image: SMOTE generating synthetic minority samples by interpolation]

How SMOTE Works:

  1. For a given minority class sample, find k nearest neighbors.
  2. Select one neighbor at random.
  3. Generate a new sample by interpolating between the two.
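
A minimal sketch using imbalanced-learn's SMOTE, which implements exactly this neighbor-interpolation procedure (k_neighbors=5 is its default):

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)

# k_neighbors controls how many minority neighbors are interpolation candidates
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y), Counter(y_res))  # minority class brought up to parity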

Advantages:

  • Generates new, diverse minority samples
  • Less prone to overfitting than simple duplication

Limitations:

  • Can introduce unrealistic synthetic data
  • Risk of adding noise or overlap between classes
  • Assumes features are continuous and linearly interpolable

SMOTE Variants

  • Borderline-SMOTE: Focuses on examples near decision boundary
  • SMOTE-Tomek Links: Combines SMOTE with data cleaning by removing overlapping examples
  • ADASYN: Adaptively generates samples where the model struggles most
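
All three variants ship with imbalanced-learn; a brief sketch with default settings (illustrative only):

from imblearn.combine import SMOTETomek
from imblearn.over_sampling import ADASYN, BorderlineSMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)

X_b, y_b = BorderlineSMOTE(random_state=0).fit_resample(X, y)  # boundary-focused
X_a, y_a = ADASYN(random_state=0).fit_resample(X, y)           # density-adaptive
X_t, y_t = SMOTETomek(random_state=0).fit_resample(X, y)       # SMOTE + Tomek cleanup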


Putting It All Together

Building a robust machine learning model is a multi-step journey. Here’s how the discussed components interlink:

[Diagram: workflow linking cross-validation, hyperparameter tuning, and imbalance handling]
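
In code form, here is a hedged end-to-end sketch (all estimator and parameter choices are assumptions): SMOTE wrapped in an imbalanced-learn Pipeline so resampling touches only the training folds, stratified cross-validation, and a small grid search scored on F1:

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)

# imblearn's Pipeline applies SMOTE only when fitting on training folds,
# leaving each validation fold untouched (no leakage of synthetic samples)
pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),
    ("clf", RandomForestClassifier(random_state=0)),
])

param_grid = {"clf__max_depth": [5, 10], "clf__n_estimators": [100, 200]}
search = GridSearchCV(pipe, param_grid, scoring="f1",
                      cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
search.fit(X, y)
print(search.best_params_, search.best_score_)

Placing the sampler inside the pipeline is the key design choice: it keeps synthetic samples out of the folds used for scoring, so the cross-validated F1 remains an honest estimate.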

Final Thoughts

Effective machine learning isn't just about applying algorithms; it's about understanding the subtle trade-offs that impact model performance. Whether you're balancing bias and variance, tuning hyperparameters, or addressing imbalanced data, each decision influences how your model generalizes to the real world.

By mastering these techniques, data scientists can ensure that their models are not only accurate but also fair, robust, and interpretable.
