How to Fix Overfitting, Underfitting, and Imbalanced Data in Machine Learning

Machine learning models are designed to learn patterns from data and make accurate predictions on unseen inputs. In practice, however, models often fall short of their potential due to challenges such as underfitting, overfitting, high variance, poor hyperparameter choices, and imbalanced data distributions. This article offers a deep dive into these topics, explaining each challenge and presenting techniques to address it effectively.


Underfitting vs. Overfitting

[Image: underfitting vs. overfitting (source: GeeksforGeeks)]

Underfitting

Underfitting occurs when a model is too simplistic to capture the underlying structure of the data, leading to high error on both the training and test sets.

Symptoms:

  • Low training accuracy
  • Poor test performance
  • Model cannot learn even simple patterns

Causes:

  • Model is too simple
  • Too much regularization
  • Inadequate features or training time
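
To make this concrete, here is a minimal scikit-learn sketch (synthetic data; the model choices are illustrative assumptions, not a prescription). A plain linear model underfits a quadratic relationship and scores poorly on both splits; adding polynomial features restores the missing capacity:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Quadratic ground truth that a straight line cannot capture
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.5, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Underfit: low R^2 on BOTH the training and test sets
linear = LinearRegression().fit(X_train, y_train)
print(linear.score(X_train, y_train), linear.score(X_test, y_test))

# More capacity (degree-2 features) lifts both scores
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly.fit(X_train, y_train)
print(poly.score(X_train, y_train), poly.score(X_test, y_test))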

Overfitting

Overfitting arises when a model learns not only the signal but also the noise in the training data. While it performs well on the training set, it fails to generalize.

Symptoms:

  • Very high training accuracy
  • Low test accuracy
  • Large gap between training and test accuracy, typically because the model is overly complex relative to the dataset size
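
A short sketch of how these symptoms surface in practice (synthetic data; tree settings are illustrative): an unconstrained decision tree memorizes a small, noisy training set, and the train-test gap exposes the overfitting:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small dataset with 10% label noise baked in
X, y = make_classification(n_samples=300, n_features=20, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unconstrained tree memorizes the training set, noise included
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(tree.score(X_train, y_train), tree.score(X_test, y_test))  # ~1.0 vs. much lower

# Constraining depth narrows the gap
pruned = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(pruned.score(X_train, y_train), pruned.score(X_test, y_test))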


Bias vs. Variance Trade-off

Understanding underfitting and overfitting requires an exploration of the bias-variance trade-off.

[Image: bias-variance trade-off (source: Towards Data Science)]

Bias

Bias is the error due to oversimplifying the model. It prevents the model from capturing the data's true patterns.

  • High Bias → Underfitting
  • Example: Predicting house prices with just the number of rooms

Variance

Variance is the model's sensitivity to fluctuations in training data. High variance indicates the model is too reactive to the specifics of the training set.

  • High Variance → Overfitting
  • Example: A deep neural network trained on small data with no regularization

The Bias-Variance Decomposition

For mean squared error in regression problems:

E[(y - f̂(x))²] = (Bias[f̂(x)])² + Var[f̂(x)] + σ²

where Bias[f̂(x)] = E[f̂(x)] - f(x) measures systematic error, Var[f̂(x)] measures sensitivity to the particular training sample, and σ² is the irreducible noise in the data.
Note: This decomposition is specific to squared error loss. Other loss functions (e.g., log loss in classification) have different error decompositions.

The goal is to find the optimal complexity where both bias and variance are minimized.


Nuanced View of Model Complexity

While simple models often underfit and complex models overfit, the relationship isn't always straightforward. Even simple models can overfit if the dataset is noisy or if the features are irrelevant. Likewise, complex models can generalize well when regularized appropriately or trained with sufficient data.

Controlling Complexity

  • Regularization: Penalize large weights (L1, L2)
  • Pruning: Remove unnecessary branches in decision trees
  • Early Stopping: Halt training when performance on validation set worsens
  • Feature Selection: Reduce noise and redundancy
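
As a rough illustration of the first and third items, here is a minimal scikit-learn sketch (parameter values are assumptions chosen for demonstration):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# L2 regularization: smaller C means a stronger penalty on large weights
l2_model = LogisticRegression(penalty="l2", C=0.1).fit(X, y)

# Early stopping: hold out 10% of the training data and halt once the
# validation score stops improving for 5 consecutive epochs
sgd = SGDClassifier(early_stopping=True, validation_fraction=0.1,
                    n_iter_no_change=5, random_state=0).fit(X, y)
print(sgd.n_iter_)  # epochs actually run before stopping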


Cross-Validation

Cross-validation is a technique used to assess a model's generalization ability and guide decisions like hyperparameter tuning.

[Image: k-fold cross-validation (source: Towards Data Science)]

k-Fold Cross-Validation

  • The dataset is split into k equal parts (folds)
  • The model trains on k-1 folds and validates on the remaining fold
  • The process repeats k times and the average score is taken

Stratified k-Fold

Preserves class distribution across folds — particularly useful for classification tasks, especially with imbalanced datasets.
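
A minimal sketch contrasting the two splitters on an imbalanced toy problem (the 90/10 split and the model are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

# Roughly 90/10 class split
X, y = make_classification(n_samples=500, weights=[0.9], random_state=0)
model = LogisticRegression(max_iter=1000)

# Plain k-fold: individual folds may drift from the 90/10 ratio
print(cross_val_score(model, X, y,
                      cv=KFold(n_splits=5, shuffle=True, random_state=0)).mean())

# Stratified k-fold: every fold keeps the class distribution
print(cross_val_score(model, X, y,
                      cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0)).mean())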

Other Methods

  • Leave-One-Out (LOO): Extreme version of k-fold where k = number of samples
  • Time Series Split: Preserves temporal order for time-dependent data

Cross-validation helps reduce overfitting by validating model performance on different slices of the data.


Hyperparameter Tuning

Hyperparameters are external configurations that control the model's learning process, such as tree depth, learning rate, or number of clusters. Unlike model parameters, they are not learned during training and must be tuned manually or algorithmically.

1. Grid Search Cross-Validation

Grid Search evaluates all possible combinations from a specified parameter grid using cross-validation.

Pros:

  • Exhaustive
  • Simple to implement

Cons:

  • Computationally expensive
  • Inefficient when the grid is large
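
A minimal sketch with scikit-learn's GridSearchCV (the grid itself is an arbitrary example):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

# 3 x 3 = 9 combinations, each evaluated with 5-fold CV: 45 model fits
param_grid = {"max_depth": [3, 5, 10], "n_estimators": [50, 100, 200]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)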

2. Random Search

Randomly samples parameter combinations and evaluates them.

Pros:

  • Often more efficient than Grid Search
  • Can find good combinations faster

Cons:

  • May miss optimal points if not enough iterations
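
The same search expressed with RandomizedSearchCV, drawing a fixed budget of samples from distributions instead of enumerating a grid (the distributions below are illustrative):

from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)

# 20 random draws from the distributions, 5-fold CV each: 100 fits total
param_dist = {"max_depth": randint(2, 20), "n_estimators": randint(50, 500)}
search = RandomizedSearchCV(RandomForestClassifier(random_state=0), param_dist,
                            n_iter=20, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)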

3. Bayesian Optimization

Builds a probabilistic model of the objective function and uses it to choose the most promising hyperparameters.

Pros:

  • Smart exploration
  • Fewer evaluations needed

Cons:

  • More complex to implement

4. Modern Tools

  • Optuna: Dynamic, pruning-based optimization with user-friendly APIs
  • Hyperopt: Tree-structured Parzen Estimators
  • Ray Tune: Distributed hyperparameter tuning
  • Scikit-Optimize: Simple integration with Scikit-learn models

These frameworks improve search efficiency and integrate well with existing ML pipelines.
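
As one example, here is a hedged sketch of the Optuna workflow (its default TPE sampler is a Bayesian-style method; the search space below is an arbitrary choice):

import optuna
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

def objective(trial):
    # Optuna proposes values informed by the results of past trials
    params = {
        "max_depth": trial.suggest_int("max_depth", 2, 20),
        "n_estimators": trial.suggest_int("n_estimators", 50, 500),
    }
    model = RandomForestClassifier(**params, random_state=0)
    return cross_val_score(model, X, y, cv=5).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_params)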


Imbalanced Data

What is Imbalanced Data?

In binary or multi-class classification, data is imbalanced when some classes are underrepresented.

Example: In fraud detection, 99.9% of transactions are legitimate and only 0.1% are fraud.

Why Accuracy Fails

In imbalanced settings, a model can achieve 99.9% accuracy on the fraud example above simply by always predicting the majority class, yet provide no insight into minority-class performance.
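
A short sketch of this accuracy paradox, with numbers chosen to mirror the fraud example:

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 1,000 transactions, 10 fraudulent; the "model" always predicts legitimate (0)
y_true = np.array([0] * 990 + [1] * 10)
y_pred = np.zeros(1000, dtype=int)

print(accuracy_score(y_true, y_pred))  # 0.99: looks excellent
print(recall_score(y_true, y_pred))    # 0.0: catches zero fraud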


Evaluation Metrics for Imbalanced Data

  • Precision: How many predicted positives are true?
  • Recall: How many actual positives are captured?
  • F1-Score: Harmonic mean of precision and recall
  • AUC-ROC: Area under the Receiver Operating Characteristic curve
  • AUC-PR: Area under Precision-Recall curve (better for rare events)

Confusion matrices help visualize class-wise performance.
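
A minimal sketch computing these metrics with scikit-learn (the dataset and model are placeholders):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, classification_report,
                             confusion_matrix, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred))    # precision, recall, F1 per class
print(confusion_matrix(y_test, y_pred))         # class-wise breakdown
print(roc_auc_score(y_test, y_prob))            # AUC-ROC
print(average_precision_score(y_test, y_prob))  # AUC-PR, stricter for rare events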


Handling Imbalanced Data

1. Resampling Techniques

a. Undersampling: Remove instances from the majority class

  • Risk: May discard valuable information

b. Oversampling: Duplicate instances from the minority class

  • Risk: Can lead to overfitting
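
Both techniques are available in the imbalanced-learn library; a minimal sketch, assuming that package is installed:

from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
print(Counter(y))  # roughly 900 majority vs. 100 minority

# Undersampling: drop majority examples until the classes match
X_u, y_u = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(Counter(y_u))

# Oversampling: duplicate minority examples until the classes match
X_o, y_o = RandomOverSampler(random_state=0).fit_resample(X, y)
print(Counter(y_o))

Note that resampling should be applied to the training split only, never to the test set, so evaluation still reflects the real class distribution.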

2. SMOTE: Synthetic Minority Over-sampling Technique

SMOTE creates synthetic examples of the minority class by interpolating between existing instances.

[Image: SMOTE generating synthetic minority samples by interpolation]

How SMOTE Works:

  1. For a given minority class sample, find k nearest neighbors.
  2. Select one neighbor at random.
  3. Generate a new sample by interpolating between the two.
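
A minimal sketch using imbalanced-learn's SMOTE, which implements exactly this neighbor-interpolation procedure (k_neighbors=5 is its default):

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)

# k_neighbors controls how many minority neighbors are interpolation candidates
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y), Counter(y_res))  # minority class brought up to parity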

Advantages:

  • Generates new, diverse minority samples
  • Less prone to overfitting than simple duplication

Limitations:

  • Can introduce unrealistic synthetic data
  • Risk of adding noise or overlap between classes
  • Assumes features are continuous and linearly interpolable

SMOTE Variants

  • Borderline-SMOTE: Focuses on examples near decision boundary
  • SMOTE-Tomek Links: Combines SMOTE with data cleaning by removing overlapping examples
  • ADASYN: Adaptively generates samples where the model struggles most
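
All three variants ship with imbalanced-learn; a brief sketch with default settings (illustrative only):

from imblearn.combine import SMOTETomek
from imblearn.over_sampling import ADASYN, BorderlineSMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)

X_b, y_b = BorderlineSMOTE(random_state=0).fit_resample(X, y)  # boundary-focused
X_a, y_a = ADASYN(random_state=0).fit_resample(X, y)           # density-adaptive
X_t, y_t = SMOTETomek(random_state=0).fit_resample(X, y)       # SMOTE + Tomek cleanup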


Putting It All Together

Building a robust machine learning model is a multi-step journey. Here’s how the discussed components interlink:

[Diagram: workflow linking cross-validation, hyperparameter tuning, and imbalance handling]
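
In code form, here is a hedged end-to-end sketch (all estimator and parameter choices are assumptions): SMOTE wrapped in an imbalanced-learn Pipeline so resampling touches only the training folds, stratified cross-validation, and a small grid search scored on F1:

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)

# imblearn's Pipeline applies SMOTE only when fitting on training folds,
# leaving each validation fold untouched (no leakage of synthetic samples)
pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),
    ("clf", RandomForestClassifier(random_state=0)),
])

param_grid = {"clf__max_depth": [5, 10], "clf__n_estimators": [100, 200]}
search = GridSearchCV(pipe, param_grid, scoring="f1",
                      cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
search.fit(X, y)
print(search.best_params_, search.best_score_)

Placing the sampler inside the pipeline is the key design choice: it keeps synthetic samples out of the folds used for scoring, so the cross-validated F1 remains an honest estimate.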

Final Thoughts

Effective machine learning isn't just about applying algorithms; it's about understanding the subtle trade-offs that impact model performance. Whether you're balancing bias and variance, tuning hyperparameters, or addressing imbalanced data, each decision influences how your model generalizes to the real world.

By mastering these techniques, data scientists can ensure that their models are not only accurate but also fair, robust, and interpretable.
