From Raw Data to Robust Models: A Semester in Practical Data Science
1. Data Cleaning: From Disorder to Dataset
Every successful project starts with clean data. That principle became clear as we began preprocessing a global air pollution dataset containing pollutant concentrations (e.g., PM2.5, CO, NO₂, O₃), country and city data, and corresponding AQI values. Key cleaning steps included:
- Handling missing or inaccurate sensor values, especially in pollutant readings, which are prone to gaps
- Encoding AQI categories numerically, such as converting “Good” to 0 and “Hazardous” to 5, for model compatibility
- Validating the independence of features using the Variance Inflation Factor (VIF) to prevent multicollinearity, which could distort regression outcomes

These tasks, while seemingly tedious, were essential in ensuring the reliability of the downstream models and the interpretability of the results.
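To make these steps concrete, here is a minimal sketch of the cleaning pipeline in pandas. The file name and column names are hypothetical stand-ins, not the dataset's actual schema:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical file and column names; the real dataset schema may differ.
df = pd.read_csv("global_air_pollution.csv")

pollutants = ["PM2.5", "CO", "NO2", "O3"]

# Sensor gaps: drop rows with missing pollutant readings.
df = df.dropna(subset=pollutants)

# Encode the ordinal AQI categories numerically ("Good" -> 0 ... "Hazardous" -> 5).
aqi_order = ["Good", "Moderate", "Unhealthy for Sensitive Groups",
             "Unhealthy", "Very Unhealthy", "Hazardous"]
df["AQI_Category_Code"] = df["AQI_Category"].map(
    {cat: code for code, cat in enumerate(aqi_order)})

# Multicollinearity check: one VIF per pollutant feature.
X = df[pollutants].astype(float)
vif = pd.Series([variance_inflation_factor(X.values, i)
                 for i in range(X.shape[1])], index=pollutants)
print(vif)
```

As a common rule of thumb, a VIF well above 5–10 suggests a feature is largely explained by the others and may need to be dropped or combined.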
2. Feature Engineering: Designing for Signal, Not Noise
With a cleaned dataset, the next challenge was identifying which features to keep and which to set aside. PM2.5 emerged as the most powerful predictor, with a correlation coefficient of 0.98 with overall AQI values. This made it both an asset and a liability. Its dominance raised concerns about overfitting, especially in regions where PM2.5 data may not always be available or accurate. To explore this, we trained parallel models — one with PM2.5 included and one without — allowing us to observe the trade-offs firsthand. We also used tools such as:
- Correlation matrices to assess inter-feature relationships
- Mutual information scores to rank the predictive power of features
- Recursive feature elimination to test how model performance changed with streamlined inputs

This step revealed how much engineering goes into making data usable — not just feeding it in.
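A short sketch of those three tools with scikit-learn, assuming the cleaned frame `df` from the previous step and a hypothetical continuous "AQI" target column:

```python
import pandas as pd
from sklearn.feature_selection import RFE, mutual_info_regression
from sklearn.linear_model import LinearRegression

# Assumed names: pollutant feature columns plus a continuous "AQI" target.
features = ["PM2.5", "CO", "NO2", "O3"]
X, y = df[features], df["AQI"]

# 1. Inter-feature relationships via a Pearson correlation matrix.
print(X.corr())

# 2. Mutual information scores to rank each feature against the target.
mi = pd.Series(mutual_info_regression(X, y), index=features)
print(mi.sort_values(ascending=False))

# 3. Recursive feature elimination down to the two strongest predictors.
rfe = RFE(LinearRegression(), n_features_to_select=2).fit(X, y)
print(dict(zip(features, rfe.support_)))
```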
3. Model Development: Exploring Regression and Classification
Our predictive goals fell into two main categories: predicting continuous AQI values and classifying AQI into categories. This naturally led us to experiment with both regression and classification models.

We began with linear regression, which performed remarkably well (R² = 0.974). Its interpretability helped us understand pollutant contributions clearly, but it also reinforced concerns about over-reliance on PM2.5: the model’s performance fell dramatically when this feature was removed.

Next, we explored the Random Forest Classifier, which offered increased robustness, particularly in handling noisy data and outliers. The model achieved 94% validation accuracy and helped visualize feature importance. However, it underperformed on less common classes like “Hazardous” or “Very Unhealthy,” which hinted at an imbalance in the dataset.

To counteract overfitting and improve generalization, we used a Gradient Boosting Classifier (GBC) and intentionally removed PM2.5 from its training set. While this reduced the overall accuracy to 64.88%, the model became more adaptable and fair across diverse scenarios, especially where dominant features were unavailable.
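A condensed sketch of the three experiments might look like the following. The split variables and column names are illustrative assumptions, not our exact code:

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical names: `X` holds the pollutant features, `y_reg` the continuous
# AQI values, and `y_cls` the encoded AQI categories.
X_tr, X_te, yr_tr, yr_te, yc_tr, yc_te = train_test_split(
    X, y_reg, y_cls, test_size=0.2, random_state=42)

# Linear regression, with and without the dominant PM2.5 feature.
for cols in [list(X.columns), [c for c in X.columns if c != "PM2.5"]]:
    lr = LinearRegression().fit(X_tr[cols], yr_tr)
    print(cols, "R^2:", round(r2_score(yr_te, lr.predict(X_te[cols])), 3))

# Random Forest on all features.
rf = RandomForestClassifier(random_state=42).fit(X_tr, yc_tr)
print("RF accuracy:", accuracy_score(yc_te, rf.predict(X_te)))

# Gradient Boosting trained deliberately without PM2.5.
no_pm = [c for c in X.columns if c != "PM2.5"]
gbc = GradientBoostingClassifier(random_state=42).fit(X_tr[no_pm], yc_tr)
print("GBC (no PM2.5) accuracy:", accuracy_score(yc_te, gbc.predict(X_te[no_pm])))
```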
4. Class Imbalance: A Hidden Obstacle
As our modeling matured, we discovered that poor performance on minority AQI categories wasn’t a model architecture issue — it was a data distribution issue. Only 0.81% of our dataset fell into the “Hazardous” category, making it nearly invisible to the model during training. To address this, we applied SMOTE (Synthetic Minority Oversampling Technique) to artificially balance the training data.

After implementation, we saw an impressive jump in recall for rare classes. For example, the “Hazardous” recall improved from 4% to 38%. However, the improvement came at a cost:
- Overall model accuracy declined from 64% to 56%
- “Good” and “Moderate” categories were misclassified more frequently
- The model began to over-predict rare classes in situations where they didn’t apply

This reinforced an important truth: modeling isn’t just about finding patterns — it’s about understanding the ethics and consequences of misclassification.
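In code, SMOTE is a single resampling call from the imbalanced-learn package. Crucially, it is applied to the training split only, so the test set keeps the true (imbalanced) class distribution and the evaluation stays honest. Variable names below carry over from the earlier hypothetical sketch:

```python
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report

# Synthesize minority-class samples in the training split only.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_tr[no_pm], yc_tr)

gbc = GradientBoostingClassifier(random_state=42).fit(X_res, y_res)

# classification_report shows per-class precision/recall, which is where the
# effect on rare categories like "Hazardous" becomes visible.
print(classification_report(yc_te, gbc.predict(X_te[no_pm])))
```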
5. Model Optimization: Tuning for Real-World Application
To refine our models further, we conducted hyperparameter tuning using GridSearchCV with five-fold cross-validation. For the GBC, we tested various combinations of max depth, minimum samples per split, and number of estimators. The result was a marginal improvement in validation metrics, and an important reminder that tuning cannot overcome foundational data issues like class imbalance. Ultimately, the tuning process taught us how to balance model complexity with real-world usability. An overtuned model may perform better on validation sets but fail when introduced to new, diverse environments.
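A sketch of that search over the three hyperparameters named above; the grid values here are illustrative assumptions, not the project's actual grid:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid; the exact values we searched are not reproduced here.
param_grid = {
    "max_depth": [2, 3, 5],
    "min_samples_split": [2, 5, 10],
    "n_estimators": [100, 200, 400],
}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    cv=5,                 # five-fold cross-validation
    scoring="accuracy",
    n_jobs=-1,
)
search.fit(X_res, y_res)
print(search.best_params_, round(search.best_score_, 3))
```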
Conclusion: Building More Than a Model
This project wasn’t just a final exam — it was a case study in real-world data science. I walked away with practical experience in transforming raw data into insight, testing modeling assumptions, and evaluating trade-offs with clarity. More importantly, I now recognize that effective models are not only accurate but robust, interpretable, and adaptable. The air quality dataset served as the backdrop, but the technical journey — cleaning, engineering, modeling, evaluating, and tuning — was the real story. That’s the lesson I’ll carry forward into every future project.