Exploring Good Old Fashioned AI (Part 5): Refining Supervised Classifiers for Better Performance
Credit: Image generated with ChatGPT (DALL·E) in the style of New Yorker Cartoons, inspired by a particular superhero inventor's lab

If you missed the other parts of this learning series about eCornell's "Designing and Building AI Solutions" Program, you can find all articles here in this newsletter.

Cornell University's eCornell "Designing and Building AI Solutions" certificate program (instructed by Lutz Finger) has guided my journey from building supervised classifiers to refining them carefully so that they balance accuracy with generalizability. As a recap, my previous articles in this series explored building supervised classifiers with logistic regression and decision trees.

As I continue through the eCornell program's second module, "Exploring Good Old-Fashioned AI", I turn to one of its key lessons: the critical task, and systematic approach, of refining AI models to improve their real-world performance while addressing bias and fairness concerns.

Over the past years, researchers and AI practitioners have paid close attention to how AI models influence outcomes in critical sectors such as healthcare, finance, criminal justice, and human resources. Supervised classifiers and neural network models like large language models (LLMs) have shown potential for both benefit and harm in these high-stakes applications. Most AI models trained on large datasets are especially susceptible to bias, whether through labeled examples in supervised learning or through representational imbalances in unsupervised approaches. Supervised classifiers, which learn from labeled examples, can inherit and perpetuate biases present in historical data. Large language models, trained on vast corpora of text, risk encoding and amplifying stereotypes, exclusionary norms, and representational harms through their outputs.

Understanding the Bias-Variance Trade-off

Under Lutz Finger's instruction in the eCornell program, I learned that machine learning models face a fundamental challenge: finding the 'sweet spot' between model simplicity and complexity. Models that are too simple exhibit high bias; they make overly broad assumptions that miss important patterns in the data. Conversely, overly complex models show high variance; they capture noise as if it were meaningful patterns.
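To make this trade-off concrete, here is a minimal sketch (not from the program materials) that sweeps a decision tree's depth on a synthetic dataset: shallow trees underfit while very deep trees memorize, and the cross-validated score peaks somewhere in between.

```python
# Sketch: visualizing the bias-variance trade-off via a validation curve.
# Assumes scikit-learn; the synthetic dataset stands in for any labeled data.
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

depths = list(range(1, 21))
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=42), X, y,
    param_name="max_depth", param_range=depths, cv=5, scoring="roc_auc",
)

for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    # Shallow trees underfit (high bias); very deep trees overfit (high variance).
    print(f"max_depth={d:2d}  train AUC={tr:.3f}  cv AUC={va:.3f}")
```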

The consequences of poor model fit extend far beyond academic dialogue. Historical examples demonstrate the real-world impact of this balance. The 2007-2008 financial crisis partly resulted from overfitted models that failed to account for nationwide housing price collapses. Healthcare predictive models that are too simplistic might fail to identify at-risk patients, potentially endangering lives. More real-world AI-related incidents can be found on the Organization for Economic Co-operation and Development (OECD)’s AI Incidents and Hazards Monitor.

Practical Implementation Experiences

The eCornell program exercises transform these theoretical concepts into tangible skills through practice in systematic model refinement. Working with financial services datasets revealed fundamental challenges in machine learning deployment. My initial models showed promising training performance but struggled with generalization, demonstrating classic overfitting behavior. This hands-on experience taught me that high training accuracy often signals memorization rather than genuine learning.

Feature engineering and correlation analysis yielded crucial insights about model complexity. The program introduced a financial services-related dataset that includes economic and financial data as potential features for our training dataset. My analysis revealed severe multicollinearity issues among these economic features, with variance inflation factor (VIF) values indicating concerning levels of correlation. This discovery reinforced that adding features without careful evaluation can harm model performance. Through systematic experimentation, I learned to balance feature ‘richness’ with statistical independence.
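As a minimal sketch of the kind of multicollinearity check described here (assuming pandas and statsmodels; the column names in the usage comment are placeholders, not the program's actual features):

```python
# Sketch: variance inflation factor (VIF) check for feature redundancy.
# Assumes a pandas DataFrame 'df' of numeric candidate features.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(df: pd.DataFrame) -> pd.DataFrame:
    """Return the VIF per feature; values above roughly 5-10 suggest redundancy."""
    X = sm.add_constant(df)  # add an intercept so the VIFs are interpretable
    rows = [
        {"feature": col, "vif": variance_inflation_factor(X.values, i)}
        for i, col in enumerate(X.columns)
        if col != "const"
    ]
    return pd.DataFrame(rows).sort_values("vif", ascending=False)

# Usage with placeholder macro-economic columns:
# print(vif_table(df[["emp_var_rate", "cons_price_idx", "euribor_3m"]]))
```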

Systematic refinement delivered substantial performance improvements while revealing surprising business insights. My decision tree classifier showed significant improvement through methodical optimization, as measured by AUC-ROC scores. Feature importance analysis revealed that contact-related variables emerged as dominant predictors, challenging initial assumptions about success factors. The refined model achieved better precision-recall balance, demonstrating practical trade-offs for business deployment. These metrics illustrated how proper model refinement can transform baseline classifiers into production-ready systems.

Model evaluation taught me the nuanced nature of performance metrics in real-world applications. The program exercises demonstrated how AUC-ROC provides a comprehensive view of classifier performance across different thresholds. I learned to interpret precision-recall trade-offs in business contexts, understanding when to prioritize minimizing false positives versus capturing all positive cases. Feature importance measurements revealed which variables truly drive predictions, enabling more informed decisions about data collection and model complexity.
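For reference, here is a hedged sketch of how these evaluations are typically done with scikit-learn; the synthetic dataset, model, and the 70% recall target are illustrative rather than the program's exact setup.

```python
# Sketch: AUC-ROC plus a precision-recall threshold sweep on a synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score, precision_recall_curve

X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]                  # positive-class scores

print("AUC-ROC:", round(roc_auc_score(y_te, proba), 3))  # threshold-independent view

# Pick an operating threshold that meets a business recall target (e.g. 70%).
precision, recall, thresholds = precision_recall_curve(y_te, proba)
candidates = [(t, p, r) for p, r, t in zip(precision, recall, thresholds) if r >= 0.70]
if candidates:
    t, p, r = candidates[-1]   # highest threshold that still meets the recall target
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```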

Outliers can significantly impact model performance, but not all outliers are created equal. Through my analysis, I identified potential outliers in the dataset, particularly in features related to customer interactions where some instances showed extreme values far beyond typical ranges.

Our favorite box-plots and Interquartile Range (IQR) tools provide a systematic approach to outlier detection. By calculating the difference between the third and first quartiles, I could identify data points falling beyond reasonable bounds. However, the decision to remove outliers requires careful consideration of their business significance.
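A minimal sketch of that IQR rule, assuming pandas (the 1.5 multiplier is the conventional default and the sample values are illustrative):

```python
# Sketch: flagging potential outliers with the interquartile range (IQR) rule.
import pandas as pd

def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """True where a value falls outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Illustrative feature values; the last one is an extreme observation.
s = pd.Series([30, 45, 60, 52, 48, 55, 41, 900])
print(iqr_outlier_mask(s))
# Whether to drop flagged rows is a business decision, not an automatic step.
```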

My experiments with outlier removal yielded interesting results. After removing extreme data points identified through IQR analysis, the logistic regression model showed minimal performance change, while the decision tree model remained relatively stable. This suggested that decision trees are more robust to outliers than linear models, an important consideration when selecting the appropriate model architecture to address business and societal objectives in real-world applications.

These exercises taught the fundamental lesson that model refinement requires balancing multiple competing objectives. Success means optimizing not just for accuracy, but for generalizability, interpretability, and business relevance. The program's methodology of iterative improvement mirrors real-world data science workflows, where patient, systematic refinement consistently outperforms hasty algorithm switching.

Key Learning about AI Model Refinement Strategies

Improving underperforming models requires a systematic approach with multiple strategies. Through the eCornell program, I learned several techniques for addressing model performance issues.

  • Adding more data helps, but only to a certain extent. The concept of diminishing returns applies strongly to machine learning. My own experience from the program exercises showed that beyond a certain threshold, additional data provided minimal improvement in model performance.
  • Feature engineering and selection can dramatically impact results. Working with datasets that were enhanced from their baseline version, I evaluated additional economic data in our financial services datasets. Multicollinearity analysis revealed strong correlations among these features, with Variance Inflation Factors (VIF) indicating severe feature redundancy.
  • Model complexity must match data complexity. Through recursive feature elimination, I identified key features that balanced predictive power with model simplicity (a minimal sketch follows this list). The refined model showed improved performance metrics while maintaining better generalizability.
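To make the feature-selection point above concrete, here is a hedged sketch of recursive feature elimination with cross-validation in scikit-learn; the estimator, dataset, and feature counts are illustrative, not the program's exact configuration.

```python
# Sketch: recursive feature elimination with cross-validation (RFECV).
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=3000, n_features=25, n_informative=8, random_state=1)

selector = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=1,                      # drop one feature per elimination round
    cv=StratifiedKFold(5),
    scoring="roc_auc",
)
selector.fit(X, y)

print("Optimal number of features:", selector.n_features_)
print("Selected feature mask:", selector.support_)
```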

Recent Research about AI Bias and Fairness in Supervised Classifier Models

Today, academic research identifies multiple sources and types of bias in AI systems, each with distinct implications for fairness. Data bias arises from unrepresentative or incomplete training sets, often reflecting historical inequalities. Model (or algorithmic) bias emerges from AI model/system design choices that inadvertently prioritize certain features or groups. According to the highly cited 2020 paper “Fairness in Machine Learning: A Survey,” user interaction bias can further deepen disparities as AI systems create ‘runaway feedback loops’ in real-world deployment, where biased AI ‘decisions’ influence user behavior and perpetuate further biases in subsequent data collection and model training.

Researchers categorize bias into representational and allocational harms. Representational harms involve the misrepresentation or stereotyping of groups in model outputs, while allocational harms refer to the unequal distribution of resources or opportunities due to AI-driven decisions. For example, the more recent neural network architectures like large language models (LLMs) have been shown to generate more negative sentiment toward women and underrepresent racial-specific language, leading to both subtle and overt forms of discrimination. Machine learning-related bias differs fundamentally from human bias, yet both can have serious consequences. While machine learning bias results from oversimplification of learned patterns, human bias stems from cognitive, cultural, and social factors. Both types can lead to discriminatory and potentially degraded or harmful outcomes from deployed AI systems.

A wave of systematic surveys and empirical studies has shaped the field by cataloging sources of bias, fairness definitions, and mitigation strategies, as well as by benchmarking the trade-offs between fairness and performance metrics. The paper “Bias Mitigation for Machine Learning Classifiers: A Comprehensive Survey” (2023) analyzed 341 publications, categorizing bias mitigation methods into pre-processing, in-processing, and post-processing approaches. This research is notable for its breadth, covering technical methods, datasets, metrics, and benchmarking practices. It highlights that 1) pre-processing methods (such as relabeling, reweighting, and sampling) are widely used to address bias in training data, while 2) in-processing methods (like adversarial debiasing and fairness constraints) and 3) post-processing methods (such as threshold adjustment and calibration) are increasingly adopted to ensure fairness during and after model training.

In the eCornell “Designing and Building AI Solutions" program, my own experiences with systematic model refinement reinforced insights about the intricate relationship between fairness and performance metrics. A comprehensive empirical study, "A Comprehensive Empirical Study of Bias Mitigation Methods for Machine Learning Classifiers" (2023), found that while bias mitigation methods can improve fairness in 24% to 59% of scenarios, they also decrease machine learning performance metrics in 42% to 66% of cases. However, I observed that metric changes from bias and fairness mitigation can sometimes reflect appropriate corrections to seemingly favorable numbers: a model may have overfitted to biased data, for instance, degrading its ability to generalize and make effective predictions in real-world situations. Other metric changes might result from removing features that previously contributed to statistical significance or inflated R² values; in such cases, adjusted R² is the more appropriate metric since it penalizes models for including additional features (see the sketch below). More importantly, the research shows that no single method consistently outperformed others across all tasks, highlighting the need for strategically relevant, context-specific selection of mitigation strategies.
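For readers who want the arithmetic behind that adjusted R² point, here is a minimal sketch with illustrative numbers (n observations, p predictors):

```python
# Sketch: adjusted R-squared penalizes each additional predictor.
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """n = number of observations, p = number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adjusted_r2(r2=0.62, n=500, p=5))    # modest penalty with few predictors
print(adjusted_r2(r2=0.62, n=500, p=60))   # the same fit looks worse with many predictors
```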

Foundational Academic Frameworks

Even with the growth of new research, several foundational papers from the past decade remain highly cited and continue to shape the fairness landscape in supervised learning.

Moritz Hardt, Eric Price, and Nati Srebro’s “Equality of Opportunity in Supervised Learning” (NeurIPS 2016) remains a touchstone for defining and operationalizing fairness metrics such as equalized odds and equal opportunity. These metrics have become standard in academic research and industries, particularly in evaluating group fairness and ensuring that models do not systematically disadvantage protected groups in terms of false positive or false negative rates.
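As a minimal illustration of the equal-opportunity idea (not the paper's code), the metric reduces to comparing true-positive rates across groups; the arrays below are illustrative stand-ins for real predictions and a protected attribute.

```python
# Sketch: equal opportunity = parity in true-positive rates across groups.
import numpy as np

y_true = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 0])   # ground-truth labels
y_pred = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 1])   # model predictions
group  = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])   # protected attribute (illustrative)

def true_positive_rate(y_true, y_pred, mask):
    positives = (y_true == 1) & mask
    return (y_pred[positives] == 1).mean()

tpr_gap = abs(true_positive_rate(y_true, y_pred, group == 0)
              - true_positive_rate(y_true, y_pred, group == 1))
# Equal opportunity asks for this gap to be (near) zero; equalized odds
# additionally requires parity in false-positive rates.
print("True-positive-rate gap:", tpr_gap)
```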

Zemel et al. (2013) introduced the highly cited Learning Fair Representations (LFR) approach, which has been extended in subsequent work. LFR seeks to learn latent data representations that are both predictive and fair, though it can complicate model interpretability.

Adversarial debiasing, as described by Zhang et al. (2018) and further developed in the years since, has also become a mainstay. It leverages an adversarial network or model (of either simple or complex architecture) to remove information about protected attributes from learned representations.

Causal Fairness and Explainability: Recent years have seen a surge in the use of causal models, such as Structural Causal Models (SCMs) and Causal Bayesian Networks, to diagnose and mitigate bias in supervised classifiers. These approaches allow nuanced interventions, such as modifying graph relationships (e.g., associations between variables) or adjusting conditional probability distributions, effectively a 'surgical' treatment of specific causal relationships that reduces bias while preserving the rest of the causal structure. Integrating causal understanding with fairness metrics helps ensure that mitigation strategies do not inadvertently harm a model's generalizability or performance, which makes these approaches well suited to high-stakes business and societal applications with strong causal and interpretability requirements.

Real-World Business and Societal Case Studies

The abstract concepts of bias and fairness I encountered in the Cornell program take on stark reality when examining recent corporate and societal case studies. These cases demonstrate how model refinement decisions directly impact people's lives when deployed at scale. Choices such as feature selection, proxy variables, and performance optimization all carry real-world consequences.

Healthcare applications reveal how seemingly neutral optimization choices can perpetuate life-threatening disparities. The Obermeyer et al. (2019) study of commercial risk prediction models used by major U.S. hospital systems showed systematic underestimation of Black patients' health needs by using historical healthcare spend as a proxy for medical need. This mirrors my Cornell experience with feature selection, where I learned that not all predictive variables should be included simply because they improve model performance. When the healthcare model was recalibrated to use chronic health condition counts instead of cost, the proportion of Black patients identified for extra healthcare rose dramatically, demonstrating how thoughtful feature engineering can correct structural inequities.

Medical imaging classifiers demonstrate how training data composition directly affects diagnostic accuracy across populations. Supervised machine learning classifiers for dermatological diagnosis exhibit marked performance disparities across skin tones when training data lacks diversity. Studies show these supervised models achieve significantly lower diagnostic performance on dark skin compared to light skin, raising concerns about perpetuating existing healthcare inequities where Black patients face higher melanoma mortality rates. Even medical devices like pulse oximeters, equipped with light-sensing technologies that measure oxygen saturation, exhibit measurement bias, systematically overestimating oxygen saturation in patients with darker skin tones, creating significant disparities in detecting dangerous oxygen levels during the COVID-19 pandemic.

Financial services cases illustrate how historical training data can embed and amplify existing biases. A widely publicized credit card controversy revealed how a major financial institution's model/algorithm offered women significantly lower credit limits despite higher credit scores, exemplifying the representational bias challenges discussed in current research. Similarly, documented cases at large banking institutions have shown algorithms assigning higher risk scores to Black and Latino applicants with similar financial backgrounds to their white counterparts. These cases reinforce lessons from the eCornell program exercises, where I had to carefully consider whether historical patterns in the data reflected genuine predictive relationships or embedded biases.

Criminal justice applications demonstrate the feedback loop problems that current research identifies as particularly dangerous. The COMPAS recidivism prediction tool falsely labels Black defendants as high risk at nearly twice the rate of white defendants, while predictive policing systems like PredPol create self-reinforcing cycles by directing more police resources to already over-policed neighborhoods. These examples validate the "runaway feedback loops" described in fairness research and connect directly to my program learning about the importance of validating model performance across different populations.

Human resources automation reveals how proxy discrimination can occur even without explicit protected attributes. A high-profile case involving a major technology company's recruiting tool, trained on predominantly male applicant data, systematically penalized resumes containing terms associated with women and downgraded graduates of women's colleges. This documented example perfectly illustrates the multicollinearity and feature correlation challenges I encountered in the Cornell program, where seemingly independent variables can encode protected characteristics in subtle ways.

Note: This is not a fully exhaustive list. As I mentioned earlier, you can find more documented incidents on the Organization for Economic Co-operation and Development (OECD)’s AI Incidents and Hazards Monitor.

These examples of real-world case studies reveal consistent challenges that reinforce my eCornell program learning about systematic model refinement and validation. Feedback loops, proxy variables/features, and model opacity emerge as recurring themes across healthcare, finance, criminal justice, and human resources. The regulatory responses, including the EU AI Act, U.S. legislative initiatives, and local bias audit requirements, demonstrate growing recognition that the technical skills I learned must be paired with systematic fairness evaluation.

Operationalizing Fairness: Metrics and Tools for Practical Implementation

The academic frameworks and examples of real-world case studies highlight the need for practical tools that translate fairness concepts into measurable implementation steps. My eCornell program experiences prepared me to understand how fairness metrics create similar tensions or trade-off decisions between competing objectives in responsible AI development.

Fairness metrics have evolved to capture the complexities revealed in these case studies, with the emergence of both group fairness and individual fairness approaches. Group fairness seeks parity in outcomes across protected groups, using measures such as demographic parity and equalized odds. Individual fairness, by contrast, expects that similar individuals receive similar treatment, a principle that is challenging to operationalize at scale but directly relevant to the feature correlation challenges I encountered in my telemarketing analysis.

The development and widespread adoption of fairness toolkits has been instrumental in standardizing evaluation and mitigation approaches. Tools like IBM's AI Fairness 360 (AIF360) and the Microsoft Research-originated Fairlearn provide implementations of a wide range of fairness metrics and mitigation methods, enabling both research and practical deployments across industries. These toolkits offer a systematic approach to fairness evaluation that mirrors the methodical model refinement techniques I learned through Cornell.
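To give a flavor of what these toolkits provide, here is a hedged sketch using Fairlearn's metrics module (API details can differ across versions; the labels, predictions, and sensitive feature below are illustrative):

```python
# Sketch: group-fairness metrics with Fairlearn alongside a traditional metric.
import numpy as np
from fairlearn.metrics import (
    MetricFrame,
    demographic_parity_difference,
    equalized_odds_difference,
)
from sklearn.metrics import recall_score

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])
sex    = np.array(["F", "F", "F", "F", "F", "M", "M", "M", "M", "M"])  # illustrative

print("Demographic parity difference:",
      demographic_parity_difference(y_true, y_pred, sensitive_features=sex))
print("Equalized odds difference:",
      equalized_odds_difference(y_true, y_pred, sensitive_features=sex))

# Per-group view of a traditional metric next to the fairness summaries.
frame = MetricFrame(metrics=recall_score, y_true=y_true,
                    y_pred=y_pred, sensitive_features=sex)
print(frame.by_group)
```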

The integration of these tools with traditional model development workflows represents the evolution of responsible AI practice. Just as I learned to systematically evaluate model performance through multiple metrics in the Cornell program, modern AI development requires systematic evaluation of fairness metrics alongside traditional performance measures. This parallel evaluation ensures that the optimization techniques I mastered can be applied to create systems that deliver both technical excellence and social responsibility.

Extension to Large Language Models (LLMs) and Contemporary Applications

The bias and fairness challenges identified in supervised classifiers extend critically to large language models (LLMs), which have become ubiquitous in AI applications. Research demonstrates that LLMs exhibit similar trade-offs between fairness and performance, with studies showing that LLMs can generate more negative sentiment toward women and underrepresent racial-specific language, manifesting both representational and allocational harms. The complexity of these challenges increases with LLMs due to their generative nature and broader application scope.

To illustrate how organizations are addressing these challenges in practice, I examine Singapore's approach, which exemplifies pragmatism in governance and culture, emphasizing practical and economical outcomes over theoretical ideals. Singapore's Government Technology Agency (GovTech) has addressed these concerns through its "Measuring What Matters" framework for evaluating safety risks in real-world LLM applications (2025) and its recently launched Responsible AI Benchmark, which systematically evaluates models like GPT-5 and Claude across fairness, safety, and robustness dimensions. Their findings reveal that while newer models like GPT-5 show improvements in robustness, they still lag behind competitors like Claude 4 Opus in fairness metrics, underscoring that even state-of-the-art LLMs face ongoing challenges in balancing performance with equitable outcomes. This governmental recognition and systematic evaluation approach exemplifies how organizations are operationalizing fairness assessment for LLMs, moving beyond theoretical frameworks to practical deployment considerations that directly impact millions of users in public services.

Figure: Comparative Safety Performance of Leading LLMs on Singapore GovTech's Responsible AI Benchmark (2025). Scores represent the percentage of safe responses to adversarial prompts across proprietary and open-weight/open-source models.

Key Takeaways for AI Practitioners

  • Model refinement requires navigating inherent trade-offs: My eCornell exercises in balancing the bias-variance trade-off, alongside the fairness-performance tensions documented in the research above, confirm that perfect optimization of every AI performance metric is unattainable. AI system designers and managers must explicitly define strategically acceptable, practical thresholds for accuracy, fairness, and interpretability based on their specific context, as Singapore GovTech demonstrated with its pragmatic safety benchmarks.
  • Systematic bias evaluation must parallel performance optimization: The real-world case studies from healthcare to criminal justice show that technical excellence without fairness assessment creates harmful systems. Integrating tools like AI Fairness 360 or conducting regular bias audits throughout refinement cycles prevents the embedding of discriminatory patterns that become extremely costly to fix post-deployment, or impossible to undo once irreversible incidents arise.
  • Feature selection carries ethical weight beyond statistical significance: The eCornell program exercise involving financial services data and the healthcare proxy-variable cases demonstrate that choosing predictive features shapes real-world outcomes. Each feature decision should be evaluated not just for its VIF value or contribution to the R² metric, but for whether it perpetuates historical biases or creates new forms of proxy discrimination.
  • Documentation and transparency build trust and enable iteration: Creating audit trails or transcripts of refinement decisions during my program exercises proves invaluable when models face scrutiny or require updates. This practice aligns with emerging regulatory requirements like the EU AI Act and helps organizations demonstrate responsible development practices.
  • Context determines appropriate complexity: While LLMs dominate headlines, my Cornell experience confirmed that simpler, refined AI models can outperform complex models for specific problems and tasks. In one of the exercises, through careful and systematic refinement, I achieved significant improvements in model performance (e.g., AUC-ROC) by addressing the fundamentals of the problem and the business objectives.

Conclusion

Model refinement bridges theoretical AI knowledge and practical deployment success. Through Cornell University's eCornell "Designing and Building AI Solutions" program, I discovered that the journey from initial baseline models to optimized systems represents more than statistical improvement. This journey exemplifies how systematic refinement transforms experimental models into production-ready systems that balance performance, fairness, and business value.

Although this article focuses on bias and fairness management for supervised classifiers, the program's second module, "Exploring Good Old-Fashioned AI", crystallized five fundamental principles that guide effective model refinement during our exercises. Measuring the bias-variance trade-off determines whether models truly learn or simply memorize training data. Outliers demand careful consideration as potential edge cases rather than automatic removal. Feature engineering often proves a more immediate lever for performance than added model complexity. The reality of diminishing returns from ever-larger data volumes makes efficiency and interpretability as important as accuracy. Most critically, responsible AI implementation requires balanced datasets and fairness considerations embedded from the start rather than retrofitted later.

Classical techniques converge with contemporary challenges as AI deployment scales across organizations. My exercises with these principles directly parallel the challenges Singapore GovTech navigates in their LLM safety benchmarks. Whether practitioners refine decision trees or evaluate GPT-5, they face the same fundamental question: how do we optimize for technical excellence while ensuring equitable outcomes?

Real-world applications demonstrate that refinement encompasses ethical responsibility beyond algorithmic tuning. The healthcare proxy-variable disasters and criminal justice feedback loops I studied show that each refinement decision carries societal implications. These cases validate why organizations increasingly adopt frameworks like AI Fairness 360 and implement systematic bias audits. They recognize that post-deployment corrections prove far costlier than embedded safeguards.

The evolution from GOFAI to generative AI reinforces these foundational refinement principles rather than replacing them. While LLMs dominate current discourse, my Cornell experience confirms that methodical optimization of simpler models often yields superior results for specific business problems. The ability to systematically improve performance while maintaining interpretability and fairness distinguishes mature AI practice from trend-chasing experimentation.

Cornell University's eCornell "Designing and Building AI Solutions" program develops this essential balance between innovation and responsibility. Prof. Lutz Finger's instruction and structured approach transforms abstract concepts into practical skills. The program prepares practitioners to navigate the complex landscape where technical metrics, business objectives, and ethical considerations converge.

Begin your journey toward AI fluency today: https://ecornell.cornell.edu/certificates/technology/designing-and-building-ai-solutions/

What challenges have you encountered when refining machine learning models in your organization?

How do you balance performance improvements with maintaining model interpretability and fairness? I would love to hear your experiences in the comments below.

Coming Next: Join me exploring unsupervised learning and clustering techniques, discovering how AI finds hidden patterns without labeled data. I shall examine how these methods complement supervised approaches and enable new forms of business insight.

Latest Updates: To demonstrate these implementation principles in practice and following my previous articles, I have recently integrated the best-performing XGBoost decision tree model into my TalentSol applicant tracking system (ATS) application. The TalentSol ATS architecture now features a comprehensive four-level ML data pipeline that transforms raw application data through Apache Airflow DAGs. This architecture enables TalentSol to learn from recruiter feedback and hiring outcomes through automated model retraining. I built the integration using a unified data schema and a PostgreSQL database that seamlessly connects the React/TypeScript frontend with the Node.js backend and Python ML services. The XGBoost model achieves 70% recall and 57% precision for initial candidate screening, processing applications quickly for real-time recruiter workflows. I have open-sourced the complete implementation for community collaboration: the main TalentSol application (https://github.com/youshen-lim/TalentSol---Applicant-Tracking-System-Application) and the supervised classifier for initial candidate screening (https://github.com/youshen-lim/TalentSol_Supervised-Classifier-for-Initial-Candidate-Screening-Decision-Trees). This production deployment validates Cornell's teaching that effective AI solutions prioritize appropriate AI technique selection and business-objective alignment over model complexity.


If you missed other parts of this learning series about eCornell "Designing and Building AI Solutions" Program, find all articles here: https://www.linkedin.com/newsletters/core-ai-7332966495557230592/

Program Overview: Looking Backwards, Accelerating Forward https://www.linkedin.com/pulse/looking-backwards-accelerating-forward-what-i-learned-youshen-lim-e0kpc/

"Creating Business Value with AI" Series:

"Exploring Good Old Fashioned AI" Series:


Aaron (Youshen) Lim is documenting his learning journey through Cornell University "Designing and Building AI Solutions" certificate program. Follow the CoreAI newsletter for more insights into practical AI implementation.
