FeatureSelection
February 15, 2020
1 Feature Selection
A discussion that often comes up in applied Machine Learning work is whether and how to
perform feature selection. In this post, I will consider two standard justifications offered for
doing so and evaluate whether they make sense. In many ways, this discussion centers on one of
the core tradeoffs in Supervised Learning: does increasing predictive accuracy come at the expense
of reducing interpretability?
1.1 Improve model accuracy?
The typical way the bias-variance tradeoff is introduced in textbooks and courses is in the context of
linear regression. The story goes as follows: you can drive in-sample error arbitrarily low by
increasing the number of parameters in your model. However, when you try to use the same model
to predict out of sample, your accuracy is going to be much lower. This is because the extra
parameters get tuned to the in-sample noise, and when you get data that doesn't contain the same
noise they don't work so well. The suggested remedy is to regularize your model using ridge, lasso,
or a combination of the two called elastic net. Regularization proceeds by shrinking the coefficients
of certain variables to very small values (ridge and elastic net) or to exactly zero (lasso) by imposing
a constraint on how big the L2 norm (sum of squares) or L1 norm (sum of absolute values) of the
coefficients can get.
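As a quick sketch of what this looks like in scikit-learn (synthetic data; the penalty strengths here are arbitrary illustrative choices, not tuned values):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
# 100 samples, 20 features, but only the first 3 actually matter
X = rng.normal(size=(100, 20))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)                     # L2: shrinks coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)                     # L1: sets some coefficients exactly to zero
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # mix of L1 and L2

# lasso performs implicit feature selection: most irrelevant coefficients become exactly zero,
# while ridge only shrinks them without zeroing them out
print("lasso coefficients set to zero:", int(np.sum(lasso.coef_ == 0)))
```

In this sense the lasso is itself a feature selection method, which is part of why separate selection steps are often unnecessary when you regularize.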
But is this true for non-parametric models like Random Forests as well? Although I
wasn't able to find any formal work that addresses this specific question (suggestions welcome), it's
possible that it isn't. One reason might be that when each tree is fit, only a random subset of
variables is considered at each split. The overfitting problem in the context of Random Forests comes from
growing a tree that is too deep or requiring too few samples to fall into each leaf of the tree. This
can be dealt with by ensembling many trees together so that the variance of the overall estimator
is smaller than that of any individual estimator.
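A small illustration of this variance-reduction effect on synthetic data (the number of trees and the dataset parameters are arbitrary; depth and leaf size can additionally be controlled via `max_depth` and `min_samples_leaf`):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# synthetic classification problem
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# one deep, fully grown tree vs. an ensemble of 200 such trees
single = RandomForestClassifier(n_estimators=1, random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# averaging many high-variance trees lowers the variance of the overall estimator,
# which typically shows up as better out-of-sample accuracy
print("single tree:", single.score(X_te, y_te))
print("forest:     ", forest.score(X_te, y_te))
```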
I trained a Random Forest classifier following the standard Machine Learning workflow on a
dataset from a Portuguese bank that analyzed the effect of telemarketing campaigns on whether
contacted customers subscribed to the product being marketed. I tried the model with the following
four variations:
• a base model without variable selection
• using Variance Inflation factor for variable selection
• using Hierarchical Clustering for variable selection
• using a mix of 2 and 3, where numerical variables were selected by VIF and categorical
variables by Hierarchical Clustering.
Four metrics obtained from these models are presented below.
1.2 What is Collinearity?
• In the most basic sense, a variable is considered collinear if it can be written as a linear
combination of other variables. In the Linear Regression world this becomes a problem because
it blows up your standard errors: it is not possible to attribute variation in the output
variable to the collinear variables based on the given data alone. In some ways this is a
problem with the dataset, and people worrying about it are confusing a property of the
dataset with the properties of the model. For a more comprehensive discussion consider
reading this and this.
• Variance Inflation Factor is a metric that quantifies how much of the variation
in one variable is explained by the other covariates. It can be obtained by regressing each
variable on the complement set and computing an R-squared for each. The VIF for variable i is
defined as VIF_i = 1 / (1 − R_i²).
Typical Feature Selection routines using VIF drop variables whose VIF exceeds some threshold.
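A minimal sketch of the VIF computation, hand-rolled rather than taken from any particular library (the threshold of 5 mentioned in the comment is just a common rule of thumb, not the one used in this post's code):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X):
    """VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing
    column i on all the remaining columns."""
    vifs = []
    for i in range(X.shape[1]):
        others = np.delete(X, i, axis=1)
        r2 = LinearRegression().fit(others, X[:, i]).score(others, X[:, i])
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + 0.1 * rng.normal(size=500)   # nearly collinear with x1
X = np.column_stack([x1, x2, x3])

scores = vif(X)
# x1 and x3 get large VIFs; x2 stays near 1; a common rule drops features with VIF > 5
print(np.round(scores, 1))
```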
• Hierarchical Clustering using Spearman's Rank Correlation lets us learn about dependencies
between not just numerical features but also categorical features, and it can capture
non-linear (monotone) dependencies as well. A typical routine using Hierarchical Clustering first fits a
model, gets variable importances, gets hierarchical cluster memberships, and then drops
the least important members of each cluster.
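The clustering step can be sketched with SciPy on a few synthetic numerical features (the distance transform and cutoff below are illustrative choices, not the exact routine from this post's code; categorical features would first need an ordinal encoding):

```python
import numpy as np
from scipy.stats import spearmanr
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = x1 ** 3 + 0.1 * rng.normal(size=300)   # monotone but non-linear function of x1
x3 = rng.normal(size=300)                   # independent of the others
X = np.column_stack([x1, x2, x3])

# Spearman correlation captures monotone, not just linear, dependence
corr = spearmanr(X).correlation
dist = 1 - np.abs(corr)                     # turn correlation into a distance

# condensed upper-triangle distances feed into hierarchical clustering
Z = linkage(dist[np.triu_indices_from(dist, k=1)], method="average")
labels = fcluster(Z, t=0.5, criterion="distance")
print(labels)  # x1 and x2 land in one cluster, x3 in its own
```

From here you would keep the most important member of each cluster (by model importance) and drop the rest.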
[Figure: Comparison of various feature selection methods on classification metrics]
From the results above it doesn't appear that variable selection improves model performance.
In fact, it seems that having more variables results in better performance. Of course, this is
just one dataset, and a more comprehensive assessment would repeat the same process over several
datasets.
1.3 Improve Interpretability?
Unless you're building a system where accuracy is all that matters, you don't care about accuracy
alone: models also need to be interpretable. Interpretability means different things to different people,
and several different use cases are commonly lumped together. These might be as follows:
• the end user should be able to understand how the model arrived at a prediction
• the end user should be able to trust that the model is giving the right amount of importance
to the right variables in arriving at a prediction
• the modeler should be able to debug the model if it starts making predictions that don’t seem
correct
• the end user should be able to derive recommendations for actions from the model.
Having fewer variables in a model helps on all four counts but it doesn’t completely address all
these issues.
• Business recommendations could be derived from such a model by stratifying the population
based on the features the model considers important for prediction, then applying business rules
to take the action that optimizes the business metric under consideration. If these
variables are non-overlapping, this procedure is probably easier to apply.
• Having fewer and uncorrelated variables doesn’t shed any insight into the mechanism for
arriving at the prediction.
• Having fewer and uncorrelated variables changes the Random Forest default importance
measure. Below are the default variable importance measures from the models:
[Figure: Variable Importance Comparison]
The base model considers the duration of the call to be the most important feature, but duration is
not known before a call is made, and moreover, after the call the outcome is already known.
Including this variable in the model is an example of data leakage. Below are feature importances
for all four cases obtained after removing 'duration' from the dataset.
[Figure: Variable Importance Comparison]
The three models that implement feature selection consider the categorical feature 'loan',
indicating whether the person has a personal loan, to be the most important feature, followed
by the person's marital status and the type of communication method used, while the base model
considers the balance in their account to be the most important variable, followed by their age and
the date on which they were contacted. This is confusing. Two different people using different
variable importance measures and interpreting them as actionable insights might end up taking
completely different actions. So which one should we trust?
In this in-depth study of default variable importances in Random Forests it was found
that default variable importances can be biased, especially for features that vary in their scale
of measurement or in their number of categories. Instead, the authors recommend a different
measure: permutation feature importance. This procedure involves randomly permuting a
feature's values and seeing how much predictive accuracy drops. For a more
comprehensive discussion, please read the article.
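The permutation procedure is easy to sketch by hand on synthetic data (recent versions of scikit-learn also ship a `permutation_importance` helper in `sklearn.inspection`):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# with shuffle=False the 3 informative features are columns 0-2; the rest are noise
X, y = make_classification(n_samples=1500, n_features=8, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

baseline = model.score(X_te, y_te)
rng = np.random.default_rng(0)
importances = []
for i in range(X_te.shape[1]):
    X_perm = X_te.copy()
    X_perm[:, i] = rng.permutation(X_perm[:, i])   # break the feature-target link
    importances.append(baseline - model.score(X_perm, y_te))

# informative columns show the largest accuracy drop when permuted
print(np.round(importances, 3))
```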
Feature importances only tell you which variables a model considered important. They don't tell
you the magnitude of the output's dependence on a feature, or even its direction. In
order to obtain these kinds of partial dependences you might want to look into the interpretability
literature and consider methods such as Partial Dependence Plots, LIME, and SHAP values. In order
to derive a recommendation from this model you might want to think about the kinds of actions you
could take and what effects they might have on the outcome, but this requires estimating
the counterfactual, i.e. making predictions under intervention, and that is a totally different analysis
altogether.
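For illustration, here is a bare-bones partial dependence computation (a hypothetical helper on synthetic data, not one of the library implementations mentioned above): force a feature to each grid value, average the model's predictions, and read off both direction and magnitude.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
# the outcome probability rises with feature 0
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def partial_dependence(model, X, feature, grid):
    """Average predicted probability when `feature` is forced to each grid value."""
    pd_vals = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature] = v
        pd_vals.append(model.predict_proba(X_mod)[:, 1].mean())
    return np.array(pd_vals)

grid = np.linspace(-2, 2, 5)
pd_curve = partial_dependence(model, X, feature=0, grid=grid)
print(np.round(pd_curve, 2))  # rises along the grid: positive direction of dependence
```

Note that this is still an observational summary of the fitted model, not an estimate of what would happen under an actual intervention.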
1.4 In conclusion:
1) Feature selection methods may not give you a lift in accuracy.
2) They reduce the number of features and decorrelate them but they don’t help you interpret
the model in any useful way for making actionable business recommendations.
3) The main reason to interpret models is to make causal inferences, and standard remedies for
measuring collinearity won't help you do that. If all you care about is prediction, you can
just use regularization. If you want to make causal inferences about the effects of a variable,
it is useful to notice that you don't have much variation in that variable conditional on another
variable, but you should still condition on it.
All the code that goes with this post is available on this github repository.