PyData London 2018 talk on feature selection
PyData London
28th April, 2018
Thomas Huijskens
Senior Data Scientist
How to get better performance with less data
Feature collinearity and scarcity of data mean we can't just give a model many features and let it decide
which ones are useful and which ones are not.
There are multiple reasons to do feature selection when developing machine learning models:
• Computational burden: Limiting the number of features may reduce the computational burden of processing the data in
the learning algorithm.
• Risk of overfitting: Removing noisy or presumably redundant variables reduces the risk of overfitting and can lead to
better class separation.
• Interpretability: Removing redundant variables from the input data makes the results more interpretable for both the
seasoned practitioner and business stakeholders.
It pays off to do feature selection as part of the model development process
Feature selection algorithms should:
• remove variables that contain little or no information about the target variable; and
• reduce the overlap in information between the variables in the subset of selected features.
A good feature selection algorithm also shouldn't look at variables purely in isolation:
• Two variables that are useless by themselves can be useful together.
• Very high variable correlation (or anti-correlation) does not mean absence of variable complementarity.
What are the components of a good feature selection algorithm?
Two variables that are useless by themselves can be useful together
[1] Guyon, I. and Elisseeff, A., 2003. An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar), pp. 1157–1182.
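As a quick illustration of this claim (mine, not from the original slides), the sketch below builds an XOR-style target: each feature on its own carries essentially no mutual information about the target, yet a model given both features predicts it perfectly. The dataset size and estimator are arbitrary choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(2000, 2)).astype(float)
y = np.logical_xor(X[:, 0], X[:, 1]).astype(int)        # XOR target

# Individually, each feature tells us (almost) nothing about y ...
print(mutual_info_classif(X, y, discrete_features=True, random_state=0))

# ... but together they determine y exactly.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())           # close to 1.0
```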
Very high variable correlation (or anti-correlation) does not mean absence
of variable complementarity
[1] Guyon, I. and Elisseeff, A., 2003. An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar), pp. 1157–1182.
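A minimal sketch of this point (again mine, not from the slides): two features that are almost perfectly correlated, where the target lives entirely in their small difference, so neither feature alone is very predictive but the pair together is.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
x1 = rng.normal(size=5000)
x2 = x1 + 0.1 * rng.normal(size=5000)        # corr(x1, x2) is roughly 0.995
y = (x2 - x1 > 0).astype(int)                # target depends only on the small difference

features = np.column_stack([x1, x2])
print(np.corrcoef(x1, x2)[0, 1])
for cols in ([0], [1], [0, 1]):
    acc = cross_val_score(LogisticRegression(max_iter=1000),
                          features[:, cols], y, cv=5).mean()
    print(cols, round(acc, 3))               # single features barely beat chance; the pair is near-perfect
```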
Feature selection algorithms can be divided into three categories
[Diagram: three pipelines starting from the set of all features — wrapper methods: generate a subset ⇄ learning algorithm + performance; filter methods: generate a subset → learning algorithm → performance; embedded methods: subset selection happens inside the learning algorithm itself.]
Wrapper methods
Wrapper models apply a learning algorithm to the original data, and assess features by the performance of that learning algorithm.
Mlxtend is an open-source Python package that implements multiple
wrapper methods
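A minimal sketch of how mlxtend's SequentialFeatureSelector is typically used; the dataset and estimator here are placeholders, and the exact arguments should be checked against the mlxtend documentation for the installed version.

```python
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Greedy forward selection, scored by the cross-validated accuracy of the wrapped learner.
sfs = SFS(LogisticRegression(max_iter=5000),
          k_features=10,       # size of the feature subset to select
          forward=True,        # set False for backward elimination
          floating=False,      # set True for the floating (SFFS/SBFS) variants
          scoring='accuracy',
          cv=5)
sfs = sfs.fit(X, y)
print(sfs.k_feature_idx_)      # indices of the selected features
```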
Wrapper methods
Advantages
• Usually provide the best-performing feature set for that particular type of model.
Disadvantages
• May generate feature sets that are overly specific to the learner used.
• Because a new model is trained for each candidate subset, wrapper methods are computationally intensive.

Filter methods
Filter models do not fit a learner to the original data; they only consider statistical characteristics of the data set.
Filter methods example – mutual information
The mutual information quantifies the amount of information obtained about one random variable through another
random variable. For two variables $X$ and $Y$, the mutual information is given by

$$I(X; Y) = \int_X \int_Y p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \, dx \, dy.$$

It determines how similar the joint distribution $p(x, y)$ is to the product of the factored marginal distributions $p(x)\,p(y)$.
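For intuition, here is the discrete analogue of the formula (a sum over the joint support instead of an integral), worked through for a small hand-picked joint distribution; the numbers are illustrative only.

```python
import numpy as np

# Joint distribution p(x, y) of two binary variables (rows: x, columns: y).
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1, keepdims=True)    # marginal p(x)
p_y = p_xy.sum(axis=0, keepdims=True)    # marginal p(y)

# Discrete version of the integral: sum p(x, y) * log(p(x, y) / (p(x) p(y))).
mi = np.sum(p_xy * np.log(p_xy / (p_x * p_y)))
print(f"I(X; Y) = {mi:.3f} nats")        # about 0.193 nats for this joint
```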
Filter methods example – maximizing joint mutual information
In the feature selection problem, we would like to maximise the mutual information between the selected variables
$X_S$ and the target $Y$:

$$S^{*} = \arg\max_{S} I(X_S; Y) \quad \text{s.t.} \quad |S| = k,$$

where $k$ is the number of features we want to select.
This is an NP-hard problem, as the set of possible combinations of features grows exponentially.
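To get a feel for the combinatorics (the numbers below are plain binomial coefficients, not from the slides): even a modest problem blows up quickly.

```python
from math import comb

# Number of ways to pick k = 10 features out of d = 100 candidates.
print(comb(100, 10))   # 17,310,309,456,440 subsets -- exhaustive search is hopeless
```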
Filter methods example – maximizing joint mutual information
A popular heuristic in the literature is to use a greedy forward selection method, where features are selected
incrementally, one feature at a time.
Let $S^{t-1} = \{x_{f_1}, \ldots, x_{f_{t-1}}\}$ be the set of selected features at time step $t - 1$. The greedy method selects the next
feature $f^{t}$ such that

$$f^{t} = \arg\max_{i \notin S^{t-1}} I\big(X_{S^{t-1} \cup \{i\}}; Y\big).$$
Filter methods example – maximizing joint mutual information
One can show (proof omitted here) that this is equivalent to the following:

$$f^{t} = \arg\max_{i \notin S^{t-1}} \Big\{ I(X_i; Y) - \big[ I(X_i; X_{S^{t-1}}) - I(X_i; X_{S^{t-1}} \mid Y) \big] \Big\}.$$

However, the quantities involving $X_{S^{t-1}}$ quickly become computationally intractable, because they are $(t-1)$-dimensional integrals!
Mutual information based measures trade off relevancy of a variable
against the redundancy of the information a variable contains
We can use an approximation to the multidimensional integrals to make the computation more tractable:

$$f^{t} = \arg\max_{i \notin S^{t-1}} \underbrace{I(X_i; Y)}_{\text{relevancy}} - \underbrace{\bigg[ \alpha \sum_{k=1}^{t-1} I(X_{f_k}; X_i) - \beta \sum_{k=1}^{t-1} I(X_{f_k}; X_i \mid Y) \bigg]}_{\text{redundancy}},$$

where $\alpha$ and $\beta$ are to be specified. This greedy algorithm parametrizes a family of mutual information based
feature selection algorithms. The most prominent members of this family are:
1. Joint Mutual Information (JMI): $\alpha = \beta = \frac{1}{t-1}$.
2. Maximum relevancy minimum redundancy (MRMR): $\alpha = \frac{1}{t-1}$ and $\beta = 0$.
3. Mutual information maximisation (MIM): $\alpha = \beta = 0$.
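The sketch below is my own illustration of one member of this family, the MRMR criterion ($\beta = 0$), using scikit-learn's mutual information estimators for the pairwise terms; the function name and structure are illustrative, not a reference implementation. A full JMI implementation would additionally need conditional mutual information estimates $I(X_{f_k}; X_i \mid Y)$, which, to my knowledge, scikit-learn does not provide out of the box.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def greedy_mrmr(X, y, k, random_state=0):
    """Greedy forward selection with the MRMR criterion (alpha = 1/(t-1), beta = 0)."""
    relevance = mutual_info_classif(X, y, random_state=random_state)     # I(x_i; y)
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k:
        best_score, best_feature = -np.inf, None
        for i in remaining:
            redundancy = 0.0
            if selected:
                # Average pairwise I(x_{f_k}; x_i) over the already-selected features.
                redundancy = np.mean([
                    mutual_info_regression(X[:, [j]], X[:, i],
                                           random_state=random_state)[0]
                    for j in selected
                ])
            score = relevance[i] - redundancy
            if score > best_score:
                best_score, best_feature = score, i
        selected.append(best_feature)
        remaining.remove(best_feature)
    return selected
```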
There are many open-source Python modules available that do filter-based
feature selection
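For example, scikit-learn ships a univariate mutual information filter out of the box, which corresponds to the MIM member of the family above; the dataset here is just a placeholder.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

# Keep the 10 features with the highest estimated I(x_i; y), ignoring redundancy.
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_reduced = selector.fit_transform(X, y)
print(selector.get_support(indices=True))   # indices of the retained features
```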
Filter methods
Advantages
• Typically scale better to high-dimensional data sets than wrapper methods.
• Independent of the learning algorithm.
Disadvantages
• Ignore the interaction with the learning algorithm.
• Often rely on lower-dimensional approximations to keep the computation tractable, which means they may ignore interactions between different features.

Embedded methods
Embedded methods are a catch-all group of techniques that perform feature selection as part of the model construction process.
Embedded methods example – stability selection
• Stability selection wraps around a base learning algorithm that has a parameter controlling the amount of regularization.
• For every value of this parameter, we can get an estimate of which variables to select.
• Stability selection runs the learner on many bootstrap samples of the original data set, and keeps track of which
variables get selected in every sample to form a set of ‘stable’ variables.
For each bootstrap sample and each value of the penalization parameter: generate a bootstrap sample → estimate a LASSO on the bootstrapped sample → record which features get selected. Then compute the posterior probability of inclusion and select the set of 'stable' features.
Stability selection is straightforward to implement in Python, and mature
implementations exist for both Python and R
The core of the implementation iterates over the penalization parameter and over bootstrap samples, as in the sketch below.
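A minimal sketch of the idea, not the implementation shown on the slide: a LASSO base learner (so a regression target is assumed) refit on bootstrap samples over a grid of penalties, with the empirical selection frequency playing the role of the 'posterior probability of inclusion'. The penalty grid, number of bootstrap samples, threshold, and function name are all illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.utils import resample

def stability_selection(X, y, alphas=(0.01, 0.05, 0.1, 0.5), n_bootstrap=100,
                        threshold=0.8, random_state=0):
    """Return the indices of 'stable' features and their selection frequencies."""
    rng = np.random.RandomState(random_state)
    counts = np.zeros(X.shape[1])
    n_fits = 0
    for alpha in alphas:                       # iterate over the penalization parameter ...
        for _ in range(n_bootstrap):           # ... and over bootstrap samples
            Xb, yb = resample(X, y, random_state=rng)
            model = Lasso(alpha=alpha).fit(Xb, yb)
            counts += np.abs(model.coef_) > 1e-8   # record which features were selected
            n_fits += 1
    freq = counts / n_fits                     # empirical probability of inclusion
    return np.where(freq >= threshold)[0], freq
```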
Embedded methods
Advantages
• Take the interaction between the feature subset search and the learning algorithm into account.
Disadvantages
• Computationally more expensive than filter methods.
Each of these three approaches has its advantages and disadvantages; the primary distinguishing factors are the speed of computation and the chance of overfitting:
• In terms of speed, filters are faster than embedded methods which are in turn faster than wrappers.
• In terms of overfitting, wrappers have higher learning capacity so are more likely to overfit than embedded methods,
which in turn are more likely to overfit than filter methods.
All of this of course changes with extremes of data/feature availability.
What type of algorithm should I use in practice?