PyData London 2018 talk on feature selection
PyData London
28th April, 2018
Thomas Huijskens
Senior Data Scientist
How to get better performance with less data
Feature collinearity and scarcity of data mean we can't just give a model many features and let it decide
which ones are useful and which ones are not.
There are multiple reasons to do feature selection when developing machine learning models:
• Computational burden: Limiting the number of features may reduce the computational burden of processing the data in
the learning algorithm.
• Risk of overfitting: Removing noisy or presumably redundant variables reduces the risk of overfitting and can lead to
better class separation.
• Interpretability: Removing redundant variables from the input data makes the results more interpretable for both the
seasoned practitioner and business stakeholders.
It pays off to do feature selection as part of the model development process
Feature selection algorithms should:
• remove variables that contain little or no information about the target variable; and
• reduce the overlap in information between the variables in the subset of selected features.
A good feature selection algorithm also shouldn't look at variables purely in isolation:
• Two variables that are useless by themselves can be useful together.
• Very high variable correlation (or anti-correlation) does not mean absence of variable complementarity.
What are the components of a good feature selection algorithm?
Two variables that are useless by themselves can be useful together
[1] Guyon, I. and Elisseeff, A., 2003. An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar), pp. 1157–1182.
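As a quick illustration of this claim (mine, not from the original slides), the sketch below builds an XOR-style target: each feature on its own carries essentially no mutual information about the target, yet a model given both features predicts it perfectly. The dataset size and estimator are arbitrary choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(2000, 2)).astype(float)
y = np.logical_xor(X[:, 0], X[:, 1]).astype(int)        # XOR target

# Individually, each feature tells us (almost) nothing about y ...
print(mutual_info_classif(X, y, discrete_features=True, random_state=0))

# ... but together they determine y exactly.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())           # close to 1.0
```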
Very high variable correlation (or anti-correlation) does not mean absence
of variable complementarity
[1] Guyon, I. and Elisseeff, A., 2003. An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar), pp. 1157–1182.
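A minimal sketch of this point (again mine, not from the slides): two features that are almost perfectly correlated, where the target lives entirely in their small difference, so neither feature alone is very predictive but the pair together is.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
x1 = rng.normal(size=5000)
x2 = x1 + 0.1 * rng.normal(size=5000)        # corr(x1, x2) is roughly 0.995
y = (x2 - x1 > 0).astype(int)                # target depends only on the small difference

features = np.column_stack([x1, x2])
print(np.corrcoef(x1, x2)[0, 1])
for cols in ([0], [1], [0, 1]):
    acc = cross_val_score(LogisticRegression(max_iter=1000),
                          features[:, cols], y, cv=5).mean()
    print(cols, round(acc, 3))               # single features barely beat chance; the pair is near-perfect
```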
Feature selection algorithms can be divided into three categories
[Diagram: three pipelines starting from the set of all features — wrapper methods: generate a subset ⇄ learning algorithm + performance; filter methods: generate a subset → learning algorithm → performance; embedded methods: subset selection happens inside the learning algorithm itself.]
Wrapper methods
Wrapper models apply a learning algorithm to the original data, and assess features by the performance of that learning algorithm.
Mlxtend is an open-source Python package that implements multiple
wrapper methods
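A minimal sketch of how mlxtend's SequentialFeatureSelector is typically used; the dataset and estimator here are placeholders, and the exact arguments should be checked against the mlxtend documentation for the installed version.

```python
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Greedy forward selection, scored by the cross-validated accuracy of the wrapped learner.
sfs = SFS(LogisticRegression(max_iter=5000),
          k_features=10,       # size of the feature subset to select
          forward=True,        # set False for backward elimination
          floating=False,      # set True for the floating (SFFS/SBFS) variants
          scoring='accuracy',
          cv=5)
sfs = sfs.fit(X, y)
print(sfs.k_feature_idx_)      # indices of the selected features
```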
Wrapper methods
Advantages
• Usually provide the best-performing feature set for that particular type of model.
Disadvantages
• May generate feature sets that are overly specific to the learner used.
• Because a new model is trained for each candidate subset, wrapper methods are computationally intensive.

Filter methods
Filter models do not fit a learner to the original data; they only consider statistical characteristics of the data set.
Filter methods example – mutual information
The mutual information quantifies the amount of information obtained about one random variable through another
random variable. For two variables $X$ and $Y$, the mutual information is given by

$$I(X; Y) = \int_X \int_Y p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \, dx \, dy.$$

It determines how similar the joint distribution $p(x, y)$ is to the product of the factored marginal distributions $p(x)\,p(y)$.
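For intuition, here is the discrete analogue of the formula (a sum over the joint support instead of an integral), worked through for a small hand-picked joint distribution; the numbers are illustrative only.

```python
import numpy as np

# Joint distribution p(x, y) of two binary variables (rows: x, columns: y).
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1, keepdims=True)    # marginal p(x)
p_y = p_xy.sum(axis=0, keepdims=True)    # marginal p(y)

# Discrete version of the integral: sum p(x, y) * log(p(x, y) / (p(x) p(y))).
mi = np.sum(p_xy * np.log(p_xy / (p_x * p_y)))
print(f"I(X; Y) = {mi:.3f} nats")        # about 0.193 nats for this joint
```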
Filter methods example – maximizing joint mutual information
In the feature selection problem, we would like to maximise the mutual information between the selected variables
$X_S$ and the target $Y$:

$$S^{*} = \arg\max_{S} I(X_S; Y) \quad \text{s.t.} \quad |S| = k,$$

where $k$ is the number of features we want to select.
This is an NP-hard problem, as the set of possible combinations of features grows exponentially.
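To get a feel for the combinatorics (the numbers below are plain binomial coefficients, not from the slides): even a modest problem blows up quickly.

```python
from math import comb

# Number of ways to pick k = 10 features out of d = 100 candidates.
print(comb(100, 10))   # 17,310,309,456,440 subsets -- exhaustive search is hopeless
```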
Filter methods example – maximizing joint mutual information
A popular heuristic in the literature is to use a greedy forward selection method, where features are selected
incrementally, one feature at a time.
Let $S^{t-1} = \{x_{f_1}, \ldots, x_{f_{t-1}}\}$ be the set of selected features at time step $t - 1$. The greedy method selects the next
feature $f^{t}$ such that

$$f^{t} = \arg\max_{i \notin S^{t-1}} I\big(X_{S^{t-1} \cup \{i\}}; Y\big).$$
Filter methods example – maximizing joint mutual information
One can show (proof omitted here) that this is equivalent to the following:

$$f^{t} = \arg\max_{i \notin S^{t-1}} \Big\{ I(X_i; Y) - \big[ I(X_i; X_{S^{t-1}}) - I(X_i; X_{S^{t-1}} \mid Y) \big] \Big\}.$$

However, the quantities involving $X_{S^{t-1}}$ quickly become computationally intractable, because they are $(t-1)$-dimensional integrals!
Mutual information based measures trade off relevancy of a variable
against the redundancy of the information a variable contains
We can use an approximation to the multidimensional integrals to make the computation more tractable:

$$f^{t} = \arg\max_{i \notin S^{t-1}} \underbrace{I(X_i; Y)}_{\text{relevancy}} - \underbrace{\bigg[ \alpha \sum_{k=1}^{t-1} I(X_{f_k}; X_i) - \beta \sum_{k=1}^{t-1} I(X_{f_k}; X_i \mid Y) \bigg]}_{\text{redundancy}},$$

where $\alpha$ and $\beta$ are to be specified. This greedy algorithm parametrizes a family of mutual information based
feature selection algorithms. The most prominent members of this family are:
1. Joint Mutual Information (JMI): $\alpha = \beta = \frac{1}{t-1}$.
2. Maximum relevancy minimum redundancy (MRMR): $\alpha = \frac{1}{t-1}$ and $\beta = 0$.
3. Mutual information maximisation (MIM): $\alpha = \beta = 0$.
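The sketch below is my own illustration of one member of this family, the MRMR criterion ($\beta = 0$), using scikit-learn's mutual information estimators for the pairwise terms; the function name and structure are illustrative, not a reference implementation. A full JMI implementation would additionally need conditional mutual information estimates $I(X_{f_k}; X_i \mid Y)$, which, to my knowledge, scikit-learn does not provide out of the box.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def greedy_mrmr(X, y, k, random_state=0):
    """Greedy forward selection with the MRMR criterion (alpha = 1/(t-1), beta = 0)."""
    relevance = mutual_info_classif(X, y, random_state=random_state)     # I(x_i; y)
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k:
        best_score, best_feature = -np.inf, None
        for i in remaining:
            redundancy = 0.0
            if selected:
                # Average pairwise I(x_{f_k}; x_i) over the already-selected features.
                redundancy = np.mean([
                    mutual_info_regression(X[:, [j]], X[:, i],
                                           random_state=random_state)[0]
                    for j in selected
                ])
            score = relevance[i] - redundancy
            if score > best_score:
                best_score, best_feature = score, i
        selected.append(best_feature)
        remaining.remove(best_feature)
    return selected
```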
There are many open-source Python modules available that do filter-based
feature selection
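For example, scikit-learn ships a univariate mutual information filter out of the box, which corresponds to the MIM member of the family above; the dataset here is just a placeholder.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

# Keep the 10 features with the highest estimated I(x_i; y), ignoring redundancy.
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_reduced = selector.fit_transform(X, y)
print(selector.get_support(indices=True))   # indices of the retained features
```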
Filter methods
Advantages
• Typically scale better to high-dimensional data sets than wrapper methods.
• Independent of the learning algorithm.
Disadvantages
• Ignore the interaction with the learning algorithm.
• Often rely on lower-dimensional approximations to keep the computation tractable, which means they may ignore interactions between different features.

Embedded methods
Embedded methods are a catch-all group of techniques that perform feature selection as part of the model construction process.
Embedded methods example – stability selection
• Stability selection wraps around a base learning algorithm that has a parameter controlling the amount of regularization.
• For every value of this parameter, we can get an estimate of which variables to select.
• Stability selection runs the learner on many bootstrap samples of the original data set, and keeps track of which
variables get selected in every sample to form a set of ‘stable’ variables.
For each bootstrap sample and each value of the penalization parameter: generate a bootstrap sample → estimate a LASSO on the bootstrapped sample → record which features get selected. Then compute the posterior probability of inclusion and select the set of 'stable' features.
Stability selection is straightforward to implement in Python, and mature
implementations exist for both Python and R
The core of the implementation iterates over the penalization parameter and over bootstrap samples, as in the sketch below.
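A minimal sketch of the idea, not the implementation shown on the slide: a LASSO base learner (so a regression target is assumed) refit on bootstrap samples over a grid of penalties, with the empirical selection frequency playing the role of the 'posterior probability of inclusion'. The penalty grid, number of bootstrap samples, threshold, and function name are all illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.utils import resample

def stability_selection(X, y, alphas=(0.01, 0.05, 0.1, 0.5), n_bootstrap=100,
                        threshold=0.8, random_state=0):
    """Return the indices of 'stable' features and their selection frequencies."""
    rng = np.random.RandomState(random_state)
    counts = np.zeros(X.shape[1])
    n_fits = 0
    for alpha in alphas:                       # iterate over the penalization parameter ...
        for _ in range(n_bootstrap):           # ... and over bootstrap samples
            Xb, yb = resample(X, y, random_state=rng)
            model = Lasso(alpha=alpha).fit(Xb, yb)
            counts += np.abs(model.coef_) > 1e-8   # record which features were selected
            n_fits += 1
    freq = counts / n_fits                     # empirical probability of inclusion
    return np.where(freq >= threshold)[0], freq
```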
Embedded methods
Advantages
• Take the interaction between the feature subset search and the learning algorithm into account.
Disadvantages
• Computationally more expensive than filter methods.
Each of these three approaches has its advantages and disadvantages; the primary distinguishing factors are the speed of computation and the chance of overfitting:
• In terms of speed, filters are faster than embedded methods which are in turn faster than wrappers.
• In terms of overfitting, wrappers have higher learning capacity so are more likely to overfit than embedded methods,
which in turn are more likely to overfit than filter methods.
All of this of course changes with extremes of data/feature availability.
What type of algorithm should I use in practice?