Feature Selection
Feature Selection
• Selection of a subset of features from a larger pool of available features.
• Goal: to select features that are rich in discriminatory information with respect to
the classification problem at hand.
• A poor choice of features leads the classifier to perform poorly.
• Selecting highly informative features is an attempt
• to place classes in the feature space far apart from each other (large between-class
distance)
• to position the data points within each class close to each other (small within-class
variance).
Feature Selection
• Another major issue in feature selection is choosing the number of features, l, to be
used out of an originally larger number of features n > l.
• Reducing this number helps in avoiding overfitting to the specific training
data set and in designing classifiers that result in good generalization
performance—that is, classifiers that perform well when faced with data
outside the training set.
• Before feature selection techniques can be used, a preprocessing stage is
necessary for “housekeeping” purposes, such as removal of outlier points
and data normalization
Feature Selection
OUTLIER REMOVAL
• An outlier is a point that lies far away from the mean value of the corresponding random variable.
• Points with values far from the rest of the data may cause large errors during the classifier
training phase.
• This is not desirable, especially when the outliers are the result of noisy measurements.
• For normally distributed data, a threshold of 1, 2, or 3 times the standard deviation is used
to define outliers.
• Points that lie away from the mean by a value larger than this threshold are removed.
• However, for non-normal distributions, more rigorous measures should be considered
(e.g., cost functions).
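A minimal sketch of this thresholding rule, assuming NumPy is available; the threshold k and the sample data are purely illustrative:

```python
import numpy as np

def remove_outliers(x, k=3.0):
    """Keep only the values of a 1-D feature that lie within k standard
    deviations of the mean (appropriate for roughly normal data)."""
    mu, sigma = x.mean(), x.std()
    mask = np.abs(x - mu) <= k * sigma
    return x[mask]

# Illustrative data: the value 8.5 lies far from the rest and is dropped at k = 2.
x = np.array([1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.3, 8.5])
x_clean = remove_outliers(x, k=2.0)   # -> [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.3]
```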
Feature Selection
DATA NORMALIZATION
• Features are scaled so that they all have comparable ranges, e.g., by normalizing each
feature to zero mean and unit variance.
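A minimal sketch of the usual zero-mean, unit-variance (z-score) normalization, assuming NumPy and a samples-by-features data matrix:

```python
import numpy as np

def normalize(X):
    """Shift each feature (column) to zero mean and scale it to unit
    standard deviation so that all features have comparable ranges."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma
```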
Feature Selection
• Three types of feature selection
• Individual feature selection
• Combination of features
• Feature subset selection
Individual Feature Selection
• The first step in FS is to look at each feature individually and check whether or not
it is an informative one.
• If not, the feature is discarded.
• To this end, statistical tests are commonly used.
• The idea is to test whether the mean values of a feature differ significantly between the
two classes.
• In the case of more than two classes, the test may be applied for each class pair.
• Assuming that the data in the classes are normally distributed, the t-test is a
popular choice.
Individual Feature Selection
HYPOTHESIS TESTING: THE t-TEST
• The goal of the statistical t-test is to determine which of the following two hypotheses is
true:
H0: The mean values of the feature in the two classes are equal. (null hypothesis)
H1: The mean values of the feature in the two classes are different. (alternative hypothesis)
• If the null hypothesis cannot be rejected, the feature is discarded, i.e., no significant
difference between the means of the two classes exists.
• The hypothesis test is carried out against the so-called significance level, α,
which corresponds to the probability of committing an error in our decision.
• Typical values used in practice are α = 0.05 and α = 0.001.
• A significance level of 0.05 indicates a 5% risk of concluding that a difference exists
when there is no actual difference.
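A hedged sketch of this per-feature selection rule with the two-sample t-test, assuming SciPy is available; x_class1 and x_class2 hold the values of one feature in the two classes:

```python
from scipy import stats

def t_test_keep(x_class1, x_class2, alpha=0.05):
    """Keep the feature only if the class means differ significantly,
    i.e., the null hypothesis of equal means is rejected at level alpha."""
    t_stat, p_value = stats.ttest_ind(x_class1, x_class2)
    return p_value < alpha, p_value
```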
Individual Feature Selection
HYPOTHESIS TESTING: THE t-TEST
• The t-test assumes that the values of the features are drawn from normal
distributions
• If the feature distributions turn out not to be normal, one should choose a
nonparametric statistical significance test, such as the Wilcoxon rank-sum test,
or use a distribution-free measure such as Fisher's discriminant ratio.
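For non-normal feature distributions, the same selection rule can use the Wilcoxon rank-sum test instead (SciPy's ranksums; variable names as in the sketch above):

```python
from scipy import stats

# Nonparametric alternative to the t-test: compares the two class
# distributions without assuming normality.
stat, p_value = stats.ranksums(x_class1, x_class2)
keep_feature = p_value < 0.05
```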
Individual Feature Selection
FISHER’S DISCRIMINANT RATIO
• FDR is commonly employed to quantify the discriminatory power of
individual features between two equiprobable classes.
• It is independent of the type of class distribution.
• Let μ₁, μ₂ and σ₁², σ₂² denote the means and variances associated with the values
of a feature in the two classes. The FDR is defined as
FDR = (μ₁ − μ₂)² / (σ₁² + σ₂²)
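A one-line sketch of the FDR computation for a single feature, where the NumPy arrays x1 and x2 hold that feature's values in the two classes:

```python
import numpy as np

def fisher_discriminant_ratio(x1, x2):
    """FDR = (mu1 - mu2)^2 / (sigma1^2 + sigma2^2); larger values mean
    better separation of the two classes along this feature."""
    return (x1.mean() - x2.mean()) ** 2 / (x1.var() + x2.var())
```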
CLASS SEPARABILITY MEASURES
• The previous measures quantify the class-discriminatory power of
individual features.
• In this section, we turn our attention from individual features to combinations of
features (i.e., feature vectors) and describe measures that quantify class
separability in the respective feature space.
• Three class-separability measures are considered:
• Divergence
• Bhattacharyya distance and
• Scatter matrices
Divergence
Bhattacharyya Distance
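As one concrete illustration of these measures, a minimal sketch of the Bhattacharyya distance for two univariate Gaussian class distributions (standard closed form; the general case uses the full mean vectors and covariance matrices):

```python
import numpy as np

def bhattacharyya_gaussian(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two univariate Gaussians
    N(mu1, var1) and N(mu2, var2); larger values mean better separability."""
    term_means = 0.25 * (mu1 - mu2) ** 2 / (var1 + var2)
    term_vars = 0.5 * np.log((var1 + var2) / (2.0 * np.sqrt(var1 * var2)))
    return term_means + term_vars
```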
FEATURE SUBSET SELECTION
• Reduce the number of features by discarding the less informative ones, using
scalar feature selection.
• Consider the features that survive from the previous step in different combinations
in order to keep the “best” combination.
• Exhaustive search
• Sequential forward and backward selection
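A hedged sketch of sequential forward selection; score_fn is a hypothetical callable that scores a candidate feature subset (e.g., cross-validated accuracy or a class-separability measure):

```python
def sequential_forward_selection(all_features, score_fn, k):
    """Greedy search: start from the empty set and repeatedly add the
    single feature that most improves the subset score, until k features
    are selected."""
    selected, remaining = [], list(all_features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score_fn(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected
```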
Evaluating ML Models
Confusion Matrix
• In a two-class (positive and negative) problem, a classifier’s ability to
predict a true or false state gives rise to four output possibilities:
• True positive (TP)
• If the actual class is positive and the classifier also predicts it as positive
• False negative (FN)
• If the actual class is positive, however, the classifier predicts it as negative
• True negative (TN)
• If the actual class is negative and the classifier also predicts it as negative
• False positive (FP)
• If the actual class is negative, however, the classifier predicts it as positive
Confusion Matrix
• Worked example with 10 samples: actual positives P = 5, actual negatives N = 5; predicted positives PP = 7, predicted negatives PN = 3.
• TP = 4 (hit), FN = 1 (Type II error, miss), FP = 3 (Type I error, false alarm), TN = 2 (correct rejection).
• True positive rate (TPR), recall, sensitivity, probability of detection = TP/P = 4/5
• False negative rate (FNR) = FN/P = 1/5
• False positive rate (FPR) = FP/N = 3/5
• True negative rate (TNR), specificity = TN/N = 2/5
• Precision, positive predictive value (PPV) = TP/PP = 4/7
• False discovery rate (FDR) = FP/PP = 3/7
• False omission rate (FOR) = FN/PN = 1/3
• Negative predictive value (NPV) = TN/PN = 2/3
• Accuracy = (TP + TN)/(P + N) = 0.60
• F1 score = 2 × PPV × TPR/(PPV + TPR) ≈ 0.67
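A short check of the numbers in the example above, computed directly from the four counts:

```python
# Counts from the worked example above.
TP, FN, FP, TN = 4, 1, 3, 2
P, N = TP + FN, FP + TN          # actual positives / negatives
PP, PN = TP + FP, FN + TN        # predicted positives / negatives

tpr = TP / P                     # recall, sensitivity   = 0.8
fnr = FN / P                     # miss rate             = 0.2
fpr = FP / N                     # false alarm rate      = 0.6
tnr = TN / N                     # specificity           = 0.4
ppv = TP / PP                    # precision             ≈ 0.571
npv = TN / PN                    #                       ≈ 0.667
accuracy = (TP + TN) / (P + N)   #                       = 0.60
f1 = 2 * ppv * tpr / (ppv + tpr) #                       ≈ 0.667
```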
1. Mean-normalized features from a data set are given below:
• x1 = [0.6 0 -0.6] and x2 = [0.5 -0.1 -0.4];
a. Write the data matrix X.
b. Find the covariance matrix.
2. If the eigenvalues and eigenvectors of the covariance matrix of the data
matrix are:
• λ = 0.1, 1.1
• E =
[ 0.5  −0.9
 −0.9  −0.5 ]
a. Write the transformed data matrix in terms of the projection on the
vector that explains maximum variance.
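A hedged NumPy sketch of the steps the exercise asks for, assuming rows are samples and columns are the two features, and dividing by the number of samples for the covariance (use m − 1 if the course convention differs). Note that part 2 supplies its own eigenvalues and eigenvectors, which need not coincide with those of this matrix:

```python
import numpy as np

# Data matrix: each row is a sample, columns are the mean-normalized
# features x1 and x2.
X = np.array([[ 0.6,  0.5],
              [ 0.0, -0.1],
              [-0.6, -0.4]])

# Covariance matrix of the already mean-normalized data.
C = X.T @ X / X.shape[0]

# Eigen-decomposition; project onto the eigenvector with the largest
# eigenvalue (the direction of maximum variance).
eigvals, eigvecs = np.linalg.eigh(C)
w = eigvecs[:, np.argmax(eigvals)]
X_projected = X @ w
```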