WORKING WITH DATA FOR
MACHINE LEARNING
PROJECTS (CONT.)
By Dr. Mehwish
Miscellaneous Topics
2
 Feature selection
 Automatic feature selection
 Hand crafted features
 Feature engineering
 Feature extraction
 Curse of dimensionality
 Dimensionality reduction
Dimensionality Reduction
3
 Dimensionality reduction, or dimension reduction,
is the transformation of data from a high-
dimensional space into a low-dimensional space so
that the low-dimensional representation retains
some meaningful properties of the original data.
Dimensionality Reduction
4
 Why?
 Increasing the number of inputs or features does not
always improve the classification accuracy.
 The performance of a classifier may degrade when
irrelevant or redundant features are included.
 Curse of dimensionality: the “intrinsic” dimensionality of
the data may be much smaller than the number of
features actually measured.
Dimensionality Reduction
5
 Benefits
 Improves the classification performance.
 Improves learning efficiency and enables faster
classification.
 Better understanding of the underlying process
mapping inputs to output.
Dimensionality Reduction
6
 Benefits
 Dimensionality reduction reduces the risk
of overfitting
 Dimensionality reduction is extremely useful for data
visualization
 Dimensionality reduction can remove noise from the data
 Dimensionality reduction can be used for image
compression
Dimensionality Reduction
7
 Feature Selection:
 Select a subset of the existing features.
 Select the subset of features that either improves the
classification accuracy or maintains the same accuracy.
Dimensionality Reduction
8
 Data set:
 Five Boolean features
 y = x1 OR x2
 x3 = NOT x2
 x4 = NOT x5
 Optimal subset: {x1, x2} or {x1, x3}
 An exhaustive search over the space of all feature subsets would have
 2^d possibilities
 We can’t search over all possibilities, and therefore we rely
on heuristic methods (a brute-force sketch of the exhaustive search follows this slide).
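To make the 2^d explosion concrete, here is a minimal brute-force sketch in plain Python (standard library only) that enumerates every subset for the slide's 5-feature Boolean example; the data generation simply encodes the definitions above, and the sufficiency test is an illustrative stand-in for a real accuracy criterion.

```python
from itertools import combinations, product

# Enumerate every row of the 5-feature Boolean data set defined on the slide:
# y = x1 OR x2,  x3 = NOT x2,  x4 = NOT x5  (x1, x2, x5 vary freely).
rows = []
for x1, x2, x5 in product([0, 1], repeat=3):
    x3, x4 = 1 - x2, 1 - x5
    rows.append(((x1, x2, x3, x4, x5), x1 | x2))

features = range(5)  # indices of x1..x5

def is_sufficient(subset):
    """A subset suffices if no two rows agree on it yet disagree on y."""
    seen = {}
    for x, y in rows:
        key = tuple(x[i] for i in subset)
        if seen.setdefault(key, y) != y:
            return False
    return True

# 2**5 = 32 candidate subsets here, but 2**d in general -- infeasible for large d.
smallest = min((s for k in range(6) for s in combinations(features, k)
                if is_sufficient(s)), key=len)
print(smallest)   # (0, 1), i.e. {x1, x2}; {x1, x3} is equally small
```

With d features the loop visits 2^d subsets, which is why the deck falls back on the heuristic filter and wrapper methods described next.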
Dimensionality Reduction
9
 How do we choose this subset?
 Feature selection can be considered an optimization problem that
involves
 Searching the space of possible feature subsets
 Choosing the subset that is optimal or near-optimal with respect to some
objective function
 Filter Methods (unsupervised method)
 Evaluation is independent of the learning algorithm
 Consider the input features only and select the subset that carries the most
information
 Wrapper Methods (supervised method)
 Evaluation is carried out using the machine learning algorithm itself
(model selection)
 Train on the selected subset and estimate the error on a validation dataset
(a filter-vs-wrapper sketch follows this slide)
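As a concrete contrast between the two families, here is a hedged scikit-learn sketch (the library and the breast-cancer demo data are assumptions, not part of the slides): the filter step ranks features by mutual information without any classifier in the loop, while the wrapper-style check trains the chosen classifier on the candidate subset and estimates its error by cross-validation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Filter method: rank features by mutual information with the label,
# independently of any learning algorithm.
filt = SelectKBest(mutual_info_classif, k=10).fit(X, y)
X_filtered = filt.transform(X)

# Wrapper-style evaluation: train the chosen classifier on the candidate
# subset and estimate its error (here via cross-validation).
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print("10 filtered features:", cross_val_score(clf, X_filtered, y, cv=5).mean().round(3))
print("all 30 features:     ", cross_val_score(clf, X, y, cv=5).mean().round(3))
```

A full wrapper would repeat the scoring step inside a search over subsets, which is exactly the forward and backward search described on the next slide.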
Dimensionality Reduction
10
 [Figure-only slide.]
Dimensionality Reduction
11
 Wrapper Method (a code sketch of the forward search follows this slide):
 Forward Search Feature Subset Selection Algorithm
 Start with empty set as feature subset
 Try adding one feature from the remaining features to the subset
 Estimate classification or regression error for adding each feature
 Add the feature that gives the maximum improvement to the subset
 Backward Search Feature Subset Selection Algorithm
 Start with full feature set as subset
 Try removing one feature from the subset
 Estimate classification or regression error for removing each
feature
 Remove/drop the feature whose removal has the least impact on the
error, or reduces it
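Below is a minimal sketch of the forward-search wrapper described above, assuming scikit-learn and its wine demo data set; cross-validated accuracy stands in for the slide's classification error (maximizing accuracy is equivalent to minimizing error), and the k-NN classifier is an arbitrary choice.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=5)

selected, remaining = [], list(range(X.shape[1]))
best_score = 0.0
while remaining:
    # Try adding each remaining feature and keep the one that helps most.
    scores = {f: cross_val_score(clf, X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    f_best, s_best = max(scores.items(), key=lambda kv: kv[1])
    if s_best <= best_score:          # no further improvement -> stop
        break
    selected.append(f_best)
    remaining.remove(f_best)
    best_score = s_best

print("selected features:", selected, "cv accuracy: %.3f" % best_score)
```

The backward search is the mirror image: start from the full feature set and repeatedly drop the feature whose removal hurts the cross-validated score least.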
Dimensionality Reduction
12
 Feature Extraction:
 Transform existing features to obtain a set of new
features using some mapping function.
Dimensionality Reduction
13
 Feature Extraction:
 The mapping function z = f(x) can be linear or non-linear.
 It can be interpreted as a projection or mapping of the
data from the higher-dimensional space to the lower-
dimensional space.
 Mathematically, we want to find an optimum
mapping z = f(x) that preserves the desired
information as much as possible.
Dimensionality Reduction
14
 Finding the optimum mapping is equivalent to
optimizing an objective function.
 Different methods use different objective functions:
 Minimize information loss: a mapping that represents the
data as accurately as possible in the lower-dimensional
space, e.g., Principal Component Analysis (PCA).
 Maximize discriminatory information: a mapping that
best discriminates the classes in the lower-dimensional
space, e.g., Linear Discriminant Analysis (LDA) (a PCA vs. LDA sketch follows this slide).
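A short sketch of the two objectives side by side, assuming scikit-learn and its iris demo data: PCA keeps the directions of maximum variance (minimum information loss) without looking at the labels, while LDA uses the labels to keep the most discriminative directions.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)   # PCA is scale sensitive

# PCA: unsupervised, keeps the directions of maximum variance.
X_pca = PCA(n_components=2).fit_transform(X_std)

# LDA: supervised, keeps the directions that best separate the classes.
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X_std, y)

print(X_pca.shape, X_lda.shape)   # both (150, 2), but optimized for different objectives
```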
Dimensionality Reduction
15–18
 [Figure-only slides.]
Dimensionality Reduction
19
 Step 3: Calculate the Covariance Matrix
 What do the covariances that appear as entries of the
matrix tell us about the correlations between the
variables?
 It is actually the sign of the covariance that matters:
 If positive: the two variables increase or decrease
together (positively correlated)
 If negative: one increases when the other
decreases (inversely correlated) — see the small numerical example after this slide
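A small numerical example of this step, using NumPy on synthetic data whose correlations are known by construction (the variable names and coefficients are illustrative only).

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: x2 rises with x1 (positive covariance), x3 falls as x1 rises (negative).
x1 = rng.normal(size=200)
x2 = 2.0 * x1 + rng.normal(scale=0.5, size=200)
x3 = -1.5 * x1 + rng.normal(scale=0.5, size=200)
X = np.column_stack([x1, x2, x3])

# Standardize the variables, then compute the covariance matrix.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
cov = np.cov(X_std, rowvar=False)   # 3x3 symmetric matrix
print(np.round(cov, 2))             # signs of the off-diagonal entries show the correlations
```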
Dimensionality Reduction
20
 Step 4: Compute the eigenvectors and eigenvalues of the
covariance matrix:
 Eigenvectors and eigenvalues are calculated to determine
the principal components of the data.
 For example, 10-dimensional data gives you 10 principal
components.
 Only the first few components are kept, because most of the
information in the initial variables is squeezed
(compressed) into them.
 To put it simply, think of principal components as
new axes that provide the best angle from which to view and evaluate
the data, so that the differences between the observations
are more visible (a code sketch follows this slide).
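The eigen-decomposition step can be sketched directly with NumPy; the wine demo data set from scikit-learn is an arbitrary stand-in for the slide's 10-dimensional example (it has 13 features), and keeping k = 2 components is an illustrative choice.

```python
import numpy as np
from sklearn.datasets import load_wine

X, _ = load_wine(return_X_y=True)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 4: eigenvectors/eigenvalues of the covariance matrix.
cov = np.cov(X_std, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh: the covariance matrix is symmetric
order = np.argsort(eigvals)[::-1]           # sort from largest eigenvalue down
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Share of the total variance carried by each principal component.
explained = eigvals / eigvals.sum()
print(np.round(explained[:3], 3))           # the first few components dominate

# Keep only the first k components and project (13 dims squeezed into k).
k = 2
X_reduced = X_std @ eigvecs[:, :k]
print(X_reduced.shape)                      # (178, 2)
```

In practice, sklearn.decomposition.PCA wraps essentially these steps (centering, covariance/SVD, sorting by explained variance, projection).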
Dimensionality Reduction
21
 [Figure-only slide.]
Dimensionality Reduction
22
 For example, assume that the scatter plot of our
data set is as shown on the following (figure-only) slide. Can we guess the first
principal component?
 Yes, it is approximately the line through the purple
marks: it passes through the origin, and it is the
line along which the projections of the points (red dots) are
most spread out.
 Mathematically speaking, it is the line that maximizes
the variance, i.e., the average of the squared distances
from the projected points (red dots) to the origin.
Dimensionality Reduction
23
 [Figure-only slides, including the scatter plot referenced on the previous slide.]
Other Dimensionality Reduction
Techniques
 Missing Values Ratio.
 Data columns with too many missing values are
unlikely to carry much useful information.
 Thus, data columns whose ratio of missing values
exceeds a given threshold can be removed.
 The lower the threshold, the more aggressive the
reduction (a pandas sketch follows this slide).
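A minimal pandas sketch of the missing-values-ratio filter; the column names, the 50% threshold, and the synthetic frame are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical frame: 'sensor_b' is 70% missing, the other columns are complete.
df = pd.DataFrame({
    "sensor_a": rng.random(100),
    "sensor_b": np.r_[np.full(70, np.nan), rng.random(30)],
    "sensor_c": rng.random(100),
})

threshold = 0.5                                  # remove columns with >50% missing values
missing_ratio = df.isna().mean()                 # fraction of missing values per column
df_reduced = df.loc[:, missing_ratio <= threshold]
print(missing_ratio.round(2).to_dict(), "->", list(df_reduced.columns))
```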
Other Dimensionality Reduction
Techniques
 Low Variance Filter.
 Similar to the previous technique, data columns
with little variation carry little information
(a sketch follows this slide).
 Thus all data columns with variance lower than a
given threshold are removed.
 A word of caution: variance is range dependent;
therefore normalization is required before applying
this technique.
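A minimal sketch of the low-variance filter with scikit-learn's VarianceThreshold; the synthetic columns and the 0.01 threshold are illustrative, and min-max scaling is applied first because, as the slide notes, raw variances are range dependent.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([
    rng.normal(loc=50, scale=10, size=n),   # informative column, large raw range
    np.full(n, 3.7),                        # constant column
    np.r_[np.zeros(n - 2), np.ones(2)],     # flag that is almost always 0
])

# Variance is range dependent, so bring every column to [0, 1] first.
X_scaled = MinMaxScaler().fit_transform(X)

selector = VarianceThreshold(threshold=0.01)    # drop (near-)constant columns
X_reduced = selector.fit_transform(X_scaled)
print(selector.variances_.round(4), X_reduced.shape)   # keeps only the first column
```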
Other Dimensionality Reduction
Techniques
 High Correlation Filter.
 Data columns with very similar trends are also likely to carry very
similar information.
 In this case, only one of them will suffice to feed the machine
learning model.
 Here we calculate the correlation coefficient between numerical
columns and between nominal columns as the Pearson’s Product
Moment Coefficient and the Pearson’s chi square value respectively.
 Pairs of columns with correlation coefficient higher than a threshold
are reduced to only one.
 A word of caution: correlation is scale sensitive; therefore column
normalization is required for a meaningful correlation comparison.
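A minimal pandas sketch of the high-correlation filter for numerical columns (Pearson correlation only; the chi-square treatment of nominal columns is not shown); the column names, the synthetic data, and the 0.95 threshold are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 300
temp_c = rng.normal(20, 5, n)
df = pd.DataFrame({
    "temp_c": temp_c,
    "temp_f": temp_c * 9 / 5 + 32 + rng.normal(0, 0.1, n),  # near-duplicate column
    "humidity": rng.uniform(30, 90, n),
})

# Absolute Pearson correlation between the numerical columns.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df_reduced = df.drop(columns=to_drop)
print(to_drop, "->", list(df_reduced.columns))   # drops 'temp_f'
```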
Other Dimensionality Reduction
Techniques
 Random Forests / Ensemble Trees.
 Decision Tree Ensembles, also referred to as random
forests, are useful for feature selection in addition
to being effective classifiers.
 One approach to dimensionality reduction is to
generate a large and carefully constructed set of
trees against a target attribute and then use each
attribute’s usage statistics to find the most
informative subset of features.
Other Dimensionality Reduction
Techniques
 Random Forests / Ensemble Trees.
 Specifically, we can generate a large set (e.g., 2,000) of
very shallow trees (e.g., 2 levels), with each tree
trained on a small fraction of the attributes (e.g., 3
attributes per tree). If an attribute is often selected as the best
split, it is most likely an informative feature to
retain. A score calculated from the attribute usage
statistics in the random forest tells us, relative to
the other attributes, which are the most predictive
attributes (a scikit-learn sketch follows this slide).
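A hedged scikit-learn sketch of this idea: the forest below mimics the slide's recipe with many very shallow trees, and max_features=3 restricts the candidate attributes per split (a close stand-in for the slide's "small fraction of attributes per tree"); the built-in feature_importances_ score plays the role of the attribute-usage statistics, and the demo data set is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# Many very shallow trees, each considering only a few candidate features per split.
forest = RandomForestClassifier(
    n_estimators=2000, max_depth=2, max_features=3, random_state=0
).fit(X, y)

# Attributes chosen often (and usefully) as splits receive high importance scores.
importances = forest.feature_importances_
top = np.argsort(importances)[::-1][:5]
print(top, importances[top].round(3))   # indices of the most informative features
```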
Other Dimensionality Reduction
Techniques
 Backward Feature Elimination.
 In this technique, at a given iteration, the selected
classification algorithm is trained on n input
features.
 Then we remove one input feature at a time and
train the same model on n-1 input features n times.
 The input feature whose removal has produced the
smallest increase in the error rate is removed,
leaving us with n-1 input features.
Other Dimensionality Reduction
Techniques
 Backward Feature Elimination.
 The classification is then repeated using n-2 features, and so on.
 Each iteration k produces a model trained on n-k features and an error rate e(k).
 Selecting the maximum tolerable error rate, we
define the smallest number of features necessary to
reach that classification performance with the
selected machine learning algorithm.
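A minimal sketch of backward feature elimination as described above; cross-validated error stands in for the slide's error rate e(k), the decision tree and the 10% tolerance are arbitrary choices, and the loop is written for clarity rather than speed.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

features = list(range(X.shape[1]))
history = []                                    # (number of features kept, error e(k))
while len(features) > 1:
    # Error after removing each candidate feature, estimated by cross-validation.
    errors = {f: 1 - cross_val_score(clf, X[:, [g for g in features if g != f]],
                                     y, cv=5).mean()
              for f in features}
    f_drop, err = min(errors.items(), key=lambda kv: kv[1])
    features.remove(f_drop)                     # removal with the smallest error increase
    history.append((len(features), err))

max_error = 0.10                                # maximum tolerable error rate
feasible = [k for k, e in history if e <= max_error]
print("smallest feature count within tolerance:", min(feasible) if feasible else "none")
```

scikit-learn's SequentialFeatureSelector(direction='backward') implements essentially this search, and direction='forward' covers the Forward Feature Construction technique on the next slide.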
Other Dimensionality Reduction
Techniques
 Forward Feature Construction.
 This is the inverse process to the Backward Feature
Elimination.
 We start with 1 feature only, progressively adding 1
feature at a time, i.e. the feature that produces the
highest increase in performance.
 Both algorithms, Backward Feature Elimination and
Forward Feature Construction, are quite time-consuming and
computationally expensive. In practice they are only
applicable to data sets that already have a relatively low
number of input columns.
Other Dimensionality Reduction
Techniques
Handcrafted features (Features from Online Signature)
41
 Height of the signature
 Width of the signature
 Height to width ratio
 Total time
 Total distance covered by pen/finger
 Average pressure
 Maximum pressure
 Total number of pen-ups
 Total duration for pen-ups
Online Signature illustration
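A sketch of how such handcrafted features could be computed from a captured pen trajectory; the array layout (time, x, y, pressure, pen-down flag) is an assumption about the acquisition device, not something specified on the slide.

```python
import numpy as np

def signature_features(t, x, y, pressure, pen_down):
    """Hand-crafted features from one online signature.

    t, x, y, pressure, pen_down are equal-length 1-D arrays sampled by the
    device; pen_down is 1 while the pen/finger touches the surface.
    (This layout is a hypothetical assumption -- adapt it to your format.)
    """
    width = x.max() - x.min()
    height = y.max() - y.min()
    dist = np.sum(np.hypot(np.diff(x), np.diff(y)))         # total distance covered
    pen_ups = np.sum(np.diff(pen_down.astype(int)) == -1)   # touch -> lift transitions
    dt = np.diff(t, append=t[-1])                           # duration of each sample
    return {
        "height": height,
        "width": width,
        "height_to_width": height / width if width else np.nan,
        "total_time": t[-1] - t[0],
        "total_distance": dist,
        "avg_pressure": pressure[pen_down == 1].mean(),
        "max_pressure": pressure.max(),
        "num_pen_ups": int(pen_ups),
        "pen_up_duration": float(dt[pen_down == 0].sum()),
    }
```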
HOG - Feature Extraction
42
 Input image: a fingerprint; HOG computed with Orientation = 9 and Cells_per_block = [2, 2] in every case:
 Pixels_per_cell = [4, 4]: feature vector length = 170,496
 Pixels_per_cell = [8, 8]: feature vector length = 40,176
 Pixels_per_cell = [16, 16]: feature vector length = 9,180
 Smaller cells produce more cells and blocks, and therefore a much longer feature vector (a scikit-image sketch follows below).
HOG - Histogram of Oriented Gradients (HOG) technique for feature extraction of fingerprints
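A minimal scikit-image sketch of the HOG configurations above; the test image and its 256x256 size are placeholders for a fingerprint, so the exact vector lengths differ from the slide, but the relationship between cell size and descriptor length is the same.

```python
from skimage import data, transform
from skimage.feature import hog

# Any grayscale image works; the camera() test image stands in for a fingerprint.
image = transform.resize(data.camera(), (256, 256))

for cell in (4, 8, 16):
    features = hog(
        image,
        orientations=9,
        pixels_per_cell=(cell, cell),
        cells_per_block=(2, 2),
        feature_vector=True,
    )
    print(f"pixels_per_cell=({cell},{cell}) -> feature vector length {features.shape[0]}")
# Smaller cells -> more cells and blocks -> a much longer HOG descriptor,
# matching the trend on the slide (16x16 gives the shortest vector, 4x4 the longest).
```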