WORKING WITH DATA FOR
MACHINE LEARNING
PROJECTS (CONT.)
By Dr. Mehwish
Miscellaneous Topics
2
 Feature selection
 Automatic feature selection
 Hand crafted features
 Feature engineering
 Feature extraction
 Curse of dimensionality
 Dimensionality reduction
Dimensionality Reduction
3
 Dimensionality reduction, or dimension reduction,
is the transformation of data from a high-
dimensional space into a low-dimensional space so
that the low-dimensional representation retains
some meaningful properties of the original data.
Dimensionality Reduction
4
 Why?
 Increasing the number of inputs or features does not
always improve the classification accuracy.
 The performance of a classifier may degrade when
irrelevant or redundant features are included.
 Curse of dimensionality: the “intrinsic” dimensionality of
the data may be much smaller than the number of
features actually measured.
Dimensionality Reduction
5
 Benefits
 Improves the classification performance.
 Improves learning efficiency and enables faster
classification.
 Better understanding of the underlying process
mapping inputs to output.
Dimensionality Reduction
6
 Benefits
 Dimensionality reduction reduces the risk
of overfitting
 Dimensionality reduction is extremely useful for data
visualization
 Dimensionality reduction can remove noise from the data
 Dimensionality reduction can be used for image
compression
Dimensionality Reduction
7
 Feature Selection:
 Select a subset of the existing features.
 Select the subset of features that either improves the
classification accuracy or maintains the same accuracy.
Dimensionality Reduction
8
 Data set:
 Five Boolean features
 y = x1 OR x2
 x3 = NOT x2
 x4 = NOT x5
 Optimal subset: {x1, x2} or {x1, x3}
 An exhaustive search over the space of all feature subsets would have
 2^d possibilities
 We can’t search over all possibilities, and therefore we rely
on heuristic methods (a brute-force sketch of the exhaustive search follows this slide).
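To make the 2^d explosion concrete, here is a minimal brute-force sketch in plain Python (standard library only) that enumerates every subset for the slide's 5-feature Boolean example; the data generation simply encodes the definitions above, and the sufficiency test is an illustrative stand-in for a real accuracy criterion.

```python
from itertools import combinations, product

# Enumerate every row of the 5-feature Boolean data set defined on the slide:
# y = x1 OR x2,  x3 = NOT x2,  x4 = NOT x5  (x1, x2, x5 vary freely).
rows = []
for x1, x2, x5 in product([0, 1], repeat=3):
    x3, x4 = 1 - x2, 1 - x5
    rows.append(((x1, x2, x3, x4, x5), x1 | x2))

features = range(5)  # indices of x1..x5

def is_sufficient(subset):
    """A subset suffices if no two rows agree on it yet disagree on y."""
    seen = {}
    for x, y in rows:
        key = tuple(x[i] for i in subset)
        if seen.setdefault(key, y) != y:
            return False
    return True

# 2**5 = 32 candidate subsets here, but 2**d in general -- infeasible for large d.
smallest = min((s for k in range(6) for s in combinations(features, k)
                if is_sufficient(s)), key=len)
print(smallest)   # (0, 1), i.e. {x1, x2}; {x1, x3} is equally small
```

With d features the loop visits 2^d subsets, which is why the deck falls back on the heuristic filter and wrapper methods described next.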
Dimensionality Reduction
9
 How do we choose this subset?
 Feature selection can be considered an optimization problem that
involves
 Searching the space of possible feature subsets
 Choosing the subset that is optimal or near-optimal with respect to some
objective function
 Filter Methods (unsupervised method)
 Evaluation is independent of the learning algorithm
 Consider the input features only and select the subset that carries the most
information
 Wrapper Methods (supervised method)
 Evaluation is carried out using the machine learning algorithm itself
(model selection)
 Train on the selected subset and estimate the error on a validation dataset
(a filter-vs-wrapper sketch follows this slide)
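As a concrete contrast between the two families, here is a hedged scikit-learn sketch (the library and the breast-cancer demo data are assumptions, not part of the slides): the filter step ranks features by mutual information without any classifier in the loop, while the wrapper-style check trains the chosen classifier on the candidate subset and estimates its error by cross-validation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Filter method: rank features by mutual information with the label,
# independently of any learning algorithm.
filt = SelectKBest(mutual_info_classif, k=10).fit(X, y)
X_filtered = filt.transform(X)

# Wrapper-style evaluation: train the chosen classifier on the candidate
# subset and estimate its error (here via cross-validation).
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print("10 filtered features:", cross_val_score(clf, X_filtered, y, cv=5).mean().round(3))
print("all 30 features:     ", cross_val_score(clf, X, y, cv=5).mean().round(3))
```

A full wrapper would repeat the scoring step inside a search over subsets, which is exactly the forward and backward search described on the next slide.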
Dimensionality Reduction
10
 [Figure-only slide.]
Dimensionality Reduction
11
 Wrapper Method (a code sketch of the forward search follows this slide):
 Forward Search Feature Subset Selection Algorithm
 Start with empty set as feature subset
 Try adding one feature from the remaining features to the subset
 Estimate classification or regression error for adding each feature
 Add the feature that gives the maximum improvement to the subset
 Backward Search Feature Subset Selection Algorithm
 Start with full feature set as subset
 Try removing one feature from the subset
 Estimate classification or regression error for removing each
feature
 Remove/drop the feature whose removal has the least impact on the
error, or reduces it
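Below is a minimal sketch of the forward-search wrapper described above, assuming scikit-learn and its wine demo data set; cross-validated accuracy stands in for the slide's classification error (maximizing accuracy is equivalent to minimizing error), and the k-NN classifier is an arbitrary choice.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=5)

selected, remaining = [], list(range(X.shape[1]))
best_score = 0.0
while remaining:
    # Try adding each remaining feature and keep the one that helps most.
    scores = {f: cross_val_score(clf, X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    f_best, s_best = max(scores.items(), key=lambda kv: kv[1])
    if s_best <= best_score:          # no further improvement -> stop
        break
    selected.append(f_best)
    remaining.remove(f_best)
    best_score = s_best

print("selected features:", selected, "cv accuracy: %.3f" % best_score)
```

The backward search is the mirror image: start from the full feature set and repeatedly drop the feature whose removal hurts the cross-validated score least.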
Dimensionality Reduction
12
 Feature Extraction:
 Transform existing features to obtain a set of new
features using some mapping function.
Dimensionality Reduction
13
 Feature Extraction:
 The mapping function z = f(x) can be linear or non-linear.
 It can be interpreted as a projection or mapping of the
data from the higher-dimensional space to the lower-
dimensional space.
 Mathematically, we want to find an optimum
mapping z = f(x) that preserves the desired
information as much as possible.
Dimensionality Reduction
14
 Finding the optimum mapping is equivalent to
optimizing an objective function.
 Different methods use different objective functions:
 Minimize information loss: a mapping that represents the
data as accurately as possible in the lower-dimensional
space, e.g., Principal Component Analysis (PCA).
 Maximize discriminatory information: a mapping that
best discriminates the classes in the lower-dimensional
space, e.g., Linear Discriminant Analysis (LDA) (a PCA vs. LDA sketch follows this slide).
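A short sketch of the two objectives side by side, assuming scikit-learn and its iris demo data: PCA keeps the directions of maximum variance (minimum information loss) without looking at the labels, while LDA uses the labels to keep the most discriminative directions.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)   # PCA is scale sensitive

# PCA: unsupervised, keeps the directions of maximum variance.
X_pca = PCA(n_components=2).fit_transform(X_std)

# LDA: supervised, keeps the directions that best separate the classes.
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X_std, y)

print(X_pca.shape, X_lda.shape)   # both (150, 2), but optimized for different objectives
```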
Dimensionality Reduction
15–18
 [Figure-only slides.]
Dimensionality Reduction
19
 Step 3: Calculate the Covariance Matrix
 What do the covariances that appear as entries of the
matrix tell us about the correlations between the
variables?
 It is actually the sign of the covariance that matters:
 If positive: the two variables increase or decrease
together (positively correlated)
 If negative: one increases when the other
decreases (inversely correlated) — see the small numerical example after this slide
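A small numerical example of this step, using NumPy on synthetic data whose correlations are known by construction (the variable names and coefficients are illustrative only).

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: x2 rises with x1 (positive covariance), x3 falls as x1 rises (negative).
x1 = rng.normal(size=200)
x2 = 2.0 * x1 + rng.normal(scale=0.5, size=200)
x3 = -1.5 * x1 + rng.normal(scale=0.5, size=200)
X = np.column_stack([x1, x2, x3])

# Standardize the variables, then compute the covariance matrix.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
cov = np.cov(X_std, rowvar=False)   # 3x3 symmetric matrix
print(np.round(cov, 2))             # signs of the off-diagonal entries show the correlations
```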
Dimensionality Reduction
20
 Step 4: Compute the eigenvectors and eigenvalues of the
covariance matrix:
 Eigenvectors and eigenvalues are calculated to determine
the principal components of the data.
 For example, 10-dimensional data gives you 10 principal
components.
 Only the first few components are kept, because most of the
information in the initial variables is squeezed
(compressed) into them.
 To put it simply, think of principal components as
new axes that provide the best angle from which to view and evaluate
the data, so that the differences between the observations
are more visible (a code sketch follows this slide).
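The eigen-decomposition step can be sketched directly with NumPy; the wine demo data set from scikit-learn is an arbitrary stand-in for the slide's 10-dimensional example (it has 13 features), and keeping k = 2 components is an illustrative choice.

```python
import numpy as np
from sklearn.datasets import load_wine

X, _ = load_wine(return_X_y=True)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 4: eigenvectors/eigenvalues of the covariance matrix.
cov = np.cov(X_std, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh: the covariance matrix is symmetric
order = np.argsort(eigvals)[::-1]           # sort from largest eigenvalue down
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Share of the total variance carried by each principal component.
explained = eigvals / eigvals.sum()
print(np.round(explained[:3], 3))           # the first few components dominate

# Keep only the first k components and project (13 dims squeezed into k).
k = 2
X_reduced = X_std @ eigvecs[:, :k]
print(X_reduced.shape)                      # (178, 2)
```

In practice, sklearn.decomposition.PCA wraps essentially these steps (centering, covariance/SVD, sorting by explained variance, projection).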
Dimensionality Reduction
21
 [Figure-only slide.]
Dimensionality Reduction
22
 For example, assume that the scatter plot of our
data set is as shown on the following (figure-only) slide. Can we guess the first
principal component?
 Yes, it is approximately the line through the purple
marks: it passes through the origin, and it is the
line along which the projections of the points (red dots) are
most spread out.
 Mathematically speaking, it is the line that maximizes
the variance, i.e., the average of the squared distances
from the projected points (red dots) to the origin.
Dimensionality Reduction
23
 [Figure-only slides, including the scatter plot referenced on the previous slide.]
Other Dimensionality Reduction
Techniques
 Missing Values Ratio.
 Data columns with too many missing values are
unlikely to carry much useful information.
 Thus, data columns whose ratio of missing values
exceeds a given threshold can be removed.
 The lower the threshold, the more aggressive the
reduction (a pandas sketch follows this slide).
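A minimal pandas sketch of the missing-values-ratio filter; the column names, the 50% threshold, and the synthetic frame are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical frame: 'sensor_b' is 70% missing, the other columns are complete.
df = pd.DataFrame({
    "sensor_a": rng.random(100),
    "sensor_b": np.r_[np.full(70, np.nan), rng.random(30)],
    "sensor_c": rng.random(100),
})

threshold = 0.5                                  # remove columns with >50% missing values
missing_ratio = df.isna().mean()                 # fraction of missing values per column
df_reduced = df.loc[:, missing_ratio <= threshold]
print(missing_ratio.round(2).to_dict(), "->", list(df_reduced.columns))
```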
Other Dimensionality Reduction
Techniques
 Low Variance Filter.
 Similar to the previous technique, data columns
with little variation carry little information
(a sketch follows this slide).
 Thus all data columns with variance lower than a
given threshold are removed.
 A word of caution: variance is range dependent;
therefore normalization is required before applying
this technique.
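A minimal sketch of the low-variance filter with scikit-learn's VarianceThreshold; the synthetic columns and the 0.01 threshold are illustrative, and min-max scaling is applied first because, as the slide notes, raw variances are range dependent.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([
    rng.normal(loc=50, scale=10, size=n),   # informative column, large raw range
    np.full(n, 3.7),                        # constant column
    np.r_[np.zeros(n - 2), np.ones(2)],     # flag that is almost always 0
])

# Variance is range dependent, so bring every column to [0, 1] first.
X_scaled = MinMaxScaler().fit_transform(X)

selector = VarianceThreshold(threshold=0.01)    # drop (near-)constant columns
X_reduced = selector.fit_transform(X_scaled)
print(selector.variances_.round(4), X_reduced.shape)   # keeps only the first column
```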
Other Dimensionality Reduction
Techniques
 High Correlation Filter.
 Data columns with very similar trends are also likely to carry very
similar information.
 In this case, only one of them will suffice to feed the machine
learning model.
 Here we calculate the correlation coefficient between numerical
columns and between nominal columns as the Pearson’s Product
Moment Coefficient and the Pearson’s chi square value respectively.
 Pairs of columns with correlation coefficient higher than a threshold
are reduced to only one.
 A word of caution: correlation is scale sensitive; therefore column
normalization is required for a meaningful correlation comparison.
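A minimal pandas sketch of the high-correlation filter for numerical columns (Pearson correlation only; the chi-square treatment of nominal columns is not shown); the column names, the synthetic data, and the 0.95 threshold are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 300
temp_c = rng.normal(20, 5, n)
df = pd.DataFrame({
    "temp_c": temp_c,
    "temp_f": temp_c * 9 / 5 + 32 + rng.normal(0, 0.1, n),  # near-duplicate column
    "humidity": rng.uniform(30, 90, n),
})

# Absolute Pearson correlation between the numerical columns.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df_reduced = df.drop(columns=to_drop)
print(to_drop, "->", list(df_reduced.columns))   # drops 'temp_f'
```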
Other Dimensionality Reduction
Techniques
 Random Forests / Ensemble Trees.
 Decision Tree Ensembles, also referred to as random
forests, are useful for feature selection in addition
to being effective classifiers.
 One approach to dimensionality reduction is to
generate a large and carefully constructed set of
trees against a target attribute and then use each
attribute’s usage statistics to find the most
informative subset of features.
Other Dimensionality Reduction
Techniques
 Random Forests / Ensemble Trees.
 Specifically, we can generate a large set (e.g., 2,000) of
very shallow trees (e.g., 2 levels), with each tree
trained on a small fraction of the attributes (e.g., 3
attributes per tree). If an attribute is often selected as the best
split, it is most likely an informative feature to
retain. A score calculated from the attribute usage
statistics in the random forest tells us, relative to
the other attributes, which are the most predictive
attributes (a scikit-learn sketch follows this slide).
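A hedged scikit-learn sketch of this idea: the forest below mimics the slide's recipe with many very shallow trees, and max_features=3 restricts the candidate attributes per split (a close stand-in for the slide's "small fraction of attributes per tree"); the built-in feature_importances_ score plays the role of the attribute-usage statistics, and the demo data set is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# Many very shallow trees, each considering only a few candidate features per split.
forest = RandomForestClassifier(
    n_estimators=2000, max_depth=2, max_features=3, random_state=0
).fit(X, y)

# Attributes chosen often (and usefully) as splits receive high importance scores.
importances = forest.feature_importances_
top = np.argsort(importances)[::-1][:5]
print(top, importances[top].round(3))   # indices of the most informative features
```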
Other Dimensionality Reduction
Techniques
 Backward Feature Elimination.
 In this technique, at a given iteration, the selected
classification algorithm is trained on n input
features.
 Then we remove one input feature at a time and
train the same model on n-1 input features n times.
 The input feature whose removal has produced the
smallest increase in the error rate is removed,
leaving us with n-1 input features.
Other Dimensionality Reduction
Techniques
 Backward Feature Elimination.
 The classification is then repeated using n-2 features, and so on.
 Each iteration k produces a model trained on n-k features and an error rate e(k).
 Selecting the maximum tolerable error rate, we
define the smallest number of features necessary to
reach that classification performance with the
selected machine learning algorithm.
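A minimal sketch of backward feature elimination as described above; cross-validated error stands in for the slide's error rate e(k), the decision tree and the 10% tolerance are arbitrary choices, and the loop is written for clarity rather than speed.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

features = list(range(X.shape[1]))
history = []                                    # (number of features kept, error e(k))
while len(features) > 1:
    # Error after removing each candidate feature, estimated by cross-validation.
    errors = {f: 1 - cross_val_score(clf, X[:, [g for g in features if g != f]],
                                     y, cv=5).mean()
              for f in features}
    f_drop, err = min(errors.items(), key=lambda kv: kv[1])
    features.remove(f_drop)                     # removal with the smallest error increase
    history.append((len(features), err))

max_error = 0.10                                # maximum tolerable error rate
feasible = [k for k, e in history if e <= max_error]
print("smallest feature count within tolerance:", min(feasible) if feasible else "none")
```

scikit-learn's SequentialFeatureSelector(direction='backward') implements essentially this search, and direction='forward' covers the Forward Feature Construction technique on the next slide.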
Other Dimensionality Reduction
Techniques
 Forward Feature Construction.
 This is the inverse process to the Backward Feature
Elimination.
 We start with 1 feature only, progressively adding 1
feature at a time, i.e. the feature that produces the
highest increase in performance.
 Both algorithms, Backward Feature Elimination and
Forward Feature Construction, are quite time-consuming and
computationally expensive. In practice they are only
applicable to data sets that already have a relatively low
number of input columns.
Other Dimensionality Reduction
Techniques
Handcrafted features (Features from Online Signature)
41
 Height of the signature
 Width of the signature
 Height to width ratio
 Total time
 Total distance covered by pen/finger
 Average pressure
 Maximum pressure
 Total number of pen-ups
 Total duration for pen-ups
Online Signature illustration
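A sketch of how such handcrafted features could be computed from a captured pen trajectory; the array layout (time, x, y, pressure, pen-down flag) is an assumption about the acquisition device, not something specified on the slide.

```python
import numpy as np

def signature_features(t, x, y, pressure, pen_down):
    """Hand-crafted features from one online signature.

    t, x, y, pressure, pen_down are equal-length 1-D arrays sampled by the
    device; pen_down is 1 while the pen/finger touches the surface.
    (This layout is a hypothetical assumption -- adapt it to your format.)
    """
    width = x.max() - x.min()
    height = y.max() - y.min()
    dist = np.sum(np.hypot(np.diff(x), np.diff(y)))         # total distance covered
    pen_ups = np.sum(np.diff(pen_down.astype(int)) == -1)   # touch -> lift transitions
    dt = np.diff(t, append=t[-1])                           # duration of each sample
    return {
        "height": height,
        "width": width,
        "height_to_width": height / width if width else np.nan,
        "total_time": t[-1] - t[0],
        "total_distance": dist,
        "avg_pressure": pressure[pen_down == 1].mean(),
        "max_pressure": pressure.max(),
        "num_pen_ups": int(pen_ups),
        "pen_up_duration": float(dt[pen_down == 0].sum()),
    }
```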
HOG - Feature Extraction
42
 Input image: a fingerprint; HOG computed with Orientation = 9 and Cells_per_block = [2, 2] in every case:
 Pixels_per_cell = [4, 4]: feature vector length = 170,496
 Pixels_per_cell = [8, 8]: feature vector length = 40,176
 Pixels_per_cell = [16, 16]: feature vector length = 9,180
 Smaller cells produce more cells and blocks, and therefore a much longer feature vector (a scikit-image sketch follows below).
HOG - Histogram of Oriented Gradients (HOG) technique for feature extraction of fingerprints
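A minimal scikit-image sketch of the HOG configurations above; the test image and its 256x256 size are placeholders for a fingerprint, so the exact vector lengths differ from the slide, but the relationship between cell size and descriptor length is the same.

```python
from skimage import data, transform
from skimage.feature import hog

# Any grayscale image works; the camera() test image stands in for a fingerprint.
image = transform.resize(data.camera(), (256, 256))

for cell in (4, 8, 16):
    features = hog(
        image,
        orientations=9,
        pixels_per_cell=(cell, cell),
        cells_per_block=(2, 2),
        feature_vector=True,
    )
    print(f"pixels_per_cell=({cell},{cell}) -> feature vector length {features.shape[0]}")
# Smaller cells -> more cells and blocks -> a much longer HOG descriptor,
# matching the trend on the slide (16x16 gives the shortest vector, 4x4 the longest).
```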