Using Feature Grouping as a
Stochastic Regularizer for
High Dimensional Noisy Data
Sergül Aydöre
Assistant Professor
Electrical and Computer Engineering
Stevens Institute of Technology
2
Landscape of Machine Learning Applications
https://research.hubspot.com/charts/simplified-ai-landscape
• Data is high dimensional and noisy, and the sample size is small, as in neuroimaging
3
But what if
PET acquisition process
wikipedia
Implantation of intracranial
electrodes.
Cleveland Epilepsy Clinic
An elastic EEG cap with 60
electrodes [Bai2012]
Typical MEG equipment [BML2001]
MRI Scanner and rs-fMRI time series acquisition [NVIDIA]
4
Other High Dimensional, Noisy Data and Small
Sample Size Situations
Genomics
Integrative Genomics Viewer, 2012
Seismology
https://www.mapnagroup.com
Astronomy
Astronomy Magazine, 2015
5
Challenges
1. High Dimensionality of the data due to rich temporal and
spatial structure
6
Challenges
1. High Dimensionality of the data due to rich temporal and
spatial structure
2. Noise in the data due to mechanical or physical artifacts.
7
Challenges
1. High Dimensionality of the data due to rich temporal and
spatial structure
2. Noise in the data due to mechanical or physical artifacts.
3. Difficulty and cost of data collection
8
Overfitting
• ML models with a large number of parameters require a large amount of data; otherwise, overfitting can occur!
http://scott.fortmann-roe.com/docs/MeasuringError.html
9
Regularization Methods to overcome Overfitting
• Early Stopping [Yao, 2007]
• Ridge Regression (ℓ2 regularization) [Tibshirani 1996]
• Least Absolute Shrinkage and Selection Operator
(LASSO or ℓ1 regularization) [Tibshirani 1996]
• Dropout [Srivastava 2014]
• Group Lasso [Yuan 2006]
Regularization Methods to overcome Overfitting
• Early Stopping
• Ridge Regression (ℓ2 regularization)
• Least Absolute Shrinkage and Selection Operator
(LASSO or ℓ1 regularization )
• Dropout
• Group Lasso
SPARSITY
Regularization Methods to overcome Overfitting
• Early Stopping
• Ridge Regression (ℓ2 regularization)
• Least Absolute Shrinkage and Selection Operator
(LASSO or ℓ1 regularization )
• Dropout
• Group Lasso
STOCHASTICITY
SPARSITY
Regularization Methods to overcome Overfitting
• Early Stopping
• Ridge Regression (ℓ2 regularization)
• Least Absolute Shrinkage and Selection Operator
(LASSO or ℓ1 regularization )
• Dropout
• Group Lasso
STOCHASTICITY
STRUCTURE & SPARSITY
12
SPARSITY
Regularization Methods to overcome Overfitting
• Early Stopping
• Ridge Regression (ℓ2 regularization)
• Least Absolute Shrinkage and Selection Operator
(LASSO or ℓ1 regularization )
• Dropout
• Group Lasso
• PROPOSED: STRUCTURE & STOCHASTICITY
STOCHASTICITY
STRUCTURE & SPARSITY
13
SPARSITY
14
Problem Setting: Supervised Learning
• Training samples: drawn from an underlying data distribution
• Parameters of the model are estimated by minimizing the average loss per sample over the training set
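The formulas on this slide appeared as images; as a sketch with assumed notation, the setup is the usual empirical risk minimization:

\[
\{(x_i, y_i)\}_{i=1}^{n} \sim \mathcal{D}, \qquad
\hat{\theta} = \arg\min_{\theta}\; \frac{1}{n}\sum_{i=1}^{n} \ell\big(f_{\theta}(x_i),\, y_i\big),
\]

where \(\ell\) denotes the loss per sample and \(f_{\theta}\) the model.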
15
Multinomial Logistic Regression
• The class label probability of a given input is:
• Hence, the parameter space is
• The loss per sample is:
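The slide's equations were images; the standard forms, with notation assumed, are:

\[
p(y = c \mid x) = \frac{\exp(w_c^{\top} x + b_c)}{\sum_{c'=1}^{C} \exp(w_{c'}^{\top} x + b_{c'})},
\qquad
\theta = \{\, W \in \mathbb{R}^{p \times C},\; b \in \mathbb{R}^{C} \,\},
\qquad
\ell(x, y; \theta) = -\log p(y \mid x).
\]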
16
Dropout
• Randomly removes units in the network during training.
• Idea: Prevents units from co-adapting too much.
• Attractive property: Can be used inside stochastic gradient descent
without an additional computation cost.
[Srivastava 2014]
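As a minimal illustration (not the authors' code; the NumPy usage and keep probability are assumptions), inverted dropout on input features looks like this:

import numpy as np

def dropout(x, p_keep=0.8, training=True, rng=np.random.default_rng(0)):
    # Randomly zero each feature with probability 1 - p_keep during training,
    # and rescale so the expected value of the output matches the input.
    if not training:
        return x
    mask = rng.random(x.shape) < p_keep
    return x * mask / p_keep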
19
Feature Dropout Matrices
Randomly picked matrix
Dropout for Multinomial Logistic Regression
20
Feature Dropout Matrices
Randomly picked matrix
PERSON A
PERSON B
PERSON X
PERSON Y
PERSON Z
Dropout for Multinomial Logistic Regression
21
Feature Dropout Matrices
Randomly picked matrix
PERSON A
PERSON B
PERSON X
PERSON Y
PERSON Z
Forward Propagation
Dropout for Multinomial Logistic Regression
22
Feature Dropout Matrices
Randomly picked matrix
PERSON A
PERSON B
PERSON X
PERSON Y
PERSON Z
Forward Propagation
Back Propagation
Dropout for Multinomial Logistic Regression
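Putting the last few slides together, one SGD step of multinomial logistic regression with a randomly picked feature-dropout mask might be sketched as follows (a hedged reconstruction, not the authors' code; the shapes, keep probability, and learning rate are my assumptions):

import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def dropout_sgd_step(X, y, W, b, p_keep=0.8, lr=0.1, rng=np.random.default_rng(0)):
    # X: (n, p) minibatch, y: (n,) integer labels, W: (p, C), b: (C,)
    n, p = X.shape
    mask = (rng.random((1, p)) < p_keep) / p_keep   # randomly picked dropout "matrix" (diagonal)
    Xd = X * mask                                   # forward propagation on masked features
    P = softmax(Xd @ W + b)                         # class probabilities
    P[np.arange(n), y] -= 1.0                       # gradient of the cross-entropy w.r.t. the logits
    grad_W = Xd.T @ P / n                           # back propagation passes through the same mask
    grad_b = P.mean(axis=0)
    return W - lr * grad_W, b - lr * grad_b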
23
Structured Projection Matrices
PERSON A
PERSON B
PERSON X
PERSON Y
PERSON Z
Forward Propagation
Back Propagation
Replace Masking with Structured Matrices
Randomly picked matrix
24
Replace Masking with Structured Matrices
25
Replace Masking with Structured Matrices
Each projection matrix is generated from r random samples drawn with replacement from the training data set of size n.
30
Replace Masking with Structured Matrices
We project the training samples onto a lower-dimensional space (the back-projection of a reduced sample approximates x); hence, the weight matrix is applied in the reduced space.
31
Replace Masking with Structured Matrices
To update the weights, we project the gradients back to the original space.
32
Replace Masking with Structured Matrices
No projection is necessary for the bias term.
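The slides above describe the proposed step: project the samples with a structured matrix Φ, apply the weights through their projection in the reduced space, project the gradients back, and leave the bias untouched. A sketch of this reading (reusing softmax from the dropout sketch above; Φ, the shapes, and the learning rate are assumptions, not the authors' implementation):

import numpy as np

def structured_projection_step(X, y, W, b, Phi, lr=0.1):
    # X: (n, p), W: (p, C), b: (C,), Phi: (k, p) structured feature-grouping matrix
    n = X.shape[0]
    Xr = X @ Phi.T                   # project the training samples to the lower-dimensional space
    P = softmax(Xr @ (Phi @ W) + b)  # the weight matrix acts through its projection Phi @ W
    P[np.arange(n), y] -= 1.0
    grad_Wr = Xr.T @ P / n           # gradient with respect to the reduced weights (k, C)
    grad_W = Phi.T @ grad_Wr         # project the gradients back to the original space (p, C)
    grad_b = P.mean(axis=0)          # no projection is needed for the bias term
    return W - lr * grad_W, b - lr * grad_b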
33
Dimensionality Reduction Method by Feature
Grouping
Hoyos-Idrobo 2016
36
Recursive Nearest Agglomeration Clustering
(ReNA)
Hoyos-Idrobo 2016
• Agglomerative clustering schemes start off by placing every data
element in its own cluster.
• They proceed by repeatedly merging the closest pair of connected clusters until the desired number of clusters is reached.
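ReNA itself is the clustering from [Hoyos-Idrobo 2016]; as a rough stand-in for experimentation, scikit-learn's FeatureAgglomeration performs a related agglomerative feature grouping on a spatial connectivity graph (this substitution and all parameter values are my assumptions):

import numpy as np
from sklearn.cluster import FeatureAgglomeration
from sklearn.feature_extraction.image import grid_to_graph

X = np.random.rand(100, 64 * 64)              # stand-in for flattened 64 x 64 images
connectivity = grid_to_graph(64, 64)          # only spatially adjacent pixels may be merged
fg = FeatureAgglomeration(n_clusters=256, connectivity=connectivity)
X_reduced = fg.fit_transform(X)               # (100, 256): one averaged value per feature cluster
X_approx = fg.inverse_transform(X_reduced)    # (100, 4096): piecewise-constant approximation of X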
37
Insights: Random Reductions While Fitting
• Let the input be the sum of a deterministic term and a zero-mean noise term.
• The expected loss splits into the loss on the smoothed input plus a regularization cost, which involves the variance of the model given the smoothed input features and the variance of the estimated target due to the randomization.
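The exact expression on the slide was an image; one standard way to obtain terms matching the labels above is a second-order expansion of the expected loss over the random reduction Φ (notation assumed):

\[
\mathbb{E}_{\Phi}\!\left[\ell\!\left(f_{\theta}(\Phi^{\top}\Phi x),\, y\right)\right]
\approx
\underbrace{\ell\!\left(f_{\theta}(\bar{x}),\, y\right)}_{\text{loss on the smoothed input}}
+
\underbrace{\tfrac{1}{2}\,\operatorname{tr}\!\left(\nabla_{x}^{2}\,\ell\!\left(f_{\theta}(\bar{x}), y\right)\,
\operatorname{Cov}_{\Phi}\!\left[\Phi^{\top}\Phi x\right]\right)}_{\text{regularization cost}},
\qquad
\bar{x} = \mathbb{E}_{\Phi}\!\left[\Phi^{\top}\Phi x\right].
\]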
Insights: Random Reductions While Fitting
• Regularization Cost:
• For dropout, the resulting matrix is diagonal, and the remaining factor is constant for linear regression.
• This is equivalent to ridge regression after “orthogonalizing” the features.
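For intuition behind the ridge equivalence (a standard result for linear models, e.g. Wager et al., “Dropout training as adaptive regularization”, NIPS 2013, rather than the exact expression on this slide): with inverted dropout at keep probability p on the features of a linear regression, the expected loss adds the penalty

\[
\frac{1-p}{p}\,\sum_{j=1}^{d} w_j^{2} \sum_{i=1}^{n} x_{ij}^{2},
\]

i.e. a ridge penalty after rescaling (“orthogonalizing”) each feature by its empirical second moment.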
39
Computational Complexity
Total number of epochs
40
Experimental Results: Olivetti Faces
• High-dimensional data with a small sample size
• Consists of grayscale 64 x 64 face images from 40 subjects
• For each subject, there are 10 different images under varying lighting.
• Goal: Identification of the individual whose picture was taken
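A minimal sketch for reproducing this setup with scikit-learn (the noise level, split, and random seeds are my assumptions, not the paper's exact protocol):

import numpy as np
from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import train_test_split

faces = fetch_olivetti_faces()                       # 400 grayscale 64 x 64 images of 40 subjects
X, y = faces.data, faces.target                      # X: (400, 4096) with pixel values in [0, 1]
rng = np.random.default_rng(0)
X_noisy = X + rng.normal(scale=0.5, size=X.shape)    # simulate a "high noise" setting (level assumed)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_noisy, y, test_size=0.25, stratify=y, random_state=0)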
(Slides 41–54 repeat the setup above and present the Olivetti results as figures.)
55
Experimental Results: Olivetti Faces
• Visualization of the learned weights for logistic regression for a single
Olivetti face with high noise using different regularizers.
56
Experimental Results: Olivetti Faces
• Performance in terms of loss as a function of computation time for
MLP with a single layer using feature grouping and best parameters
for other regularizers, for Olivetti face data with high noise.
57
Experimental Results: Neuroimaging Data Set
• Openly accessible fMRI data set from the Human Connectome Project
• 500 subjects, 8 cognitive tasks to classify
• Feature dimension: 33854, training set: 3052 samples, test set: 791
samples
(Slides 58–59 present the neuroimaging results as figures.)
60
Summary – Stochastic Regularizer
• We introduced a stochastic regularizer
based on feature averaging that
captures the structure of data.
• Our approach leads to higher accuracy
at high noise settings without
additional computation time.
• Learned weights have more structure
at high noise settings.
61
Collaborators and References
• S. Aydore, B. Thirion, O. Grisel, G. Varoquaux. “Using Feature Grouping as a Stochastic
Regularizer for High-Dimensional Noisy Data”, Women in Machine Learning Workshop, NeurIPS
2018, Montreal, Canada, 2018. arXiv preprint arXiv:1807.11718.
• S. Aydore, L. Dicker, D. Foster. “A Local Regret in Nonconvex Online Learning”, Continual
Learning Workshop, NeurIPS 2018, Montreal, Canada, 2018. arXiv preprint arXiv:1811.05095.
Bertrand Thirion
(INRIA, France)
Olivier Grisel
(INRIA, France)
Gaël Varoquaux
(INRIA, France)
Dean Foster
(Amazon & University of Pennsylvania)
Lee Dicker
(Amazon & Rutgers University)
Thank You
More on my website…
http://www.sergulaydore.com
Editor's Notes
  • #3: In the graphic below, the x-axis reflects the level of technical sophistication of the AI tool. The y-axis represents the mass appeal of the tool. Here is a landscape of popular machine learning applications. It is of course very exciting to see such progress in AI. But all these applications require massive amounts of data to train machine learning models.
  • #4: Some fields, such as brain imaging, often do not have such massive numbers of samples, whereas the feature dimension is large due to the rich spatial and temporal information.
  • #5: This problem is not limited to brain imaging. There are other fields which also suffer from small-sample data situations.
  • #9: The performance of machine learning models is often evaluated by their prediction ability on unseen data. While each iteration of model training decreases the training risk, fitting the training data too well can lead to a failure to generalize to future predictions. This phenomenon is often called “overfitting” in machine learning. The risk of overfitting is more severe in high-dimensional, data-scarce situations. Such situations are common when data collection is expensive, as in neuroscience, biology, or geology.
  • #34–#36: Feature grouping defines a matrix Φ_FG that extracts piecewise-constant approximations of the data. Let Φ_FG be a matrix composed of constant-amplitude groups (clusters). Formally, the set of k clusters is given by P = {C_1, C_2, ..., C_k}, where each cluster C_q ⊂ [p] contains a set of indices that does not overlap the other clusters, C_q ∩ C_l = ∅ for all q ≠ l. Thus (Φ_FG x)_q = α_q Σ_{j ∈ C_q} x_j yields a reduction of a data sample x on the q-th cluster, where α_q is a constant for each cluster. With an appropriate permutation of the indices of the data x, the matrix Φ_FG can be written in block form with one constant block per cluster. We call Φ_FG x ∈ R^k the reduced version of x and Φ_FG^T Φ_FG x ∈ R^p the approximation of x.