Using Feature Grouping as a
Stochastic Regularizer for
High Dimensional Noisy Data
Sergül Aydöre
Assistant Professor
Electrical and Computer Engineering
Stevens Institute of Technology
2
Landscape of Machine Learning Applications
https://research.hubspot.com/charts/simplified-ai-landscape
• Data is high dimensional and noisy, and the sample size is small, as in neuroimaging
3
But what if
PET acquisition process
wikipedia
Implantation of intracranial
electrodes.
Cleveland Epilepsy Clinic
An elastic EEG cap with 60
electrodes [Bai2012]
Typical MEG equipment [BML2001]
MRI Scanner and rs-fMRI time series acquisition [NVIDIA]
4
Other High Dimensional, Noisy Data and Small
Sample Size Situations
Genomics
Integrative Genomics Viewer, 2012
Seismology
https://www.mapnagroup.com
Astronomy
Astronomy Magazine, 2015
5
Challenges
1. High Dimensionality of the data due to rich temporal and
spatial structure
6
Challenges
1. High Dimensionality of the data due to rich temporal and
spatial structure
2. Noise in the data due to mechanical or physical artifacts.
7
Challenges
1. High Dimensionality of the data due to rich temporal and
spatial structure
2. Noise in the data due to mechanical or physical artifacts.
3. Difficulty and cost of data collection
8
Overfitting
• ML models with a large number of parameters require a large amount of data; otherwise, overfitting can occur!
http://scott.fortmann-roe.com/docs/MeasuringError.html
9
Regularization Methods to overcome Overfitting
• Early Stopping [Yao, 2007]
• Ridge Regression (ℓ2 regularization) [Tibshirani 1996]
• Least Absolute Shrinkage and Selection Operator
(LASSO or ℓ1 regularization) [Tibshirani 1996]
• Dropout [Srivastava 2014]
• Group Lasso [Yuan 2006]
Regularization Methods to overcome Overfitting
• Early Stopping
• Ridge Regression (ℓ2 regularization)
• Least Absolute Shrinkage and Selection Operator
(LASSO or ℓ1 regularization )
• Dropout
• Group Lasso
SPARSITY
Regularization Methods to overcome Overfitting
• Early Stopping
• Ridge Regression (ℓ2 regularization)
• Least Absolute Shrinkage and Selection Operator
(LASSO or ℓ1 regularization )
• Dropout
• Group Lasso
STOCHASTICITY
SPARSITY
Regularization Methods to overcome Overfitting
• Early Stopping
• Ridge Regression (ℓ2 regularization)
• Least Absolute Shrinkage and Selection Operator
(LASSO or ℓ1 regularization )
• Dropout
• Group Lasso
STOCHASTICITY
STRUCTURE & SPARSITY
12
SPARSITY
Regularization Methods to overcome Overfitting
• Early Stopping
• Ridge Regression (ℓ2 regularization)
• Least Absolute Shrinkage and Selection Operator
(LASSO or ℓ1 regularization )
• Dropout
• Group Lasso
• PROPOSED: STRUCTURE & STOCHASTICITY
STOCHASTICITY
STRUCTURE & SPARSITY
13
SPARSITY
14
Problem Setting: Supervised Learning
• Training samples: drawn from an underlying data distribution
• Parameters of the model are estimated by minimizing the average loss per sample over the training set
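The formulas on this slide appeared as images; as a sketch with assumed notation, the setup is the usual empirical risk minimization:

\[
\{(x_i, y_i)\}_{i=1}^{n} \sim \mathcal{D}, \qquad
\hat{\theta} = \arg\min_{\theta}\; \frac{1}{n}\sum_{i=1}^{n} \ell\big(f_{\theta}(x_i),\, y_i\big),
\]

where \(\ell\) denotes the loss per sample and \(f_{\theta}\) the model.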
15
Multinomial Logistic Regression
• The class label probability of a given input is:
• Hence, the parameter space is
• The loss per sample is:
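The slide's equations were images; the standard forms, with notation assumed, are:

\[
p(y = c \mid x) = \frac{\exp(w_c^{\top} x + b_c)}{\sum_{c'=1}^{C} \exp(w_{c'}^{\top} x + b_{c'})},
\qquad
\theta = \{\, W \in \mathbb{R}^{p \times C},\; b \in \mathbb{R}^{C} \,\},
\qquad
\ell(x, y; \theta) = -\log p(y \mid x).
\]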
16
Dropout
• Randomly removes units in the network during training.
• Idea: Prevents units from co-adapting too much.
• Attractive property: Can be used inside stochastic gradient descent
without an additional computation cost.
[Srivastava 2014]
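As a minimal illustration (not the authors' code; the NumPy usage and keep probability are assumptions), inverted dropout on input features looks like this:

import numpy as np

def dropout(x, p_keep=0.8, training=True, rng=np.random.default_rng(0)):
    # Randomly zero each feature with probability 1 - p_keep during training,
    # and rescale so the expected value of the output matches the input.
    if not training:
        return x
    mask = rng.random(x.shape) < p_keep
    return x * mask / p_keep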
19
Feature Dropout Matrices
Randomly picked matrix
Dropout for Multinomial Logistic Regression
20
Feature Dropout Matrices
Randomly picked matrix
PERSON A
PERSON B
PERSON X
PERSON Y
PERSON Z
Dropout for Multinomial Logistic Regression
21
Feature Dropout Matrices
Randomly picked matrix
PERSON A
PERSON B
PERSON X
PERSON Y
PERSON Z
Forward Propagation
Dropout for Multinomial Logistic Regression
22
Feature Dropout Matrices
Randomly picked matrix
PERSON A
PERSON B
PERSON X
PERSON Y
PERSON Z
Forward Propagation
Back Propagation
Dropout for Multinomial Logistic Regression
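Putting the last few slides together, one SGD step of multinomial logistic regression with a randomly picked feature-dropout mask might be sketched as follows (a hedged reconstruction, not the authors' code; the shapes, keep probability, and learning rate are my assumptions):

import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def dropout_sgd_step(X, y, W, b, p_keep=0.8, lr=0.1, rng=np.random.default_rng(0)):
    # X: (n, p) minibatch, y: (n,) integer labels, W: (p, C), b: (C,)
    n, p = X.shape
    mask = (rng.random((1, p)) < p_keep) / p_keep   # randomly picked dropout "matrix" (diagonal)
    Xd = X * mask                                   # forward propagation on masked features
    P = softmax(Xd @ W + b)                         # class probabilities
    P[np.arange(n), y] -= 1.0                       # gradient of the cross-entropy w.r.t. the logits
    grad_W = Xd.T @ P / n                           # back propagation passes through the same mask
    grad_b = P.mean(axis=0)
    return W - lr * grad_W, b - lr * grad_b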
23
Structured Projection Matrices
PERSON A
PERSON B
PERSON X
PERSON Y
PERSON Z
Forward Propagation
Back Propagation
Replace Masking with Structured Matrices
Randomly picked matrix
24
Replace Masking with Structured Matrices
25
Replace Masking with Structured Matrices
Each projection matrix is generated from r random samples drawn with replacement from the training data set of size n.
30
Replace Masking with Structured Matrices
We project the training samples onto a lower-dimensional space (the back-projection of a reduced sample approximates x); hence, the weight matrix is applied in the reduced space.
31
Replace Masking with Structured Matrices
To update the weights, we project the gradients back to the original space.
32
Replace Masking with Structured Matrices
No projection is necessary for the bias term.
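The slides above describe the proposed step: project the samples with a structured matrix Φ, apply the weights through their projection in the reduced space, project the gradients back, and leave the bias untouched. A sketch of this reading (reusing softmax from the dropout sketch above; Φ, the shapes, and the learning rate are assumptions, not the authors' implementation):

import numpy as np

def structured_projection_step(X, y, W, b, Phi, lr=0.1):
    # X: (n, p), W: (p, C), b: (C,), Phi: (k, p) structured feature-grouping matrix
    n = X.shape[0]
    Xr = X @ Phi.T                   # project the training samples to the lower-dimensional space
    P = softmax(Xr @ (Phi @ W) + b)  # the weight matrix acts through its projection Phi @ W
    P[np.arange(n), y] -= 1.0
    grad_Wr = Xr.T @ P / n           # gradient with respect to the reduced weights (k, C)
    grad_W = Phi.T @ grad_Wr         # project the gradients back to the original space (p, C)
    grad_b = P.mean(axis=0)          # no projection is needed for the bias term
    return W - lr * grad_W, b - lr * grad_b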
33
Dimensionality Reduction Method by Feature
Grouping
Hoyos-Idrobo 2016
36
Recursive Nearest Agglomeration Clustering
(ReNA)
Hoyos-Idrobo 2016
• Agglomerative clustering schemes start off by placing every data
element in its own cluster.
• They proceed by repeatedly merging the closest pair of connected clusters until the desired number of clusters is reached.
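ReNA itself is the clustering from [Hoyos-Idrobo 2016]; as a rough stand-in for experimentation, scikit-learn's FeatureAgglomeration performs a related agglomerative feature grouping on a spatial connectivity graph (this substitution and all parameter values are my assumptions):

import numpy as np
from sklearn.cluster import FeatureAgglomeration
from sklearn.feature_extraction.image import grid_to_graph

X = np.random.rand(100, 64 * 64)              # stand-in for flattened 64 x 64 images
connectivity = grid_to_graph(64, 64)          # only spatially adjacent pixels may be merged
fg = FeatureAgglomeration(n_clusters=256, connectivity=connectivity)
X_reduced = fg.fit_transform(X)               # (100, 256): one averaged value per feature cluster
X_approx = fg.inverse_transform(X_reduced)    # (100, 4096): piecewise-constant approximation of X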
37
Insights: Random Reductions While Fitting
• Let the input be the sum of a deterministic term and a zero-mean noise term.
• The expected loss splits into the loss on the smoothed input plus a regularization cost, which involves the variance of the model given the smoothed input features and the variance of the estimated target due to the randomization.
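The exact expression on the slide was an image; one standard way to obtain terms matching the labels above is a second-order expansion of the expected loss over the random reduction Φ (notation assumed):

\[
\mathbb{E}_{\Phi}\!\left[\ell\!\left(f_{\theta}(\Phi^{\top}\Phi x),\, y\right)\right]
\approx
\underbrace{\ell\!\left(f_{\theta}(\bar{x}),\, y\right)}_{\text{loss on the smoothed input}}
+
\underbrace{\tfrac{1}{2}\,\operatorname{tr}\!\left(\nabla_{x}^{2}\,\ell\!\left(f_{\theta}(\bar{x}), y\right)\,
\operatorname{Cov}_{\Phi}\!\left[\Phi^{\top}\Phi x\right]\right)}_{\text{regularization cost}},
\qquad
\bar{x} = \mathbb{E}_{\Phi}\!\left[\Phi^{\top}\Phi x\right].
\]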
Insights: Random Reductions While Fitting
• Regularization Cost:
• For dropout, the resulting matrix is diagonal, and the remaining factor is constant for linear regression.
• This is equivalent to ridge regression after “orthogonalizing” the features.
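For intuition behind the ridge equivalence (a standard result for linear models, e.g. Wager et al., “Dropout training as adaptive regularization”, NIPS 2013, rather than the exact expression on this slide): with inverted dropout at keep probability p on the features of a linear regression, the expected loss adds the penalty

\[
\frac{1-p}{p}\,\sum_{j=1}^{d} w_j^{2} \sum_{i=1}^{n} x_{ij}^{2},
\]

i.e. a ridge penalty after rescaling (“orthogonalizing”) each feature by its empirical second moment.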
39
Computational Complexity
Total number of epochs
40
Experimental Results: Olivetti Faces
• High-dimensional data with a small sample size
• Consists of grayscale 64 x 64 face images from 40 subjects
• For each subject, there are 10 different images under varying lighting.
• Goal: Identification of the individual whose picture was taken
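A minimal sketch for reproducing this setup with scikit-learn (the noise level, split, and random seeds are my assumptions, not the paper's exact protocol):

import numpy as np
from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import train_test_split

faces = fetch_olivetti_faces()                       # 400 grayscale 64 x 64 images of 40 subjects
X, y = faces.data, faces.target                      # X: (400, 4096) with pixel values in [0, 1]
rng = np.random.default_rng(0)
X_noisy = X + rng.normal(scale=0.5, size=X.shape)    # simulate a "high noise" setting (level assumed)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_noisy, y, test_size=0.25, stratify=y, random_state=0)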
(Slides 41–54 repeat the setup above and present the Olivetti results as figures.)
55
Experimental Results: Olivetti Faces
• Visualization of the learned weights for logistic regression for a single
Olivetti face with high noise using different regularizers.
56
Experimental Results: Olivetti Faces
• Performance in terms of loss as a function of computation time for
MLP with a single layer using feature grouping and best parameters
for other regularizers, for Olivetti face data with high noise.
57
Experimental Results: Neuroimaging Data Set
• Openly accessible fMRI data set from the Human Connectome Project
• 500 subjects, 8 cognitive tasks to classify
• Feature dimension: 33854, training set: 3052 samples, test set: 791
samples
(Slides 58–59 present the neuroimaging results as figures.)
60
Summary – Stochastic Regularizer
• We introduced a stochastic regularizer
based on feature averaging that
captures the structure of data.
• Our approach leads to higher accuracy
at high noise settings without
additional computation time.
• Learned weights have more structure
at high noise settings.
61
Collaborators and References
• S. Aydore, B. Thirion, O. Grisel, G. Varoquaux. “Using Feature Grouping as a Stochastic
Regularizer for High-Dimensional Noisy Data”, Women in Machine Learning Workshop, NeurIPS
2018, Montreal, Canada, 2018. arXiv preprint arXiv:1807.11718.
• S. Aydore, L. Dicker, D. Foster. “A Local Regret in Nonconvex Online Learning”, Continual
Learning Workshop, NeurIPS 2018, Montreal, Canada, 2018. arXiv preprint arXiv:1811.05095.
Bertrand Thirion
(INRIA, France)
Olivier Grisel
(INRIA, France)
Gaël Varoquaux
(INRIA, France)
Dean Foster
(Amazon & University of Pennsylvania)
Lee Dicker
(Amazon & Rutgers University)
Thank You
More on my website…
http://www.sergulaydore.com
Editor's Notes
  • #3: In the graphic below, the x-axis reflects the level of technical sophistication of the AI tool. The y-axis represents the mass appeal of the tool. Here is a landscape of popular machine learning applications. It is of course very exciting to see such progress in AI. But all these applications require massive amounts of data to train machine learning models.
  • #4: Some fields, such as brain imaging, often do not have such massive numbers of samples, whereas the feature dimension is large due to the rich spatial and temporal information.
  • #5: This problem is not limited to brain imaging. There are other fields which also suffer from small-sample data situations.
  • #9: The performance of machine learning models is often evaluated by their prediction ability on unseen data. While each iteration of model training decreases the training risk, fitting the training data too well can lead to a failure to generalize to future predictions. This phenomenon is often called “overfitting” in machine learning. The risk of overfitting is more severe in high-dimensional, data-scarce situations. Such situations are common when data collection is expensive, as in neuroscience, biology, or geology.
  • #34–#36: Feature grouping defines a matrix Φ_FG that extracts piecewise-constant approximations of the data. Let Φ_FG be a matrix composed of constant-amplitude groups (clusters). Formally, the set of k clusters is given by P = {C_1, C_2, ..., C_k}, where each cluster C_q ⊂ [p] contains a set of indices that does not overlap the other clusters, C_q ∩ C_l = ∅ for all q ≠ l. Thus (Φ_FG x)_q = α_q Σ_{j ∈ C_q} x_j yields a reduction of a data sample x on the q-th cluster, where α_q is a constant for each cluster. With an appropriate permutation of the indices of the data x, the matrix Φ_FG can be written in block form with one constant block per cluster. We call Φ_FG x ∈ R^k the reduced version of x and Φ_FG^T Φ_FG x ∈ R^p the approximation of x.