1
On Transfer Learning Techniques for Machine Learning
Assistive Robotics Technology Laboratory
School of Electrical and Computer Engineering
Purdue University, West Lafayette, IN, USA
Debasmit Das
Advisory Committee
C. S. George Lee (Chair)
Stanley Chan
Guang Lin
Guang Cheng
2
Current ML Methods
[Canziani et al. ISCAS’17]
Evolution of Deep Architectures
INTRO
• Focus only on recognition performance.
• Highly resource-intensive, requiring large amounts of labeled data, compute, memory, and energy.
• Cannot be deployed in resource-constrained environments, e.g., mobile devices or annotation-free novel environments.
• Most of these models are closed-set.
• Need efficient machine learning techniques.
3
Efficient ML Goals INTRO
Goal 1: Use less training data and fewer labels.
Effects
• Less training time.
• Less memory footprint (for data storage).
• Overall, less energy consumed.
Goal 2: Produce models with fewer parameters.
Effects
• Less inference time.
• Less memory footprint (for model storage).
• Overall, less energy consumed.
• Data-efficient models imply model-efficient models, but not the other way around.
Without sacrificing recognition performance.
4
Learning with Fewer Labels INTRO
Data-efficient
Learning
Transfer
Learning
Self-supervised
Learning
Generative
Learning
Transfer Learning: learns to transfer knowledge from a data-abundant source domain to a sparsely labeled target domain, e.g., domain adaptation, few-shot learning.
Generative Learning: learns generative models that synthesize data from few labeled and unlabeled examples, e.g., GANs, VAEs.
Self-supervised Learning: defines a surrogate task on unlabeled data to learn features useful for a different task, e.g., predicting rotation or relative location.
• Transfer Learning is similar to the way humans learn from experience and apply it to new situations.
(Focus of my thesis)
5
Transfer Learning (TL)
• Allows pre-trained machine learning models to
be adapted and applied to label-starved new
tasks and new domains.
• New tasks can be novel categories.
• New domain can be a novel variety of the same
category.
• Automatic Annotation : Reduces human effort
of labeling new domains/tasks.
• Faster Learning : Learning novel tasks from
less data prevents long training time.
• Data Efficiency : In some domains, obtaining
data is cumbersome. E.g. Medical tests, Robotics.
Added Benefits
INTRO
6
Transfer Learning Tasks
Transfer Learning
• Domain Adaptation (same categories)
  - Unsupervised Domain Adaptation (UDA): target domain fully unlabeled.
  - Semi-supervised Domain Adaptation (SSDA): target domain sparsely labeled. Not included in this thesis.
• Small Sample Learning (different categories)
  - Few-shot Learning (FSL): target domain sparsely labeled.
  - Zero-shot Learning (ZSL): target domain fully unlabeled.
  - Hypothesis Transfer Learning (HTL): only source prototypes/models available.
(Diagram annotations: UDA, FSL, and embedding-based ZSL were completed before the prelim; generative ZSL and HTL were completed after the prelim.)
INTRO
7
Training and Testing conditions
UDA
• Distribution discrepancy between training and testing conditions.
• Testing data unlabeled but same categories as training.
HTL
• Base categories used as training and novel categories used for testing.
• Base (novel) categories contain models/prototypes (few labeled data).
FSL
• Base categories used as training and novel categories used for testing.
• Base categories contain abundant labeled data; novel categories contain few labeled data.
ZSL
• Base categories used as training and novel categories used for testing.
• Base (novel) categories contain abundant labeled (unlabeled) data; class-level semantic information available.
INTRO
8
Proposed Approach
Structural Priors: Graphs/Hyper-graphs, Manifolds, Neural Networks
• Structural priors are constructed from source-domain data.
• These priors learn a structure, i.e., an encoding between different data entities.
• Structural priors extract relational information and enable better transfer learning.
Problem | Structure | Entity
Unsupervised Domain Adaptation (UDA) | Graphs and Hyper-graphs | Sample - Sample
Few-Shot Learning (FSL) | Neural Network | Sample - Class Prototype
Hypothesis Transfer Learning (HTL) | Manifold | Class Prototype - Class Prototype
Zero-Shot Learning (ZSL) | Neural Network | Sample - Semantics
INTRO
9
Publications
Unsupervised Domain Adaptation
[J1] Debasmit Das and C. S. George Lee. "Sample-to-sample correspondence for unsupervised domain adaptation." Engineering Applications of Artificial Intelligence (EAAI), vol. 73, pp. 80-91, 2018.
[C1] Debasmit Das and C. S. George Lee. "Graph Matching and Pseudo-Label Guided Deep Unsupervised Domain Adaptation." Proceedings of the International Conference on Artificial Neural Networks (ICANN), 2018, pp. 342-352.
[C2] Debasmit Das and C. S. George Lee. "Unsupervised Domain Adaptation Using Regularized Hyper-Graph Matching." Proceedings of the IEEE International Conference on Image Processing (ICIP), 2018, pp. 3758-3762.
Few-Shot Learning
[J2] Debasmit Das and C. S. George Lee. "A Two-Stage Approach to Few-Shot Learning for Image Recognition." IEEE Transactions on Image Processing (TIP), 2020.
Zero-Shot Learning
[J3] Debasmit Das and C. S. George Lee. "A Constrained Generative Approach to Zero-shot Object Recognition." Under review at IEEE Transactions on Neural Networks and Learning Systems (TNNLS).
[C3] Debasmit Das and C. S. George Lee. "Zero-shot Image Recognition Using Relational Matching, Adaptation and Calibration." Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2019.
Hypothesis Transfer Learning
[P1] Debasmit Das, J. H. Moon, and C. S. George Lee. "Parametric and Non-parametric Approach to Few-shot Learning." To be submitted.
INTRO
10
Approach Overview
Method | Problem | Main Concept
J1 [EAAI] | UDA | Graph Matching
C1 [ICANN] | UDA | Deep Graph Matching
C2 [ICIP] | UDA | Hyper-graph Matching
J2 [TIP] | FSL | Predictive Statistics
P1 | HTL | Manifold Projection
J3 [TNNLS] | ZSL | Constrained Generation
C3 [IJCNN] | ZSL | Constrained Embedding
INTRO
11
Unsupervised Domain Adaptation (UDA)
OBJECTIVE
• Minimize domain discrepancy.
• Labeled source and unlabeled target.
• Transform source to target.
MOTIVATION
• Local information more useful.
• Structural information preserved across domains.
• Pseudo-labels refine classifiers.
APPROACH
• Graph matching to minimize domain discrepancy.
• Create a maximum-margin classifier.
RESULTS
• Better but slower than global methods.
• Representation learning slower but better.
• Third-order matching better than second-order matching.
(Figure: class-wise samples matched across the source and target domains.)
INTRO
12
Few-Shot Learning (FSL)
OBJECTIVE
• Estimate novel class prototypes.
• Labeled source and sparsely labeled target.
MOTIVATION
• Curse of dimensionality causes over-fitting.
• Ill-sampling produces incorrect prototype estimation.
• Uncertain variance causes misclassification.
APPROACH
• Extract a prior from the source.
RESULTS
• Competitive results with respect to previous work.
• Relative feature extractor most effective.
• More discriminative feature space.
INTRO
13
Hypothesis Transfer Learning (HTL)
OBJECTIVE
• Estimate novel class prototypes.
• Source prototypes and sparsely labeled target.
MOTIVATION
• A neural-network prior over-fits to little data.
APPROACH
• Extract a prior from the source: use a non-parametric method based on manifolds, or a simple parametric method such as Bayes.
• Base prototypes used for manifold construction, or used to form the prior.
• Novel samples projected onto the manifold, or used for the likelihood calculation.
RESULTS
• Manifold approach better than Bayesian approach.
• Both approaches most effective in the few-class regime.
• Closed-form Bayes method better than approximation methods.
(Figure: base-class prototypes, an unknown novel-class prototype, and novel-class samples; transferable vs. novel knowledge.)
INTRO
14
Zero-Shot Learning (ZSL)
OBJECTIVE
• Recognize novel categories with zero labeled data.
• Labeled source and unlabeled target.
• Relate features and semantics.
MOTIVATION
• Nearest-neighbor predictions produce hubs.
• Domain shift between predictions and ground truth.
• Predictions biased towards seen classes.
APPROACH
• Embed or generate between the feature space and the semantic space; adapt generated/embedded data to the unlabeled test data.
RESULTS
• Generative approach performs better than embedding approach.
• Domain adaptation is the most effective component.
• Structural matching improves generalization.
• Discrimination between seen and unseen classes prevents bias.
INTRO
15
Impact of proposed approaches
UDA [J1, C1, C2]
Core Idea: Match distributions using graphs/hyper-graphs.
Impact: Generative models, anomaly detection.
FSL [J2]
Core Idea: Discriminative low-dimensional space and generating statistics.
Impact: Discriminative/generative learning.
HTL [P1]
Core Idea: Estimate novel class prototypes.
Impact: Manifold distance metric.
ZSL [J3, C3]
Core Idea: Adaptive matching between features and semantics.
Impact: Media retrieval, description generation.
INTRO
16
BEFORE PRELIM
17
Graph Matching UDA
• Previous UDA methods use global information to minimize domain discrepancy.
• Local information is useful but can cause ambiguous one-to-one matching.
• Therefore, structural matching in the form of higher-order information is proposed.
Motivation
Source Domain Class 1 sample
Source Domain Class 2 sample
Target domain sample
UDA
18
Proposed Approach
Matching Formulation
Method 1
• Set
• Uses Conditional
Gradient Descent
+ Network
Simplex for
optimization.
Method 2
• Initial preprocessing to
obtain exemplars.
• Uses Conditional
Gradient Descent
+ ADMM for
optimization.
Method 3
• Set
• Learn features as
well as matching.
• Optimization
using stochastic
gradient descent.
• Second refining stage to
obtain maximum margin
classifier.
UDA
Source Domain
Target Domain
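For concreteness, a minimal sketch of this kind of matching is given below. It is not the exact formulation or optimizer of [J1]/[C1]/[C2] (which use conditional gradient descent with a network-simplex or ADMM sub-solver, or learn features jointly); instead it combines a first-order feature cost with a second-order structural term and optimizes a soft correspondence matrix with a plain Frank-Wolfe loop, assuming equally sized source and target sets. All function and variable names are illustrative.

```python
# Illustrative graph-matching UDA sketch (not the thesis formulation):
# find a soft source-to-target correspondence P that trades off a
# first-order feature cost against a second-order structural term.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist


def graph_matching_uda(Xs, Xt, lam=1.0, n_iter=50):
    """Return a doubly-stochastic correspondence matrix P (source x target)."""
    C = cdist(Xs, Xt, metric="sqeuclidean")           # first-order cost
    As = np.exp(-cdist(Xs, Xs, "sqeuclidean"))        # source graph affinity
    At = np.exp(-cdist(Xt, Xt, "sqeuclidean"))        # target graph affinity
    n = Xs.shape[0]
    P = np.full((n, n), 1.0 / n)                      # uniform initialization

    for k in range(n_iter):
        R = As @ P - P @ At                           # structural residual
        grad = C + 2.0 * lam * (As.T @ R - R @ At.T)  # gradient of the objective
        # Linear-minimization oracle over the Birkhoff polytope: a permutation.
        rows, cols = linear_sum_assignment(grad)
        Q = np.zeros_like(P)
        Q[rows, cols] = 1.0
        gamma = 2.0 / (k + 2.0)                       # standard Frank-Wolfe step
        P = (1.0 - gamma) * P + gamma * Q
    return P


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Xs = rng.normal(size=(20, 5))                     # toy source features
    Xt = Xs + 0.1 * rng.normal(size=(20, 5))          # shifted target features
    P = graph_matching_uda(Xs, Xt)
    Xs_transformed = P @ Xt                           # transport source onto target
    print(P.sum(axis=1))                              # rows are (approximately) 1
```

After matching, the transported source samples can be used to train a classifier for the target domain, which is where the second, maximum-margin refinement stage of the approach would come in.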
19
Experimental Results
Ablation Studies
Without Adaptation
With Graph Matching
With Graph Matching
& Pseudo-labeling
UDA
Dataset: Office-Caltech
20
Two Stage FSL
• Curse of dimensionality: Addressed by using
a new representation that uses relative distances
between features.
• Ill-sampling of data: The novel class prototype is
estimated by learning a model that predicts the
mean.
• Uncertain Class Variance: The novel class
variance is estimated by learning a model that
predicts the variance.
Motivation
Feature Space
FSL
21
Proposed Approach
Relative Feature
Probability with absolute & relative features
Loss Function
FSL
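A toy sketch of the two-stage idea is shown below. The learned mean and variance predictors of [J2] are replaced here by simple sample estimates, and the relative feature is taken to be the vector of distances to the base-class prototypes; the exact relative-feature definition, probability model, and loss in [J2] may differ. Names and dimensions are illustrative.

```python
# Illustrative two-stage FSL sketch: augment absolute features with
# "relative" features (distances to base-class prototypes), estimate a
# prototype and variance per novel class, and classify queries with a
# variance-scaled distance.
import numpy as np


def relative_features(X, base_prototypes):
    """Concatenate absolute features with distances to base-class prototypes."""
    d = np.linalg.norm(X[:, None, :] - base_prototypes[None, :, :], axis=-1)
    return np.concatenate([X, d], axis=1)


def estimate_class_stats(support):
    """Stand-ins for the learned mean/variance predictors: sample estimates."""
    mu = support.mean(axis=0)
    var = support.var(axis=0) + 1e-3        # small floor avoids divide-by-zero
    return mu, var


def classify(query, class_stats):
    """Assign each query to the class with the smallest variance-scaled distance."""
    scores = [((query - mu) ** 2 / var).sum(axis=1) for mu, var in class_stats]
    return np.argmin(np.stack(scores, axis=1), axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base_prototypes = rng.normal(size=(10, 8))             # 10 base classes
    centers = rng.normal(size=(2, 8))                      # 2 novel classes
    supports = [c + 0.2 * rng.normal(size=(5, 8)) for c in centers]   # 5 shots
    queries = np.vstack([c + 0.2 * rng.normal(size=(20, 8)) for c in centers])

    stats = [estimate_class_stats(relative_features(s, base_prototypes))
             for s in supports]
    preds = classify(relative_features(queries, base_prototypes), stats)
    print(preds)                                            # mostly 0s then 1s
```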
22
Experimental Results
Ablation Study
PN – Prototypical Network V – Variance Estimator R – Relative Feature T – Category-agnostic Transformer
Feature visualization
without (left) and
with (right) relative
features
Dataset: MiniImageNet
Performance change with no. of base
categories
FSL
23
Embedding-based ZSL
Feature Space
Semantic Space
• Hubness problem: addressed using pairwise structural matching.
• Domain shift problem: addressed using our sample-to-sample correspondence approach.
• Seen-class bias problem: addressed using a scaled calibration mechanism.
Motivation
ZSL
Each semantic vector of a class is a
histogram of attributes.
24
Proposed Approach
Relational Matching
Calibration
Minimize loss using gradient descent
Seen
Unseen
Total
We use sample-to-sample
matching for domain
adaptation
ZSL
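The calibration step can be illustrated with the following sketch of the general idea (the exact scaled-calibration rule and its hyper-parameters in [C3] may differ): scores of seen classes are penalized by a calibration factor before taking the arg-max over the union of seen and unseen classes.

```python
# Illustrative calibration against seen-class bias.
import numpy as np


def calibrated_predict(scores, seen_mask, gamma=0.5):
    """scores: (n_samples, n_classes); seen_mask: boolean flag per class."""
    adjusted = scores.copy()
    adjusted[:, seen_mask] -= gamma        # down-weight seen-class scores
    return adjusted.argmax(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scores = rng.random((4, 6))            # toy compatibility scores
    seen_mask = np.array([True, True, True, False, False, False])
    print(calibrated_predict(scores, seen_mask, gamma=0.3))
```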
25
Experimental Results
Hubness Measurement
Hubness measured using skewness of NN
prediction distribution
Effect of the
calibration factor
Effect of the
structural matching
weight
Without Domain Adaptation
With Domain Adaptation
Unseen Features
Seen Features
Unseen Semantic Embedding
Seen Semantic Embedding
ZSL
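The hubness measure mentioned above can be sketched as follows: count how often each class embedding is the nearest neighbour of a test feature and take the skewness of that occurrence distribution; a large positive skew indicates that a few "hub" classes attract most predictions. The toy data and names below are illustrative.

```python
# Illustrative hubness measurement via skewness of the NN-occurrence counts.
import numpy as np
from scipy.stats import skew
from scipy.spatial.distance import cdist


def hubness(test_features, class_embeddings):
    nn = cdist(test_features, class_embeddings).argmin(axis=1)    # NN class per sample
    counts = np.bincount(nn, minlength=class_embeddings.shape[0]) # occurrences per class
    return skew(counts)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(500, 16))
    classes = rng.normal(size=(20, 16))
    print(hubness(feats, classes))
```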
26
AFTER PRELIM
27
Generative ZSL
Feature Space
Semantic Space
• Base Categories (source domain) contain
abundant labeled data.
• Novel Categories (target domain) contain
unlabeled data.
• However, class level semantic information
available for all categories.
• Need to relate the feature space and the semantic space.
ZSL
Each semantic vector of a class is a
histogram of attributes.
28
Motivation ZSL
• The relation between the semantic and feature spaces is biased towards seen classes because there is no training data for unseen classes.
• As a result, generative methods have been proposed to generate data for unseen classes.
• However, the generative model itself may be biased towards seen classes.
• This is because no labeled data from unseen classes is used for learning the generative model.
• Need to constrain the generation process such that seen and unseen classes are distinguished from one another.
• Need to also close a cycle between the semantic and feature spaces to preserve semantic consistency.
Base class semantic
embedding
Novel class semantic
embedding
Novel class
test sample
29
Constrained Training
• A discriminator is used to distinguish synthetic unseen-class data from real seen-class data.
• The generated features are also reconstructed back to their corresponding semantic descriptor.
(Figure: a generator takes noise and class semantics and produces fake features; a critic C scores real vs. fake features; a reconstructor R maps generated features back to semantics under a reconstruction loss; a discriminator D separates seen-class real features from unseen-class fake features.)
A simplified sketch of this training loop follows below.
ZSL
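The sketch below illustrates one possible form of the constrained training loop. It is a simplified stand-in, not the exact architecture, losses, or hyper-parameters of [J3]: the WGAN gradient penalty is omitted, the networks are small MLPs, and the sign and weighting of the seen/unseen-discriminator term are assumptions.

```python
# Illustrative constrained generative ZSL training step (simplified).
import torch
import torch.nn as nn

FEAT, SEM, NOISE = 64, 16, 16

def mlp(n_in, n_out):
    return nn.Sequential(nn.Linear(n_in, 128), nn.ReLU(), nn.Linear(128, n_out))

G = mlp(NOISE + SEM, FEAT)            # generator: (noise, semantics) -> feature
C = mlp(FEAT + SEM, 1)                # critic: conditional real/fake score
R = mlp(FEAT, SEM)                    # reconstructor: feature -> semantics
D = mlp(FEAT, 1)                      # seen vs. unseen discriminator

opt_G = torch.optim.Adam(list(G.parameters()) + list(R.parameters()), lr=1e-4)
opt_C = torch.optim.Adam(C.parameters(), lr=1e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(x_seen, s_seen, s_unseen, lam_rec=1.0, lam_dis=0.1):
    """One simplified update; x_seen: real seen features, s_*: class semantics."""
    z = torch.randn(x_seen.size(0), NOISE)
    x_fake_seen = G(torch.cat([z, s_seen], dim=1))
    x_fake_unseen = G(torch.cat([z, s_unseen], dim=1))

    # Critic update (WGAN-style objective, gradient penalty omitted for brevity).
    loss_C = C(torch.cat([x_fake_seen.detach(), s_seen], 1)).mean() \
           - C(torch.cat([x_seen, s_seen], 1)).mean()
    opt_C.zero_grad(); loss_C.backward(); opt_C.step()

    # Discriminator update: seen real features (label 1) vs. generated unseen (label 0).
    logits = torch.cat([D(x_seen), D(x_fake_unseen.detach())])
    labels = torch.cat([torch.ones(x_seen.size(0), 1), torch.zeros(x_seen.size(0), 1)])
    loss_D = bce(logits, labels)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator/reconstructor update: fool the critic, reconstruct semantics,
    # and keep generated unseen features separable from seen features
    # (cooperative term; its sign/weighting is an assumption in this sketch).
    loss_G = -C(torch.cat([x_fake_seen, s_seen], 1)).mean() \
           + lam_rec * ((R(x_fake_seen) - s_seen) ** 2).mean() \
           + lam_dis * bce(D(x_fake_unseen), torch.zeros(x_seen.size(0), 1))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()

if __name__ == "__main__":
    B = 8
    train_step(torch.randn(B, FEAT), torch.randn(B, SEM), torch.randn(B, SEM))
```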
30
Selective Domain Adaptation
• The discriminator is used to separate the test data into seen and unseen, from which the unseen samples are selected.
• The generated data of the unseen classes are adapted with respect to the selected unseen test data.
(Figure: the discriminator D splits the unlabeled test data into unlabeled seen and unlabeled unseen test data using a threshold; the generated unseen data are aligned to the selected unseen test data via sample-to-sample domain adaptation; the generator takes unseen-class semantics, and reconstructed semantics provide a matching regularization.)
A minimal sketch of the selection step follows below.
ZSL
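A minimal sketch of the selection step; the threshold value and the convention that the discriminator outputs a "probability of being seen" are assumptions.

```python
# Illustrative seen/unseen split of unlabeled test data by thresholding D.
import numpy as np

def select_unseen(test_features, d_score, threshold=0.5):
    """d_score maps a batch of features to a 'probability of being seen'."""
    seen_prob = d_score(test_features)
    unseen = test_features[seen_prob < threshold]      # kept for domain adaptation
    seen = test_features[seen_prob >= threshold]       # not adapted
    return unseen, seen

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(100, 64))
    toy_score = lambda x: 1.0 / (1.0 + np.exp(-x[:, 0]))   # stand-in for D
    unseen, seen = select_unseen(feats, toy_score)
    print(len(unseen), len(seen))
```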
31
Comparative Analysis
tr – Unseen class accuracy in traditional setting
u – Unseen class accuracy in generalized setting
s – Seen class accuracy in generalized setting
H – Harmonic mean of u and s
• Animals with Attributes (AwA)
[Lampert et al. TPAMI’14]
(Att – 85, Ysrc - 40 , Ytar - 10 )
• Pascal & Yahoo (aPY)
[Farhadi et al. CVPR’09]
(Att – 64, Ysrc - 20 , Ytar - 12 )
• Caltech-UCSD Birds (CUB)
[Welinder et al. ‘10]
(Att – 312, Ysrc - 150 , Ytar - 50 )
• Scene Understanding (SUN)
[Patterson et al. CVPR’12]
(Att – 102, Ysrc - 645, Ytar - 72 )
• Flowers Dataset (FLO)
[Nilsback et al. ICVGIP’08]
(Att – 1024, Ysrc - 82, Ytar - 20 )
Datasets
ZSL
32
Further Analyses ZSL
Generated novel class
features without
domain adaptation
on the AwA dataset
Generated novel class
features with
domain adaptation
on the AwA dataset
Convergence study
with increasing epochs
on AWA dataset
Ablation study
B - WGAN Baseline
R - Reconstructor
D - Discriminator
A – Domain Adaptation
33
Sensitivity Studies
Sensitivity to threshold
Sensitivity to number of generated
features on FLO dataset
ZSL
34
Conclusion
• Results on standard image recognition datasets are better than most of the previous generative and non-generative approaches.
• Ablation studies show that all of the contributions are important, but domain adaptation is the most effective.
• Results on fine-grained datasets show that there is a lot of scope for improvement.
• Need to incorporate fine-grained learning architectures into our framework.
• Can explore using the same set of architectural constraints on other generative models such as variational auto-encoders, normalizing flows, etc.
ZSL
35
Hypothesis Transfer Learning (HTL)
Feature Space
• No access to base-category (source-domain) data.
• Only high-level information about the source categories is available, e.g., model parameters, class prototypes.
• Novel categories (target domain) contain sparsely labeled data.
• Need to estimate the location of the novel class parameters.
HTL
36
Motivation HTL
Relatively unexplored topic. Prior work constrains the target model to be some combination of source models:
• Linear Combination [Tommasi et al. TPAMI'14]
• Non-Linear Combination [Jie et al. ICCV'11]
• Feature Selection [Kuzborskij et al. CVIU'17]
(Figure: Source Models 1-3 combined into a Target Model.)
• Source models are not consistent and do not provide a reliable benchmark for comparison.
• Source class prototypes can provide a reliable benchmark because they depend only on the data.
(Figure: Source Prototypes 1-3 and the Target Prototype.)
37
Proposed Solution HTL
• The limited information from the source cannot be used for training neural networks because of over-fitting.
• Need a non-parametric method, or a parametric method with minimal parameters.
Manifold Approach
• Non-parametric approach.
• Inspired by the assumption that class data lie on a subspace [Basri et al. TPAMI'03].
• Construct a manifold from the source prototypes.
• Project novel-class samples onto it to obtain the novel class prototype.
Bayesian Approach
• Parametric approach.
• Inspired by the assumption that class data belong to a model family.
• Construct a prior distribution from the source prototypes and a likelihood from the novel-class samples.
• The posterior distribution is used to obtain the novel class prototype.
38
Manifold Approach HTL
Estimate novel class prototypes (M1)
• Estimate the novel class prototype as the manifold mean; a closed-form solution exists.
Predict using an absorbing Markov chain (M2)
• Choose novel classes as transient and base classes as absorbing; obtain the most probable base class.
• Choose base classes as transient and novel classes as absorbing; obtain the most probable novel class.
• Between the most probable base class and the most probable novel class, use the nearest neighbor to decide.
A simplified sketch of the absorption-probability computation follows below.
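The sketch below computes absorption probabilities for a simplified version of M2: transient states are test samples and absorbing states are class prototypes, with transitions built from a Gaussian affinity (an assumption here); the transient/absorbing role swap and the final nearest-neighbour tie-break are omitted.

```python
# Illustrative absorbing-Markov-chain prediction: B = (I - Q)^{-1} R.
import numpy as np
from scipy.spatial.distance import cdist


def absorption_probs(transient_pts, absorbing_pts, sigma=1.0):
    """Rows: transient states; columns: probability of absorption per absorbing state."""
    pts = np.vstack([transient_pts, absorbing_pts])
    W = np.exp(-cdist(transient_pts, pts, "sqeuclidean") / (2 * sigma ** 2))
    P = W / W.sum(axis=1, keepdims=True)            # row-stochastic transitions
    n_t = transient_pts.shape[0]
    Q, R = P[:, :n_t], P[:, n_t:]                   # canonical form [Q | R]
    return np.linalg.solve(np.eye(n_t) - Q, R)      # absorption probabilities


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prototypes = rng.normal(size=(5, 8))            # absorbing: class prototypes
    sample = prototypes[2] + 0.1 * rng.normal(size=(1, 8))    # transient: test sample
    print(absorption_probs(sample, prototypes).argmax())       # most probable class, likely 2
```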
39
Bayesian Approach HTL
The prior distribution is varied, but the likelihood is fixed.
Case 1 (B1): Normal prior on the mean. The variance is fixed but obtained heuristically from the source prototypes. The posterior density is normal, with a closed-form mean.
Case 2 (B2, B3): Normal-Gamma prior on the mean and precision. The posterior density is normal-gamma, with a closed-form mode. The Gamma prior reduces to a uniform prior in a limiting case.
Case 3 (B4): Normal prior on the mean and Gamma prior on the precision. The posterior distribution is not closed-form and requires a variational-Bayes approximation.
A minimal sketch of the Case 1 update follows below.
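The Case 1 (B1) update can be sketched as a standard conjugate normal-normal update; the particular heuristics below for the prior mean and variance (the mean and spread of the source prototypes) are assumptions, not necessarily those of [P1].

```python
# Illustrative conjugate normal-normal estimate of a novel class prototype.
import numpy as np


def posterior_prototype(source_prototypes, novel_samples, obs_var=1.0):
    mu0 = source_prototypes.mean(axis=0)             # prior mean from source prototypes
    var0 = source_prototypes.var(axis=0) + 1e-6      # prior variance (heuristic)
    n = novel_samples.shape[0]
    xbar = novel_samples.mean(axis=0)
    # Per-dimension conjugate update:
    # mu_post = (mu0/var0 + n*xbar/obs_var) / (1/var0 + n/obs_var)
    return (mu0 / var0 + n * xbar / obs_var) / (1.0 / var0 + n / obs_var)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    protos = rng.normal(size=(10, 8))                # base-class prototypes
    novel = 2.0 + 0.3 * rng.normal(size=(3, 8))      # 3-shot novel class
    print(posterior_prototype(protos, novel))        # shrinks the sample mean toward mu0
```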
40
Experimental Results HTL
Recognition performance on ImageNet
as shots are varied
Recognition performance on CUB-200
as shots are varied
Recognition performance on ImageNet
as total no. of classes are varied
Recognition performance on ImageNet
as fraction of base classes are varied
41
Further Analyses HTL
For Model B4
42
Conclusion
• The manifold approach performs better than the Bayesian approach in the few-shot and few-class regimes.
• Both approaches are the most effective in the few-class regime, that is, when the number of base classes is small.
• Closed-form Bayes methods perform better than approximation methods.
• The Markov-chain-based distance has an incremental effect and can be used as a metric during the training stage.
• Explore other priors and hyper-priors for the Bayesian model.
HTL
43
Research Summary
MOTIVATION
• Current ML methods consume lots of resources.
• Goal is to make ML more data-efficient.
• TL simulates human learning by reusing models.
APPROACH
• Structural priors: graphs/hyper-graphs for UDA, neural networks for FSL and ZSL, manifolds for HTL, relating samples, prototypes, and semantics.
• Additional constraints and post-processing.
RESULTS
• Competitive when compared with previous work.
• Ablation studies show the importance of each component.
• Need to improve on fine-grained datasets.
44
Limitations of Current Work
• Abundant Source Data: the proposed transfer learning approaches mostly require abundant labeled data from the source domain.
• Batched Target: target-domain data are available as a batch instead of in a sequential, incremental manner.
• Same Modalities: the source and target domains share the same feature space and cannot have different modalities.
• Hence, there is a need to tackle more realistic transfer learning settings.
45
Future Direction
Unsupervised Transfer Learning (Source → Target)
• Use self-supervised learning.
• Clustering for pre-processing.
Sequential Transfer Learning (Source → Target 1 → Target 2 → Target 3)
• Need to develop incremental algorithms.
• Correspondence can be time-dependent.
Heterogeneous Transfer Learning (Source → Target, different modalities)
• Need a common subspace for the two domains.
• Metric minimization is required for the subspace.
46
THANK YOU
Any Questions ?
