Domain Adaptation with Adversarial Learning
Rishiraj Chakraborty, Sourya Sengupta and Chengzhu Xu
Dept. of Applied Math and System Design Eng.
May 12, 2019
Contents
1 Introduction
2 Domain Divergence
3 Proxy Distance
4 Adversarial Models
5 Generative models
6 Results
Introduction
Domain adaptation is the process of training a machine learning
model on one input domain so that it also performs well on data
from a different domain.
Domain adaptation is a common requirement in tasks such as object
recognition, object detection, image categorization, speech
recognition and sentiment analysis, because labelling newly
obtained data is expensive.
This is done either by finding a common embedding for the two
domains or by transforming data from one domain into the other.
We discuss domain adaptation techniques for both adversarial models
and generative models.
Setup for Adversarial Models
For simplicity, we consider a binary classification problem.
A domain is a pair consisting of a distribution D on inputs X and a
labelling function f : X → [0, 1].
We consider a source domain $\langle D_S, f_S \rangle$ and a target domain $\langle D_T, f_T \rangle$.
Objective
We want to learn a hypothesis h : X → [0, 1] because the labelling function f is unknown.
The expected disagreement between h and f under D_S (the source risk) is
$\epsilon_S(h, f) = \mathbb{E}_{x \sim D_S}\big[\,|h(x) - f(x)|\,\big]$
We want our hypothesis h to minimize the source risk $\epsilon_S$ and also to
generalize well on D_T, thereby minimizing the target risk $\epsilon_T$ as well.
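In practice the source risk is estimated from a finite labelled sample rather than in expectation. A minimal sketch of such a Monte Carlo estimate (the helper name and toy data are illustrative, not from the slides):

```python
import numpy as np

def source_risk(h, f, x_source):
    """Monte Carlo estimate of eps_S(h, f) = E_{x ~ D_S}[|h(x) - f(x)|].

    h, f : callables mapping a batch of inputs to values in [0, 1]
    x_source : samples drawn from the source distribution D_S
    """
    return np.mean(np.abs(h(x_source) - f(x_source)))

# Toy usage: f plays the (normally unknown) labelling function, h a candidate hypothesis.
rng = np.random.default_rng(0)
x_s = rng.normal(size=(1000, 2))
f = lambda x: (x[:, 0] > 0).astype(float)                    # true labelling function
h = lambda x: (x[:, 0] + 0.1 * x[:, 1] > 0).astype(float)    # learned hypothesis
print(source_risk(h, f, x_s))
```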
Domain Divergence
Domain divergence is a notion of the distance between the source and
the target distributions.
A natural measure of domain divergence is the L1 or variation
divergence,
$d_1(D_S, D_T) = 2 \sup_{B \in \mathcal{B}} \big| \Pr_{D_S}[B] - \Pr_{D_T}[B] \big|$
where $\mathcal{B}$ is the set of measurable subsets under D_S and D_T.
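For distributions on a finite support the supremum is attained by the set of points where one distribution exceeds the other, so $d_1$ reduces to the sum of absolute differences of the probability masses. A small sketch under that assumption (the general case cannot be estimated reliably from samples, as discussed later):

```python
import numpy as np

def d1_discrete(p, q):
    """Variation divergence d_1 = 2 * sup_B |P(B) - Q(B)| for finite supports.

    The supremum is attained at B = {x : p(x) > q(x)}, which gives
    d_1 = sum_x |p(x) - q(x)|.
    """
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.abs(p - q).sum()

# Two distributions over four outcomes.
print(d1_discrete([0.4, 0.3, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4]))
```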
Bounding the Target Risk
For a hypothesis h,
$\epsilon_T(h) \leq \epsilon_S(h) + d_1(D_S, D_T) + \min\!\big\{ \mathbb{E}_{D_S}\big[\,|f_S(x) - f_T(x)|\,\big],\; \mathbb{E}_{D_T}\big[\,|f_S(x) - f_T(x)|\,\big] \big\}$
The target risk is bounded by the source risk, the domain divergence
and the difference in the labelling functions across the two domains.
It is safe to assume that the difference between the labelling functions,
which reflects the inherent Bayes error, is small.
Hence the domain divergence term is critical for bounding the target
risk.
Problems with the L1 Divergence
The L1 divergence cannot be accurately estimated from finite samples of
arbitrary distributions.
It unnecessarily inflates the bound by taking the supremum over all
measurable subsets.
We are only interested in subsets on which hypotheses from a hypothesis
class of finite complexity can disagree.
H-divergence
Given a domain X with two probability distributions DS and DT , let H be
a hypothesis class on X and I(h) denote the set for which h ∈ H is the
characteristic function, i.e. x ∈ I(h) ⇔ h(x) = 1. The H-divergence
between DS and DT is given by,
$d_H(D_S, D_T) = 2 \sup_{h \in H} \big| \Pr_{D_S}[I(h)] - \Pr_{D_T}[I(h)] \big|$
The H-divergence can be estimated from a finite number of samples
for a hypothesis class of finite VC dimension.
The H-divergence for any H is never larger than the L1 divergence.
This is because the H-divergence compares the two distributions only on
the subsets that hypotheses in H can pick out.
Empirical H-divergence
For a symmetric hypothesis class H, it can be proved that one can
compute the empirical H-divergence between two samples S ∼ (D_S^X)^n
and T ∼ (D_T^X)^n as
$\hat{d}_H = 2\left(1 - \min_{h \in H}\left[\frac{1}{n}\sum_{i=1}^{n} I[h(x_i) = 0] + \frac{1}{n}\sum_{i=n+1}^{N} I[h(x_i) = 1]\right]\right)$
Proxy Distance
$\hat{d}_H = 2\left(1 - \min_{h \in H}\left[\frac{1}{n}\sum_{i=1}^{n} I[h(x_i) = 0] + \frac{1}{n}\sum_{i=n+1}^{N} I[h(x_i) = 1]\right]\right)$
We can approximate the empirical H-divergence $\hat{d}_H$ by training a
neural network to classify between the source and the target domains.
We construct a new dataset U, with source samples labelled 0 and
target samples labelled 1. If $\epsilon$ is the risk of a classifier trained on U,
the proxy distance between the two domains is $\hat{d}_A = 2(1 - 2\epsilon)$.
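A minimal sketch of this procedure, using a logistic-regression domain classifier in place of the neural network (scikit-learn and the helper name are assumptions, not from the slides):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def proxy_a_distance(x_source, x_target):
    """Proxy distance d_A = 2 * (1 - 2 * eps), where eps is the held-out error
    of a classifier trained to separate source (label 0) from target (label 1)."""
    X = np.vstack([x_source, x_target])
    y = np.concatenate([np.zeros(len(x_source)), np.ones(len(x_target))])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    eps = 1.0 - clf.score(X_te, y_te)        # domain classification error
    return 2.0 * (1.0 - 2.0 * eps)

rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=(500, 5))
tgt = rng.normal(0.5, 1.0, size=(500, 5))    # shifted target domain
print(proxy_a_distance(src, tgt))
```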
Shallow Neural Networks
The objective is to learn the classification task without discriminating
between the source and the target domains.
We take a hidden layer G_f that maps an input into a new D-dimensional
representation.
We have a prediction layer G_y : $\mathbb{R}^D \to [0, 1]$ that maps the D-dimensional
representation into probabilities of the corresponding classes.
Given a source example (x_i, y_i), the classification cross-entropy loss is
$L_y(G_y(G_f(x_i)), y_i) = -\log\big(G_y(G_f(x_i))_{y_i}\big)$
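A sketch of this shallow architecture in PyTorch (layer sizes and names are illustrative, not from the slides); nn.CrossEntropyLoss takes the negative log of the predicted probability of the true class, matching L_y above but applied to logits:

```python
import torch
import torch.nn as nn

D = 100            # dimension of the hidden representation
num_classes = 10

G_f = nn.Sequential(nn.Linear(28 * 28, D), nn.ReLU())   # feature extractor
G_y = nn.Linear(D, num_classes)                          # label predictor (logits)

loss_y = nn.CrossEntropyLoss()   # -log softmax(G_y(G_f(x)))[y]

x = torch.randn(32, 28 * 28)                 # a batch of source examples
y = torch.randint(0, num_classes, (32,))     # their class labels
L_y = loss_y(G_y(G_f(x)), y)
L_y.backward()                               # gradients for theta_f and theta_y
```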
Domain Regularizer
The optimization problem is
$\min_{\theta_f, \theta_y} \; \frac{1}{n}\sum_{i=1}^{n} L_y^i(\theta_f, \theta_y) + \lambda \, R(\theta_f)$
where $R(\theta_f)$ is an optional regularizer weighted by the hyper-parameter $\lambda$.
Domain Regularizer
Empirical H-divergence:
$\hat{d}_H = 2\left(1 - \min_{h \in H}\left[\frac{1}{n}\sum_{i=1}^{n} I[h(x_i) = 0] + \frac{1}{n}\sum_{i=n+1}^{N} I[h(x_i) = 1]\right]\right)$
We use a domain classification layer G_d that maps the D-dimensional
representation from G_f to a predicted domain label: 0 for source and 1 for target.
The domain cross-entropy loss for an example x_i with domain label d_i is
$L_d(G_d(G_f(x_i)), d_i) = -d_i \log\big(G_d(G_f(x_i))\big) - (1 - d_i)\log\big(1 - G_d(G_f(x_i))\big)$
The min part in $\hat{d}_H$ can then be approximated by
$R(\theta_f) = \max_{\theta_d}\left[-\frac{1}{n}\sum_{i=1}^{n} L_d^i(\theta_f, \theta_d) - \frac{1}{n}\sum_{i=n+1}^{N} L_d^i(\theta_f, \theta_d)\right]$
$R(\theta_f)$ introduces a trade-off between the source risk and $\hat{d}_H$, and $\lambda$ is
used to tune this trade-off.
Complete Optimization Problem
$E(\theta_f, \theta_y, \theta_d) = \frac{1}{n}\sum_{i=1}^{n} L_y^i(\theta_f, \theta_y) - \lambda\left(\frac{1}{n}\sum_{i=1}^{n} L_d^i(\theta_f, \theta_d) + \frac{1}{n}\sum_{i=n+1}^{N} L_d^i(\theta_f, \theta_d)\right)$
$(\hat{\theta}_f, \hat{\theta}_y) = \arg\min_{\theta_f, \theta_y} E(\theta_f, \theta_y, \theta_d)$
$\hat{\theta}_d = \arg\max_{\theta_d} E(\theta_f, \theta_y, \theta_d)$
The discriminator tries to minimize the domain classification error, which
approximates $\hat{d}_H$, while the hidden layer tries to learn representation
vectors that are indistinguishable across domains.
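One way to optimize this saddle point is to alternate updates: a step maximizing E over θ_d (i.e. minimizing the domain loss), then a step minimizing E over (θ_f, θ_y). A schematic PyTorch sketch under that interpretation (all module and variable names are illustrative, not from the slides):

```python
import torch
import torch.nn as nn

G_f = nn.Sequential(nn.Linear(784, 100), nn.ReLU())   # feature extractor
G_y = nn.Linear(100, 10)                               # label predictor
G_d = nn.Linear(100, 1)                                # domain discriminator (logit)

opt_fy = torch.optim.SGD(list(G_f.parameters()) + list(G_y.parameters()), lr=0.01)
opt_d = torch.optim.SGD(G_d.parameters(), lr=0.01)
ce, bce = nn.CrossEntropyLoss(), nn.BCEWithLogitsLoss()
lam = 0.1

def train_step(x_s, y_s, x_t):
    # 1) Discriminator step: argmax_theta_d E  <=>  minimize the domain loss.
    feats = torch.cat([G_f(x_s), G_f(x_t)]).detach()
    dom_labels = torch.cat([torch.zeros(len(x_s), 1), torch.ones(len(x_t), 1)])
    d_loss = bce(G_d(feats), dom_labels)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Feature/label step: argmin_{theta_f, theta_y} E = L_y - lambda * L_d.
    f_s, f_t = G_f(x_s), G_f(x_t)
    label_loss = ce(G_y(f_s), y_s)
    domain_loss = bce(G_d(torch.cat([f_s, f_t])), dom_labels)
    e = label_loss - lam * domain_loss   # the feature extractor tries to fool G_d
    opt_fy.zero_grad(); e.backward(); opt_fy.step()

# Toy usage with random tensors standing in for MNIST-like batches.
train_step(torch.randn(32, 784), torch.randint(0, 10, (32,)), torch.randn(32, 784))
```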
Generalized Architectures
Instead of the example shallow network, we can use more
sophisticated architectures for both the feature extractor and the
discriminator.
Generalized weight updates:
$\theta_f \leftarrow \theta_f - \mu\left(\frac{\partial L_y^i}{\partial \theta_f} - \lambda \frac{\partial L_d^i}{\partial \theta_f}\right)$
$\theta_y \leftarrow \theta_y - \mu \frac{\partial L_y^i}{\partial \theta_y}$
$\theta_d \leftarrow \theta_d - \mu \lambda \frac{\partial L_d^i}{\partial \theta_d}$
Generalized Architectures
We see that for the classifier and the discriminator the gradients are
subtracted as in ordinary gradient descent, but for the feature extractor
the gradient of the discriminator loss is added.
To implement this with standard gradient descent optimizers we add a
gradient reversal layer, which acts as the identity during the forward
pass and changes the sign of the incoming gradient during backpropagation.
Figure: Domain-Adversarial Neural Network (DANN)
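A common PyTorch sketch of such a gradient reversal layer, written as a custom autograd Function (the λ scaling is folded into the backward pass; names are illustrative):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in the backward pass."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back into the feature extractor.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Usage: features -> grad_reverse -> domain classifier. The domain classifier is
# trained by ordinary gradient descent, while the feature extractor receives the
# negated gradient, implementing the adversarial update from the previous slide.
feats = torch.randn(8, 100, requires_grad=True)
domain_head = torch.nn.Linear(100, 1)
loss = domain_head(grad_reverse(feats, lam=0.1)).sum()
loss.backward()
```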
Generative Models
So far, we discussed how adversarial learning can be used to learn
features that do not discriminate between the two domains. Domain
adaptation can also be done by generating synthetic data that share
high-level semantics across the two domains, using Generative
Adversarial Networks (GANs).
A GAN combines a generative model with a discriminative model.
GAN-based domain adaptation techniques use source data, noise vectors,
or both to generate samples that are similar to the target data.
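For reference, a minimal sketch of the standard GAN training step (generator G maps noise to samples, discriminator D outputs a real-vs-fake logit); the architectures below build on this basic recipe, and the toy networks here are assumptions, not from the slides:

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))   # noise -> sample
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))    # sample -> real/fake logit
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

real = torch.randn(32, 2) + 3.0    # stand-in for real (target-domain) data
z = torch.randn(32, 16)            # noise vectors

# Discriminator step: real samples labelled 1, generated samples labelled 0.
d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(G(z).detach()), torch.zeros(32, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: try to make generated samples be classified as real.
g_loss = bce(D(G(z)), torch.ones(32, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```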
Generative Models
Figure: GAN Architecture
Generative Models
We discuss two GAN architectures for domain adaptation:
Coupled Generative Adversarial Networks (CoGAN)
CyCADA: Cycle-Consistent Adversarial Domain Adaptation
Generative Models
CoGAN
Figure: CoGAN architecture
Generative Models
Objective Function of CoGAN:
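The objective itself did not survive extraction. As a rough sketch, following Liu and Tuzel's CoGAN paper (notation from that paper, not from this slide), it is a constrained minimax game over the two GANs:

$\max_{g_1, g_2} \; \min_{f_1, f_2} \; V(f_1, f_2, g_1, g_2)$, with
$V = \mathbb{E}_{x_1 \sim p_{X_1}}[-\log f_1(x_1)] + \mathbb{E}_{z \sim p_Z}[-\log(1 - f_1(g_1(z)))] + \mathbb{E}_{x_2 \sim p_{X_2}}[-\log f_2(x_2)] + \mathbb{E}_{z \sim p_Z}[-\log(1 - f_2(g_2(z)))]$,
subject to weight-sharing constraints between the generators' first layers and the discriminators' last layers.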
Generative Models
CoGAN
CoGAN consists of two GANs, GAN1 and GAN2.
The GANs learn a joint distribution of multi-domain images by weight
sharing between the two generators and between the two discriminators.
Both GANs generate samples resembling the source and target images, in
which the features common to both domains are more prevalent.
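A schematic sketch of the weight-sharing idea in PyTorch (a shared module reused by both GANs; layer sizes and names are illustrative, not taken from the CoGAN paper):

```python
import torch
import torch.nn as nn

# The first (semantic) layers are shared between the two generators;
# only the last layers, which render domain-specific details, differ.
shared_g = nn.Sequential(nn.Linear(100, 256), nn.ReLU())
head_g1 = nn.Linear(256, 784)   # generator head for domain 1 (e.g. MNIST-like)
head_g2 = nn.Linear(256, 784)   # generator head for domain 2 (e.g. USPS-like)

def g1(z): return head_g1(shared_g(z))
def g2(z): return head_g2(shared_g(z))

# Symmetrically, the two discriminators share their last layers.
body_d1 = nn.Sequential(nn.Linear(784, 256), nn.ReLU())
body_d2 = nn.Sequential(nn.Linear(784, 256), nn.ReLU())
shared_d = nn.Linear(256, 1)

def d1(x): return shared_d(body_d1(x))
def d2(x): return shared_d(body_d2(x))

z = torch.randn(16, 100)
x1, x2 = g1(z), g2(z)   # a pair of images generated from the same latent code
```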
Generative Models
CyCADA
Figure: CyCADA architecture
Generative Models
CyCADA
CyCADA generates source samples styled as target samples: the generator
starts from a source image and tries to make it look like a target image.
In addition to the GAN losses, it incorporates two other important losses:
a cycle-consistency loss and a semantic-consistency loss.
It tries to produce target-styled source samples that preserve the content
of the original source samples, so that the corresponding class labels do
not change.
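A sketch of these two extra losses (the mapping networks G_st, G_ts and the pretrained source classifier f_src are assumed to exist; names are illustrative, not from the CyCADA paper):

```python
import torch
import torch.nn as nn

l1 = nn.L1Loss()
ce = nn.CrossEntropyLoss()

def cycle_loss(x_s, G_st, G_ts):
    """Cycle-consistency: source -> target style -> back to source should reconstruct x_s."""
    return l1(G_ts(G_st(x_s)), x_s)

def semantic_loss(x_s, G_st, f_src):
    """Semantic consistency: a fixed classifier pretrained on the source domain should
    assign the stylized image the same label it assigns the original image."""
    with torch.no_grad():
        pseudo_labels = f_src(x_s).argmax(dim=1)
    return ce(f_src(G_st(x_s)), pseudo_labels)

# These terms are added to the usual GAN losses when training the mappings.
```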
Results
Model Dataset Source only DA
DANN MNIST to MNIST-M 0.5185 0.7886
Coupled GAN MNIST to USPS 0.64 0.910
CyCADA MNIST to USPS 0.64 0.96
Table: Target-domain accuracy when training on the source only vs. with domain adaptation (DA)
References I
S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and
J. W. Vaughan.
A theory of learning from different domains.
Machine learning, 79(1-2):151–175, 2010.
S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira.
Analysis of representations for domain adaptation.
In Advances in neural information processing systems, pages 137–144,
2007.
Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle,
F. Laviolette, M. Marchand, and V. Lempitsky.
Domain-adversarial training of neural networks.
The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
References II
J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. A.
Efros, and T. Darrell.
Cycada: Cycle-consistent adversarial domain adaptation.
arXiv preprint arXiv:1711.03213, 2017.
M.-Y. Liu and O. Tuzel.
Coupled generative adversarial networks.
In Advances in neural information processing systems, pages 469–477,
2016.
M. Long, Z. Cao, J. Wang, and M. I. Jordan.
Conditional adversarial domain adaptation.
In Advances in Neural Information Processing Systems, pages
1640–1650, 2018.
Thank You