Domain Invariant Representation Learning with Domain Density Transformations
A. Tuan Nguyen, Toan Tran, Yarin Gal, Atılım Güneş Baydin, arXiv:2102.05082
PR-320, Presented by Eddie
1. Domain Generalization
2. Domain Generalization and Domain Adaptation
Domain generalization refers to training methods that aim to build a model that is not tied to any particular domain, so that it can cope with previously unseen (out-of-distribution) domains.
The difference is that domain adaptation can obtain information from unlabeled data of the target domain, whereas domain generalization cannot.
Example: train on painting data from the Baroque period and test on painting data from the Modern period. The trained model recognizes "Caravaggio" on the source domain, but what should it predict on the unseen target domain?
(Thanh-Dat Truong, et al., Recognition in Unseen Domains: Domain Generalization via Universal Non-volume Preserving Models)
"Domain generalization is a difficult problem because predictions must be made without any information about the target domain."
3. [Domain Invariance] Marginal and Conditional Alignment
Definition 1 (Marginal Distribution Alignment). The representation z is said to satisfy the marginal distribution alignment condition if p(z|d) is invariant w.r.t. d.
Definition 2 (Conditional Distribution Alignment). The representation z is said to satisfy the conditional distribution alignment condition if p(y|z,d) is invariant w.r.t. d.
4. Proposed Method
The proposed learning objective: given the set of source domains Ds = {d1, d2, ..., dK}, minimize

Ed,d'∈Ds, p(x,y|d) [ l(y, gθ(x)) + ||gθ(x) − gθ(fd,d'(x))||₂² ]

where
- D: the set of possible domains, X: the data space, Y: the label space, Z: the representation space, with d, d' ∈ D, x ∈ X, y ∈ Y, z ∈ Z
- gθ : X → Z: the domain representation function that maps an input x to a representation z
- l(y, gθ(x)): the loss between the prediction (computed from gθ(x)) and the label y
- d and d': a source domain and another (different) source domain
- fd,d' : X → X: the density transformation function that maps an input x of domain d into domain d'
- ||gθ(x) − gθ(fd,d'(x))||₂²: the squared distance between the representation of x and the representation of x after the domain transformation
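To make the objective concrete, below is a minimal PyTorch-style sketch of the per-batch loss. The names (rep_net for gθ, classifier for the prediction head, transform_fn for fd,d', and the weight lam) are illustrative assumptions, not the authors' code.

import torch
import torch.nn.functional as F

def dir_loss(rep_net, classifier, transform_fn, x, y, lam=1.0):
    """Prediction loss plus the invariance penalty of the objective above (sketch).

    rep_net      : g_theta, maps inputs x to representations z
    classifier   : h, predicts class logits from z
    transform_fn : f_{d,d'}, maps x from its source domain d to another domain d'
    lam          : weight on the invariance term (assumed 1.0 by default)
    """
    z = rep_net(x)                               # z = g_theta(x)
    z_t = rep_net(transform_fn(x))               # g_theta(f_{d,d'}(x))
    pred_loss = F.cross_entropy(classifier(z), y)        # l(y, g_theta(x))
    inv_loss = ((z - z_t) ** 2).sum(dim=1).mean()        # ||g_theta(x) - g_theta(f_{d,d'}(x))||_2^2
    return pred_loss + lam * inv_loss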
A. Tuan Nguyen¹, Toan Tran², Yarin Gal¹, Atılım Güneş Baydin¹ (¹University of Oxford, ²VinAI Research)
Abstract
Domain generalization refers to the problem where we aim to train a model on data from a set of source domains so that the model can generalize to unseen target domains. Naively training a model on the aggregate set of data (pooled from all source domains) has been shown to perform suboptimally, since the information learned by that model might be domain-specific and generalize imperfectly to target domains. To tackle this problem, a predominant approach is to find and learn some domain-invariant information in order to use it for the prediction task. In this paper, we propose a theoretically grounded method to learn a domain-invariant representation by enforcing the representation network to be invariant under all transformation functions among domains. We also show how to use generative adversarial networks to learn such domain transformations to implement our method in practice. We demonstrate the effectiveness of our method on several widely used datasets for the domain generalization problem, on all of which we achieve competitive results with state-of-the-art models.
1. Introduction
Domain generalization refers to the machine learning scenario in which we aim to train a model on data from a set of source domains so that it can generalize to unseen target domains.
Figure 1. An example of two domains. For each domain, x is uniformly distributed on the outer circle (radius 2 for domain 1 and radius 3 for domain 2), with the color indicating class label y. After the transformation z = x/||x||2, the marginal of z is aligned (uniformly distributed on the unit circle for both domains), but the conditional p(y|z) is not aligned. Thus, using this representation for predicting y would not generalize well across domains.
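The Figure 1 setup can be reproduced numerically. The sketch below is my own illustration (the exact class layout of the two domains is an assumption chosen to mimic the figure): after z = x/||x||₂ the marginals match, but the label distribution given z flips between domains.

import numpy as np

rng = np.random.default_rng(0)
n = 5000
theta1 = rng.uniform(0, 2 * np.pi, n)
theta2 = rng.uniform(0, 2 * np.pi, n)

# Domain 1: radius 2, label = 1 on the upper half of the circle.
x1 = 2.0 * np.stack([np.cos(theta1), np.sin(theta1)], axis=1)
y1 = (np.sin(theta1) > 0).astype(int)

# Domain 2: radius 3, label flipped on the upper half (assumed layout to mimic Figure 1).
x2 = 3.0 * np.stack([np.cos(theta2), np.sin(theta2)], axis=1)
y2 = (np.sin(theta2) < 0).astype(int)

# Representation z = x / ||x||_2: both domains land on the unit circle,
# so the marginal p(z|d) is aligned ...
z1 = x1 / np.linalg.norm(x1, axis=1, keepdims=True)
z2 = x2 / np.linalg.norm(x2, axis=1, keepdims=True)
print(np.allclose(np.linalg.norm(z1, axis=1), 1.0), np.allclose(np.linalg.norm(z2, axis=1), 1.0))

# ... but the conditional p(y|z) is not: for z in the upper half plane,
# y is almost surely 1 in domain 1 and almost surely 0 in domain 2.
upper1, upper2 = z1[:, 1] > 0, z2[:, 1] > 0
print(y1[upper1].mean(), y2[upper2].mean())   # ~1.0 vs ~0.0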
In the representation learning framework, the prediction y = f(x), where x is data and y is a label, is obtained as a composition y = h ◦ g(x) of a deep representation network z = g(x), where z is a learned representation of the data x, and a smaller classifier y = h(z), predicting the label y given the representation z, both of which are shared across domains. Current "domain-invariance"-based methods in domain generalization focus on either the marginal distribution alignment (Muandet et al., 2013) or the conditional distribution alignment (Li et al., 2018b;c), which are still prone to …
Q1) Does a domain-invariant representation function exist?
Q2) Is gθ(x) domain-invariant, i.e., does gθ(x) − gθ(fd,d'(x)) = 0 hold for all d, d'?
Theorem 1: Does a domain-invariant representation function exist?
Theorem 1. The invariance of p(y|d) across domains is the necessary and sufficient condition for the existence of a domain-invariant representation (that aligns both the marginal and conditional distribution).
In other words, 'p(y|d) being invariant across domains' and 'a domain-invariant representation (function) existing' are equivalent.
(⇒) 'A domain-invariant representation (function) exists' ⇒ 'p(y|d) is invariant across domains':
p(y,z|d) = p(y|z,d) p(z|d) = p(y|z,d') p(z|d') = p(y,z|d')
∴ p(y|d) = p(y|d') (∵ marginalizing over z)
The main difference between domain generalization (Khosla et al., 2012; Muandet et al., 2013; Ghifary et al., 2015) and domain adaptation (Zhao et al., 2019; Zhang et al., 2019; Combes et al., 2020; Tanwani, 2020) is that, in domain generalization, the learner does not have access to (even a small amount of) data of the target domain, making the problem much more challenging.
One of the most common domain generalization approaches is to learn an invariant representation across domains, aiming at a good generalization performance on target domains. Marginal alignment refers to making the representation distribution p(z) the same across domains. This is essential since, if p(z) for the target domain is different from that of the source domains, the classification network h(z) would face out-of-distribution data, because the representation z it receives as input at test time would be different from the ones it was trained with in the source domains. Conditional alignment refers to aligning p(y|z), the conditional distribution of the label given the representation, since if this conditional for the target domain is different from that of the source domains, the classification network (trained on the source domains) would give inaccurate predictions at test time. The formal definition of the two alignments is discussed in Section 3.
1) A ⇔ B means A ⇒ B and B ⇒ A.
(⇐) 'p(y|d) is invariant across domains' ⇒ 'a domain-invariant representation (function) exists':
If p(y|d) is unchanged w.r.t. the domain d, then we can always find a domain-invariant representation (this is trivial). For example, p(z|x) = δ0(z|x) for the deterministic case (that maps all x to 0), or p(z|x) = N(z; 0, 1) for the probabilistic case.
2) (⇒) 'A domain-invariant representation (function) exists' means there exists a representation z satisfying both the marginal and conditional distribution alignment; as shown above, this implies that p(y|d) is invariant across domains.
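For intuition, a two-line check (my own toy illustration) that the constant representation mentioned above trivially aligns the distributions while carrying no information about y, which is why the prediction loss l(y, gθ(x)) appears alongside the invariance term in the learning objective later:

import numpy as np

rng = np.random.default_rng(0)
x_d1, x_d2 = rng.normal(0, 1, 1000), rng.normal(5, 2, 1000)   # two very different domains

g = lambda x: np.zeros_like(x)        # the trivial representation p(z|x) = delta_0
z1, z2 = g(x_d1), g(x_d2)

# Marginal alignment holds trivially: p(z|d1) = p(z|d2) = delta_0,
# and conditional alignment also holds whenever p(y|d) is domain-invariant
# (the condition of Theorem 1), but z carries no information about y.
print(np.array_equal(np.unique(z1), np.unique(z2)))   # True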
Theorem 2: Is gθ(x) domain-invariant?
The invariance condition: gθ(x) − gθ(fd,d'(x)) = 0, i.e., gθ(x) = gθ(fd,d'(x)) for all x.
Theorem 2. Given an invertible and differentiable function fd,d' that transforms the data density from domain d to d' (with the inverse fd',d that transforms the data density from d' to d, as described above). Assuming that the representation z = gθ(x) satisfies the invariance condition above, then it aligns both the marginal and the conditional of the data distribution for domains d and d' (Marginal Alignment and Conditional Alignment).
Figure 3. Domain density transformation. If we know the function f1,2 that transforms the data density from domain 1 to domain 2, we can learn a domain-invariant representation network gθ(x) by enforcing it to be invariant under f1,2, i.e., gθ(x1) = gθ(x2) for any x2 = f1,2(x1).

Proof of Theorem 2. Write x = fd',d(x') for the change of variables from domain d' to domain d, and let J(x') denote the Jacobian matrix of fd',d evaluated at x'. Because fd',d transforms the data density of domain d' into that of domain d,
p(fd',d(x') | d) = p(x' | d') |det J(x')|⁻¹  and  p(fd',d(x') | y, d) = p(x' | y, d') |det J(x')|⁻¹,
and the invariance condition gθ(x) = gθ(fd,d'(x)) implies p(z | fd',d(x')) = p(z | x') for all x'.

i) Marginal alignment: ∀z we have
p(z|d) = ∫ p(x|d) p(z|x) dx
= ∫ p(fd',d(x')|d) p(z|fd',d(x')) |det J(x')| dx'    (substituting x = fd',d(x'))
= ∫ p(x'|d') |det J(x')|⁻¹ p(z|x') |det J(x')| dx'
= ∫ p(x'|d') p(z|x') dx' = p(z|d')    (8)

ii) Conditional alignment: ∀z, y we have
p(z|y,d) = ∫ p(x|y,d) p(z|x) dx
= ∫ p(fd',d(x')|y,d) p(z|fd',d(x')) |det J(x')| dx'
= ∫ p(x'|y,d') |det J(x')|⁻¹ p(z|x') |det J(x')| dx'
= ∫ p(x'|y,d') p(z|x') dx' = p(z|y,d')    (9)

Note that
p(y|z,d) = p(y,z|d) / p(z|d) = p(y|d) p(z|y,d) / p(z|d)    (10)
Since p(y|d) = p(y) = p(y|d') (the condition of Theorem 1), p(z|y,d) = p(z|y,d') and p(z|d) = p(z|d'), we have
p(y|z,d) = p(y|d') p(z|y,d') / p(z|d') = p(y|z,d')    (11)

This theorem indicates that, if we can find the functions f's that transform the data densities among the domains, we can learn a domain-invariant representation z by encouraging the representation to be invariant under all the transformations f's. This idea is illustrated in Figure 3. We therefore can use the following learning objective to learn a domain-invariant representation z = gθ(x):

Ed [ Ep(x,y|d) [ l(y, gθ(x)) + Ed' [ ||gθ(x) − gθ(fd,d'(x))||₂² ] ] ]    (12)

where l(y, gθ(x)) is the prediction loss of a network that predicts y given z = gθ(x), and the second term is to enforce the invariance condition in Eq 7.
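As a quick numerical sanity check of the marginal-alignment step (a toy example of mine, not from the paper): take two 1-D domains related by the shift fd,d'(x) = x + 3 and a representation gθ with period 3, hence invariant under the shift; Eq. 8 then predicts that the distribution of z is the same in both domains.

import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Domain d: x ~ N(0, 1); domain d': x ~ N(3, 1), i.e., the pushforward of d under f(x) = x + 3.
x_d = rng.normal(0.0, 1.0, n)
x_dp = rng.normal(3.0, 1.0, n)

# A representation invariant under the shift by 3 (any period-3 function works).
g = lambda x: np.sin(2 * np.pi * x / 3.0)
z_d, z_dp = g(x_d), g(x_dp)

# Compare the two marginals p(z|d) and p(z|d') via moments and histograms.
print(z_d.mean(), z_dp.mean())     # nearly identical
print(z_d.std(), z_dp.std())       # nearly identical
hist_d, edges = np.histogram(z_d, bins=50, range=(-1, 1), density=True)
hist_dp, _ = np.histogram(z_dp, bins=edges, density=True)
print(np.max(np.abs(hist_d - hist_dp)))   # small (sampling noise only)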
Assume that we have a set of K source domains Ds = {d1, d2, ..., dK}; the objective function in Eq. 12 becomes:

Ed,d'∈Ds, p(x,y|d) [ l(y, gθ(x)) + ||gθ(x) − gθ(fd,d'(x))||₂² ]    (13)

In the next section, we show how one can incorporate this idea into real-world domain generalization problems with generative adversarial networks.

4. Domain Generalization with Generative Adversarial Networks
In practice, we will learn the functions f's that transform the data distributions between domains, and one can use several generative modeling frameworks, e.g., normalizing flows (Grover et al., 2020) or GANs (Zhu et al., 2017; Choi et al., 2018; 2020), to learn such functions. One advantage of normalizing flows is that this transformation is naturally invertible by design of the neural network. In addition, the determinant of the Jacobian of that transformation can be efficiently computed. However, since we do not need access to the Jacobian once the training of the generative model is completed, we propose the use of GANs to inherit their rich network capacity. In particular, we use the StarGAN (Choi et al., 2018) model, which is designed for image domain transformations.

The goal of StarGAN is to learn a unified network G that transforms the data density among multiple domains. In particular, the network G(x, d, d') (i.e., G is conditioned on the image x and the two different domains d, d') transforms an image x from domain d to domain d'. Different from the original StarGAN model that only takes the image x and the desired destination domain d' as its input, in our implementation we feed both the original domain d and the desired destination domain d', together with the original image x, to the generator G.

The generator's goal is to fool a discriminator D into thinking that the transformed image belongs to the destination domain d'. In other words, the equilibrium state of StarGAN, in which G completely fools D, is when G successfully transforms the data density of the original domain to that of the destination domain. After training, we use G(., d, d') as the function fd,d'(.) described in the previous section and perform the representation learning via the objective function in Eq 13.

Three important loss functions of the StarGAN architecture are:
• The domain classification loss Lcls that encourages the generator G to generate images that correctly belong to the desired destination domain d'.
• The adversarial loss Ladv, which is the classification loss of a discriminator D that tries to distinguish between real images and the fake images generated by G. The equilibrium state of StarGAN is when G completely fools D, which means the distribution of the generated images (via G(x, d, d'), x ∼ p(x|d)) becomes the distribution of the real images of the destination domain p(x'|d'). This is our objective, i.e., to learn a function that transforms domains' densities.
• The reconstruction loss Lrec = Ex,d,d' [||x − G(x', d', d)||1], where x' = G(x, d, d'), to ensure that the transformations preserve the image's content. Note that this also aligns with our interest, since we want G(., d', d) to be the inverse of G(., d, d'), which will minimize Lrec to zero.

We can enforce the generator G to transform the data distribution within the class y (e.g., p(x|y, d) to p(x'|y, d') ∀y) by sampling each minibatch with data from the same class y, so that the discriminator will distinguish the transformed images from the real images of class y and domain d'. However, we found that this constraint can be relaxed in practice, and the generator almost always transforms the image within the original class y.

As mentioned earlier, after training the StarGAN model, we can use the generator G(., d, d') as our fd,d'(.) function and learn a domain-invariant representation via the learning objective in Eq 13. We name this implementation of our method DIR-GAN (domain-invariant representation learning with generative adversarial networks).
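A hedged sketch of one DIR-GAN training step under the description above; the module names and the assumption that the pre-trained generator is called as G(x, d, d') are illustrative, not the released implementation.

import torch
import torch.nn.functional as F

def dir_gan_step(rep_net, classifier, G, optimizer, x, y, d, num_domains):
    """One training step of the Eq. 13 objective with a frozen generator G (sketch).

    x : batch of images, y : labels, d : integer source-domain ids.
    G(x, d, d_prime) is assumed to translate x from domain d to domain d_prime.
    """
    # Sample a destination domain d' for every example (d' may occasionally equal d).
    d_prime = torch.randint(0, num_domains, d.shape, device=d.device)
    with torch.no_grad():                      # G is pre-trained and kept fixed
        x_t = G(x, d, d_prime)                 # f_{d,d'}(x)

    z, z_t = rep_net(x), rep_net(x_t)
    loss = F.cross_entropy(classifier(z), y) + ((z - z_t) ** 2).sum(dim=1).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()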
5. Domain Generalization with Generative Adversarial Networks (StarGAN; PR-152)
Figure 2 (StarGAN). Comparison between cross-domain models and StarGAN. (a) To handle multiple domains, cross-domain models should be built for every pair of image domains. (b) StarGAN is capable of learning mappings among multiple domains using a single generator; the figure represents a star topology connecting multi-domains.
Figure 3 (StarGAN). Overview of StarGAN, consisting of two modules, a discriminator D and a generator G. (a) D learns to distinguish between real and fake images and classify the real images to their corresponding domain. (b) G takes in as input both the image and the target domain label and generates a fake image; the target domain label is spatially replicated and concatenated with the input image. (c) G tries to reconstruct the original image from the fake image given the original domain label. (d) G tries to generate images indistinguishable from real images and classifiable as the target domain by D.
Conditional GANs. GAN-based conditional image generation has been actively studied. Prior studies have provided both the discriminator and generator with class information in order to generate samples conditioned on the class [20, 21, 22]. Other recent approaches focused on generating particular images highly relevant to a given text description [25, 30]. The idea of conditional image generation has also been successfully applied to domain transfer [9, 28], super-resolution imaging [14], and photo editing [2, 27]. In this paper, we propose a scalable GAN framework that can flexibly steer the image translation to various target domains, by providing conditional domain information.
Image-to-Image Translation. Recent work has achieved impressive results in image-to-image translation [7, 9, 17, 33]. For instance, pix2pix [7] learns this task in a supervised manner using cGANs [20]. It combines an adversarial loss with an L1 loss, thus requires paired data samples. To alleviate the problem of obtaining data pairs, unpaired frameworks have been proposed.
3. Star Generative Adversarial Networks
We first describe our proposed StarGAN, a framework to address multi-domain image-to-image translation within a single dataset. Then, we discuss how StarGAN incorporates multiple datasets containing different label sets to flexibly perform image translations using any of these labels.
3.1. Multi-Domain Image-to-Image Translation
Our goal is to train a single generator G that learns mappings among multiple domains. To achieve this, we train G to translate an input image x into an output image y conditioned on the target domain label c, G(x, c) → y. We randomly generate the target domain label c so that G learns to flexibly translate the input image. We also introduce an auxiliary classifier [22] that allows a single discriminator to control multiple domains. That is, our discriminator produces probability distributions over both sources and domain labels.
Training Strategy. When training StarGAN with multiple datasets, we use the domain label c̃ defined in Eq. (7) as input to the generator. By doing so, the generator learns to ignore the unspecified labels, which are zero vectors, and focus on the explicitly given label. The structure of the generator is exactly the same as in training with a single dataset, except for the dimension of the input label c̃. On the other hand, we extend the auxiliary classifier of the discriminator to generate probability distributions over labels for all datasets. Then, we train the model in a multi-task learning setting, where the discriminator tries to minimize only the classification error associated to the known label. For example, when training with images in CelebA, the discriminator minimizes only classification errors for labels related to CelebA attributes, and not facial expressions related to RaFD. Under these settings, by alternating between CelebA and RaFD, the discriminator learns all of the discriminative features for both datasets, and the generator learns to control all the labels in both datasets.
Reconstruction Loss. By minimizing the adversarial and classification losses, G is trained to generate images that are realistic and classified to the correct target domain. However, minimizing these losses (Eqs. (1) and (3)) does not guarantee that translated images preserve the content of the input images while changing only the domain-related part. To alleviate this problem, we apply a cycle consistency loss [9, 33] to the generator, defined as
Lrec = Ex,c,c' [||x − G(G(x, c), c')||1],    (4)
where G takes in the translated image G(x, c) and the original domain label c' as input and tries to reconstruct the original image x. We adopt the L1 norm as our reconstruction loss. Note that we use a single generator twice, first to translate an original image into an image in the target domain and then to reconstruct the original image from the translated image.
Full Objective. Finally, the objective functions to optimize G and D are written, respectively, as
LD = −Ladv + λcls Lr_cls,    (5)
LG = Ladv + λcls Lf_cls + λrec Lrec,    (6)
where λcls and λrec are hyper-parameters that control the relative importance of domain classification and reconstruction losses, respectively, compared to the adversarial loss. We use λcls = 1 and λrec = 10 in all of our experiments.
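For reference, a compact sketch of how the StarGAN objectives combine, using the quoted hyper-parameters λcls = 1 and λrec = 10. The module interfaces and the binary-cross-entropy form of the adversarial term are assumptions for illustration, not the official StarGAN code.

import torch
import torch.nn.functional as F

lambda_cls, lambda_rec = 1.0, 10.0

def stargan_losses(G, D, x, c_orig, c_trg):
    """Discriminator and generator losses in the spirit of Eqs. (1)-(6) (sketch).

    D(x) is assumed to return (src_logit, cls_logits): a real/fake score and
    domain-classification logits; c_orig and c_trg are integer domain labels.
    """
    x_fake = G(x, c_trg)

    # Discriminator: adversarial term plus classification of real images (Eqs. 1, 2, 5).
    src_real, cls_real = D(x)
    src_fake, _ = D(x_fake.detach())
    d_adv = F.binary_cross_entropy_with_logits(src_real, torch.ones_like(src_real)) + \
            F.binary_cross_entropy_with_logits(src_fake, torch.zeros_like(src_fake))
    d_cls = F.cross_entropy(cls_real, c_orig)
    loss_D = d_adv + lambda_cls * d_cls

    # Generator: fool D, classify fakes as the target domain, cycle reconstruction (Eqs. 1, 3, 4, 6).
    src_fake, cls_fake = D(x_fake)
    g_adv = F.binary_cross_entropy_with_logits(src_fake, torch.ones_like(src_fake))
    g_cls = F.cross_entropy(cls_fake, c_trg)
    g_rec = (x - G(x_fake, c_orig)).abs().mean()     # L1 cycle-consistency loss
    loss_G = g_adv + lambda_cls * g_cls + lambda_rec * g_rec

    return loss_D, loss_G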
5. Experiments
5.1. Datasets
To evaluate our method, we perform experiments on three datasets that are commonly used in the literature for domain generalization.
Rotated MNIST. In this dataset by Ghifary et al. (2015), 1,000 MNIST images (100 per class) (LeCun & Cortes, 2010) are chosen to form the first domain (denoted M0); then rotations of 15°, 30°, 45°, 60° and 75° are applied to create five additional domains, denoted M15, M30, M45, M60 and M75. The task is classification with ten classes (digits 0 to 9).
PACS (Li et al., 2017) contains 9,991 images from four different domains: art painting, cartoon, photo, sketch. The task is classification with seven classes.
OfficeHome (Venkateswara et al., 2017) has 15,500 images of daily objects from four domains: art, clipart, product and real. There are 65 classes in this classification dataset.
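The Rotated MNIST domains described above could be constructed roughly as follows (a sketch assuming torchvision and SciPy are available; the subsampling seed and preprocessing are illustrative, not the authors' exact pipeline):

import numpy as np
from torchvision import datasets
from scipy.ndimage import rotate

mnist = datasets.MNIST(root="./data", train=True, download=True)
images = mnist.data.numpy()          # (60000, 28, 28)
labels = mnist.targets.numpy()

# 100 images per class -> 1,000 images for the base domain M0.
rng = np.random.default_rng(0)
idx = np.concatenate([rng.choice(np.where(labels == c)[0], 100, replace=False) for c in range(10)])
base = images[idx]

# Each additional domain is the same 1,000 images rotated by a fixed angle.
domains = {f"M{a}": np.stack([rotate(img, a, reshape=False) for img in base]) for a in (0, 15, 30, 45, 60, 75)}
print({k: v.shape for k, v in domains.items()})   # each (1000, 28, 28)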
Table 1. Rotated MNIST leave-one-domain-out experiment. Reported numbers are mean accuracy and standard deviation among 5 runs.
Model | M0 | M15 | M30 | M45 | M60 | M75 | Average
Domain Classification Loss. For a given input image x and a target domain label c, our goal is to translate x into an output image y, which is properly classified to the target domain c. To achieve this condition, we add an auxiliary classifier on top of D and impose the domain classification loss when optimizing both D and G. That is, we decompose the objective into two terms: a domain classification loss of real images used to optimize D, and a domain classification loss of fake images used to optimize G. In detail, the former is defined as
Lr_cls = Ex,c' [− log Dcls(c'|x)],    (2)
where the term Dcls(c'|x) represents a probability distribution over domain labels computed by D. By minimizing this objective, D learns to classify a real image x to its corresponding original domain c'. We assume that the input image and domain label pair (x, c') is given by the training data. On the other hand, the loss function for the domain classification of fake images is defined as
Lf_cls = Ex,c[− log Dcls(c|G(x, c))].    (3)
In other words, G tries to minimize this objective to generate images that can be classified as the target domain c.
3.2. Training with Multiple Datasets
An important advantage of StarGAN is that it simultaneously incorporates multiple datasets containing different types of labels, so that StarGAN can control all the labels at the test phase. An issue when learning from multiple datasets, however, is that the label information is only partially known to each dataset. In the case of CelebA [19] and RaFD [13], while the former contains labels for attributes such as hair color and gender, it does not have any labels for facial expressions such as 'happy' and 'angry', and vice versa for the latter. This is problematic because the complete information on the label vector c' is required when reconstructing the input image x from the translated image G(x, c) (see Eq. (4)).
Mask Vector. To alleviate this problem, we introduce a mask vector m that allows StarGAN to ignore unspecified labels and focus on the explicitly known label provided by a particular dataset. In StarGAN, we use an n-dimensional one-hot vector to represent m, with n being the number of datasets. In addition, we define a unified version of the label as a vector
c̃ = [c1, ..., cn, m],    (7)
where [·] refers to concatenation, and ci represents a vector for the labels of the i-th dataset. The vector of the known label ci can be represented as either a binary vector for binary attributes or a one-hot vector for categorical attributes. For the remaining n−1 unknown labels we simply assign zero values.
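A tiny sketch of the unified label vector c̃ = [c1, ..., cn, m] from Eq. (7) for the two-dataset (CelebA + RaFD) case described above; the attribute and expression dimensions are made-up placeholders:

import numpy as np

def unified_label(celeba_attrs=None, rafd_expr=None):
    """Build c~ = [c_CelebA, c_RaFD, m] with a one-hot mask over the two datasets.

    celeba_attrs : binary vector of CelebA attributes (assumed length 5), or None
    rafd_expr    : one-hot vector of RaFD expressions (assumed length 8), or None
    Unknown labels are filled with zeros, matching the training strategy above.
    """
    c_celeba = np.zeros(5) if celeba_attrs is None else np.asarray(celeba_attrs, float)
    c_rafd = np.zeros(8) if rafd_expr is None else np.asarray(rafd_expr, float)
    m = np.array([1.0, 0.0]) if rafd_expr is None else np.array([0.0, 1.0])
    return np.concatenate([c_celeba, c_rafd, m])

print(unified_label(celeba_attrs=[1, 0, 0, 1, 1]))   # CelebA sample: RaFD part is zeroed
print(unified_label(rafd_expr=np.eye(8)[2]))         # RaFD sample: CelebA part is zeroed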
Generative Adversarial Networks. Generative adversarial networks (GANs) [3] have shown remarkable results in computer vision tasks such as image generation, image translation [7, 9, 33], super-resolution imaging [14], and face image synthesis [10, 16, 26, 31]. A GAN model consists of two modules: a discriminator and a generator. The discriminator learns to distinguish between real and fake samples, while the generator learns to generate fake samples that are indistinguishable from real samples. Our approach also leverages the adversarial loss to make the generated images as realistic as possible.
CycleGAN [33] and DiscoGAN [9] preserve key attributes between the input and the translated image by utilizing a cycle consistency loss. However, all these frameworks are only capable of learning the relations between two different domains at a time. Their approaches have limited scalability in handling multiple domains, since different models should be trained for each pair of domains. Unlike the aforementioned approaches, our framework can learn the relations among multiple domains using only a single model.
Adversarial Loss. To make the generated images indistinguishable from real images, we adopt an adversarial loss
Ladv = Ex [log Dsrc(x)] + Ex,c[log (1 − Dsrc(G(x, c)))],    (1)
where G generates an image G(x, c) conditioned on both the input image x and the target domain label c, while D tries to distinguish between real and fake images. In this paper, we refer to the term Dsrc(x) as a probability distribution over sources given by D. The generator G tries to minimize this objective, while the discriminator D tries to maximize it.
In practice, we will learn the functions f’s that transform
the data distributions between domains and one can use
several generative modeling frameworks, e.g., normalizing
flows (Grover et al., 2020) or GANs (Zhu et al., 2017; Choi
et al., 2018; 2020) to learn such functions. One advantage
of normalizing flows is that this transformation is naturally
invertible by design of the neural network. In addition, the
determinant of the Jacobian of that transformation can be
efficiently computed. However, due to the fact that we do
not need access to the Jacobian when the training process
of the generative model is completed, we propose the use
of GANs to inherit its rich network capacity. In particular,
we use the StarGAN (Choi et al., 2018) model, which is
designed for image domain transformations.
The goal of StarGAN is to learn a unified network G that transforms the data density among multiple domains. In particular, the network G(x, d, d′) (i.e., G is conditioned on the image x and the two different domains d, d′) transforms an image x from domain d to domain d′. Different from the original StarGAN model, which only takes the image x and the desired destination domain d′ as its input, in our implementation we feed both the original domain d and the desired destination domain d′ together with the original image x to the generator G.

The generator's goal is to fool a discriminator D into thinking that the transformed image belongs to the destination domain d′. In other words, the equilibrium state of StarGAN, in which G completely fools D, is when G successfully transforms the data density of the original domain to that of the destination domain. After training, we use G(·, d, d′) as the function f_{d,d′}(·) described in the previous section and perform the representation learning via the objective function in Eq. 13.

Three important loss functions of the StarGAN architecture are:

• Adversarial loss L_adv (Eq. 1) that pushes the translated images to be indistinguishable from real images of the destination domain.
• Domain classification loss L_cls that encourages the generator G to generate images that correctly belong to the desired destination domain d′.
• Reconstruction (cycle consistency) loss L_rec that encourages G to preserve the content of its input; at the equilibrium, we want G(·, d′, d) to be the inverse of G(·, d, d′), which will minimize L_rec to zero.

We can enforce the generator G to transform the data distribution within the class y (e.g., p(x|y, d) to p(x′|y, d′) for all y) by sampling each minibatch with data from the same class y, so that the discriminator distinguishes the transformed images from the real images of class y and domain d′. However, we found that this constraint can be relaxed in practice, and the generator almost always transforms the image within the original class y.

As mentioned earlier, after training the StarGAN model, we can use the generator G(·, d, d′) as our f_{d,d′}(·) function and learn a domain-invariant representation via the learning objective in Eq. 13. We name this implementation of our method DIR-GAN (domain-invariant representation learning with generative adversarial networks).
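Because the objective in Eq. 13 only needs the frozen generator G and the representation network g_θ, it is straightforward to write down. Below is a minimal sketch (not the authors' released code), assuming hypothetical modules g_theta (representation network), h (linear classifier), and a pretrained StarGAN-style generator G(x, d, d_prime); the trade-off weight lam is an illustrative addition not present in Eq. 13.

```python
# Minimal sketch of the DIR-GAN objective of Eq. 13 for a single minibatch.
import torch
import torch.nn.functional as F

def dir_gan_loss(g_theta, h, G, x, y, d, d_prime, lam=1.0):
    z = g_theta(x)                        # representation of the original image
    with torch.no_grad():                 # G is kept fixed after StarGAN training
        x_t = G(x, d, d_prime)            # f_{d,d'}(x): translate x to domain d'
    z_t = g_theta(x_t)                    # representation of the translated image
    cls_loss = F.cross_entropy(h(z), y)   # l(y, g_theta(x)) with classifier h
    inv_loss = (z - z_t).pow(2).sum(dim=1).mean()   # ||g(x) - g(f_{d,d'}(x))||_2^2
    return cls_loss + lam * inv_loss      # lam is a hypothetical trade-off weight
```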
6. Experiments / Results
1) Dataset
2) Results
5. Experiments
5.1. Datasets
To evaluate our method, we perform experiments in three
datasets that are commonly used in the literature for domain
generalization.
Rotated MNIST. In this dataset by Ghifary et al. (2015), 1,000 MNIST images (100 per class) (LeCun & Cortes, 2010) are chosen to form the first domain (denoted M0); then rotations of 15°, 30°, 45°, 60° and 75° are applied to create five additional domains, denoted M15, M30, M45, M60 and M75. The task is classification with ten classes (digits 0 to 9).
PACS (Li et al., 2017) contains 9,991 images from four
different domains: art painting, cartoon, photo, sketch. The
task is classification with seven classes.
OfficeHome (Venkateswara et al., 2017) has 15,500 im-
ages of daily objects from four domains: art, clipart, product
and real. There are 65 classes in this classification dataset.
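For concreteness, the rotated domains can be generated directly from a fixed MNIST subset. The sketch below is an illustration only; how the 1,000 images are sampled and batched is an assumption, not the original data pipeline.

```python
# Minimal sketch of building the six Rotated MNIST domains M0 ... M75.
import torch
import torchvision.transforms.functional as TF

ANGLES = [0, 15, 30, 45, 60, 75]          # domains M0, M15, ..., M75

def make_rotated_domains(images):
    """images: float tensor of shape (N, 1, 28, 28) holding the chosen digits."""
    return {f"M{a}": torch.stack([TF.rotate(img, float(a)) for img in images])
            for a in ANGLES}
```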
Table 1. Rotated MNIST leave-one-domain-out experiment. Reported numbers are mean accuracy and standard deviation among 5 runs.
Domains
Model M0 M15 M30 M45 M60 M75 Average
HIR (Wang et al., 2020) 90.34 99.75 99.40 96.17 99.25 91.26 96.03
DIVA (Ilse et al., 2020) 93.5 99.3 99.1 99.2 99.3 93.0 97.2
DGER (Zhao et al., 2020) 90.09 99.24 99.27 99.31 99.45 90.81 96.36
DA (Ganin et al., 2016) 86.7 98.0 97.8 97.4 96.9 89.1 94.3
LG (Shankar et al., 2018) 89.7 97.8 98.0 97.1 96.6 92.1 95.3
HEX (Wang et al., 2019) 90.1 98.9 98.9 98.8 98.3 90.0 95.8
ADV (Wang et al., 2019) 89.9 98.6 98.8 98.7 98.6 90.4 95.2
DIR-GAN (ours) 97.2(±0.3) 99.4(±0.1) 99.3(±0.1) 99.3(±0.1) 99.2(±0.1) 97.1(±0.3) 98.6
Table 2. PACS leave-one-domain-out experiment. Reported numbers are mean accuracy and standard deviation among 5 runs
PACS
Model Backbone Art Painting Cartoon Photo Sketch Average
DGER (Zhao et al., 2020) Resnet18 80.70 76.40 96.65 71.77 81.38
JiGen (Carlucci et al., 2019) Resnet18 79.42 75.25 96.03 71.35 79.14
MLDG (Li et al., 2018a) Resnet18 79.50 77.30 94.30 71.50 80.70
MetaReg (Balaji et al., 2018) Resnet18 83.70 77.20 95.50 70.40 81.70
CSD (Piratla et al., 2020) Resnet18 78.90 75.80 94.10 76.70 81.40
DMG (Chattopadhyay et al., 2020) Resnet18 76.90 80.38 93.35 75.21 81.46
DIR-GAN (ours) Resnet18 82.56(± 0.4) 76.37(± 0.3) 95.65(± 0.5) 79.89(± 0.2) 83.62
Table 3. OfficeHome leave-one-domain-out experiment. Reported numbers are mean accuracy and standard deviation among 5 runs
OfficeHome
Model Backbone Art ClipArt Product Real Average
D-SAM (D’Innocente & Caputo, 2018) Resnet18 58.03 44.37 69.22 71.45 60.77
JiGen (Carlucci et al., 2019) Resnet18 53.04 47.51 71.47 72.79 61.20
DIR-GAN (ours) Resnet18 56.69(±0.4) 50.49(±0.2) 71.32(±0.4) 74.23(±0.5) 63.18
5.2. Experimental Setting
For all datasets, we perform “leave-one-domain-out” exper-
iments, where we choose one domain as the target domain,
train the model on all remaining domains and evaluate it
on the chosen domain. Following standard practice, we use
90% of available data as training data and 10% as validation
data, except for the Rotated MNIST experiment where we
do not use a validation set and just report the performance
of the last epoch.
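The leave-one-domain-out protocol above can be summarized in a few lines. The following sketch is illustrative only, assuming a hypothetical domain_datasets mapping from domain name to a list of samples; the 90%/10% train/validation split follows the text (Rotated MNIST skips validation).

```python
# Minimal sketch of the leave-one-domain-out split described above.
import random

def leave_one_domain_out(domain_datasets, target_domain, val_fraction=0.1, seed=0):
    train, val = [], []
    rng = random.Random(seed)
    for name, samples in domain_datasets.items():
        if name == target_domain:
            continue                       # the held-out target domain is never seen
        samples = list(samples)
        rng.shuffle(samples)
        n_val = int(len(samples) * val_fraction)
        val.extend(samples[:n_val])
        train.extend(samples[n_val:])
    return train, val, domain_datasets[target_domain]
```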
For the Rotated MNIST dataset, we use a network of two
3x3 convolutional layers and a fully connected layer as the
representation network gθ to get a representation z of 64
dimensions. A single linear layer is then used to map the
representation z to the ten output classes. This architecture
is the deterministic version of the network used by Ilse et al.
(2020). We train our network for 500 epochs with the Adam
optimizer (Kingma & Ba, 2014), using learning rate 0.001 and minibatch size 64, and report performance on the test domain after the last epoch.
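A minimal sketch of this representation network is given below. The channel widths and pooling are assumptions, since the text only specifies two 3x3 convolutional layers, a fully connected layer producing a 64-dimensional z, and a single linear classifier.

```python
# Minimal sketch of the Rotated MNIST architecture described above.
import torch.nn as nn

g_theta = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 7 * 7, 64),             # 64-dimensional representation z
)
classifier = nn.Linear(64, 10)              # maps z to the ten digit classes
```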
For the PACS and OfficeHome datasets, we use a Resnet18
(He et al., 2016) network as the representation network gθ.
As a standard practice, the Resnet18 backbone is pre-trained
on ImageNet. We replace the last fully connected layer of
the Resnet with a linear layer of dimensions (512, 256) so
that our representation has 256 dimensions. As with the
Rotated MNIST experiment, we use a single layer to map
from the representation z to the output. We train the network
for 100 epochs with plain stochastic gradient descent (SGD)
using learning rate 0.001, momentum 0.9, minibatch size
64, and weight decay 0.001. Data augmentation is also
standard practice for real-world computer vision datasets
like PACS and OfficeHome, and during the training we
augment our data as follows: crops of random size and
aspect ratio, resizing to 224 × 224 pixels, random horizontal
flips, random color jitter, randomly converting the image
tile to grayscale with 10% probability, and normalization using the ImageNet channel means and standard deviations.
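The backbone and augmentation pipeline described above can be sketched as follows. The exact color-jitter strengths are assumptions (the text only says "random color jitter"), and torchvision transform names are used for illustration.

```python
# Minimal sketch of the PACS/OfficeHome setup described above.
import torch.nn as nn
from torchvision import models, transforms

g_theta = models.resnet18(pretrained=True)   # ImageNet-pretrained backbone
g_theta.fc = nn.Linear(512, 256)             # 256-dimensional representation z
classifier = nn.Linear(256, 7)               # 7 classes for PACS (65 for OfficeHome)

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),                       # random size/aspect-ratio crop
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),              # assumed jitter strengths
    transforms.RandomGrayscale(p=0.1),                       # grayscale with 10% probability
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
```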
Returning to the StarGAN generator that implements the domain transformations f_{d,d′}: besides the adversarial loss of Eq. (1), its training uses a domain classification loss and a reconstruction loss.

Domain Classification Loss. For a given input image x and a target domain label c, the goal is to translate x into an output image that is properly classified to the target domain c. To this end, an auxiliary classifier is added on top of D and a domain classification loss is imposed when optimizing both D and G. The objective is decomposed into two terms: a domain classification loss of real images used to optimize D,

L^r_cls = E_{x,c′}[− log D_cls(c′ | x)],   (2)

where D_cls(c′ | x) represents a probability distribution over domain labels computed by D, and a domain classification loss of fake images used to optimize G,

L^f_cls = E_{x,c}[− log D_cls(c | G(x, c))].   (3)

By minimizing Eq. (2), D learns to classify a real image x to its corresponding original domain c′; by minimizing Eq. (3), G tries to generate images that can be classified as the target domain c.

Reconstruction Loss. Minimizing the adversarial and classification losses trains G to generate images that are realistic and classified to the correct target domain. However, minimizing the losses in Eqs. (1) and (3) does not guarantee that translated images preserve the content of their inputs while changing only the domain-related part. To alleviate this problem, we apply a cycle consistency loss [9, 33] to the generator, defined as

L_rec = E_{x,c,c′}[ ||x − G(G(x, c), c′)||_1 ],   (4)

where G takes in the translated image G(x, c) and the original domain label c′ as input and tries to reconstruct the original image x. We adopt the L1 norm as our reconstruction loss. Note that we use a single generator twice: first to translate an original image into an image in the target domain, and then to reconstruct the original image from the translated image.

Full Objective. Finally, the objective functions to optimize G and D are written, respectively, as

L_D = −L_adv + λ_cls L^r_cls,   (5)
L_G = L_adv + λ_cls L^f_cls + λ_rec L_rec,   (6)

where λ_cls and λ_rec are hyper-parameters that control the relative importance of the domain classification and reconstruction losses, respectively, compared to the adversarial loss. We use λ_cls = 1 and λ_rec = 10 in all of our experiments.
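A compact sketch of how these terms combine into the objectives of Eqs. (5) and (6) is given below. It is illustrative only, not the official StarGAN code: d_cls_real and d_cls_fake stand for the auxiliary-classifier logits of D for real and translated images, and l_adv is the adversarial term of Eq. (1), e.g., computed as in the earlier sketch.

```python
# Minimal sketch of the StarGAN objectives in Eqs. (2)-(6); not the official code.
import torch.nn.functional as F

def stargan_objectives(G, x, x_fake, c_org, c_trg, d_cls_real, d_cls_fake,
                       l_adv, lambda_cls=1.0, lambda_rec=10.0):
    l_cls_real = F.cross_entropy(d_cls_real, c_org)  # Eq. (2): classify real x as its domain c'
    l_cls_fake = F.cross_entropy(d_cls_fake, c_trg)  # Eq. (3): classify G(x, c) as the target c
    x_rec = G(x_fake, c_org)                         # translate back with the original label
    l_rec = (x - x_rec).abs().mean()                 # Eq. (4): L1 cycle-consistency loss
    loss_D = -l_adv + lambda_cls * l_cls_real        # Eq. (5)
    loss_G = l_adv + lambda_cls * l_cls_fake + lambda_rec * l_rec   # Eq. (6)
    return loss_D, loss_G                            # in practice D and G are updated alternately
```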
Mask Vector. An issue when training StarGAN on multiple datasets (in the original paper, CelebA and RaFD) is that the label information is only partially known to each dataset: CelebA has attribute labels such as hair color and gender but no facial-expression labels, and vice versa for RaFD. This is problematic because the complete label vector c is required when reconstructing the input image x from the translated image G(x, c) (see Eq. (4)). To alleviate this problem, a mask vector m is introduced that allows StarGAN to ignore unspecified labels and focus on the explicitly known label provided by a particular dataset. An n-dimensional one-hot vector represents m, with n being the number of datasets, and a unified version of the label is defined as a vector

c̃ = [c_1, ..., c_n, m],   (7)

where [·] refers to concatenation, and c_i represents a vector for the labels of the i-th dataset. The vector of the known label c_i can be represented as either a binary vector for binary attributes or a one-hot vector for categorical attributes. For the remaining n − 1 unknown labels we simply assign zero values. In the experiments, the CelebA and RaFD datasets are used, where n is two.

Training Strategy. When training StarGAN with multiple datasets, we use the domain label c̃ defined in Eq. (7) as input to the generator. By doing so, the generator learns to ignore the unspecified labels, which are zero vectors, and to focus on the explicitly given label. The structure of the generator is exactly the same as in training with a single dataset, except for the dimension of the input label c̃. On the other hand, we extend the auxiliary classifier of the discriminator to generate probability distributions over labels for all datasets. Then, we train the model in a multi-task learning setting, where the discriminator tries to minimize only the classification error associated to the known label. For example, when training with images in CelebA, the discriminator minimizes only the classification errors related to CelebA attributes, and not the facial expressions related to RaFD. Under these settings, by alternating between CelebA and RaFD, the discriminator learns all of the discriminative features for both datasets, and the generator learns to control all the labels in both datasets.
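The unified label of Eq. (7) can be sketched as follows for n = 2 (CelebA attributes and RaFD expressions). The label dimensions are assumptions chosen for illustration, not values from the text.

```python
# Minimal sketch of the unified label c_tilde = [c1, c2, m] from Eq. (7).
import torch

def unified_label(celeba_attrs=None, rafd_onehot=None, n_celeba=5, n_rafd=8):
    """Exactly one of celeba_attrs / rafd_onehot is expected to be given."""
    c1 = celeba_attrs if celeba_attrs is not None else torch.zeros(n_celeba)
    c2 = rafd_onehot if rafd_onehot is not None else torch.zeros(n_rafd)
    # mask m is a one-hot over the n = 2 datasets, marking which label is known
    m = torch.tensor([1.0, 0.0]) if celeba_attrs is not None else torch.tensor([0.0, 1.0])
    return torch.cat([c1, c2, m])
```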
3) Visualization of Representation
Figure 4. Visualization of the representation space. Each point indicates a representation z of an image x in the two dimensional space
and its color indicates the label y. Two left figures are for our method DIR-GAN and two right figures are for the naive model DeepAll.
The StarGAN (Choi et al., 2018) model implementation
is taken from the authors’ original source code with no
significant modifications. For each set of source domains,
we train the StarGAN model for 100,000 iterations with a
minibatch of 16 images per iteration.
The code for all of our experiments will be released for
reproducibility. Please also refer to the source code for any
other architecture and implementation details.
As shown in Figure 4, the representation learned by our method aligns both the marginal distribution (i.e., the general distribution of the points) and the conditional distribution (for example, the distributions of blue points and green points).
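A Figure-4-style plot can be produced with a simple scatter of the representations. The sketch below assumes the two plotted coordinates are already available (either a two-dimensional z or a 2-D projection such as t-SNE); that choice is an assumption, not a detail given in the text.

```python
# Minimal sketch of a representation-space scatter plot colored by label y.
import matplotlib.pyplot as plt

def plot_representation(z_2d, y):
    """z_2d: array of shape (N, 2); y: array of N integer class labels."""
    plt.scatter(z_2d[:, 0], z_2d[:, 1], c=y, cmap="tab10", s=5)
    plt.xlabel("z[0]")
    plt.ylabel("z[1]")
    plt.title("Representation space colored by label y")
    plt.show()
```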
PACS and OfficeHome. To the best of our knowledge,
domain invariant representation learning methods have not
been applied widely and successfully for real-world com-
puter vision datasets (e.g., PACS and OfficeHome) with
very deep neural networks such as Resnet.
Tuan Nguyen 1 Toan Tran 2 Yarin Gal 1 Atılım Güneş Baydin 1 Abstract Domain generalization refers to the problem where we aim to train a model on data from a set of source domains so that the model can gen- eralize to unseen target domains. Naively training a model on the aggregate set of data (pooled from all source domains) has been shown to perform suboptimally, since the information learned by that model might be domain-specific and general- ize imperfectly to target domains. To tackle this problem, a predominant approach is to find and learn some domain-invariant information in order to use it for the prediction task. In this paper, we propose a theoretically grounded method to learn a domain-invariant representation by enforcing the representation network to be invariant under all transformation functions among domains. We also show how to use generative adversarial net- works to learn such domain transformations to implement our method in practice. We demon- strate the effectiveness of our method on several widely used datasets for the domain generaliza- tion problem, on all of which we achieve compet- itive results with state-of-the-art models. 1. Introduction Domain generalization refers to the machine learning sce- !"#$ !"%$ !"#$ !"%$ !"#$%&'()'% & #' # ( *%+,'-"."/'%&0%-$+%&1'2 !"#$%&'3)'% & #' # ( *%+,'-"."/'%&0%-$+%&1'2 Figure 1. An example of two domains. For each domain, x is uniformly distributed on the outer circle (radius 2 for domain 1 and radius 3 for domain 2), with the color indicating class label y. After the transformation z = x/||x||2, the marginal of z is aligned (uniformly distributed on the unit circle for both domains), but the conditional p(y|z) is not aligned. Thus, using this representation for predicting y would not generalize well across domains. In the representation learning framework, the prediction y = f(x), where x is data and y is a label, is obtained as a composition y = h ◦ g(x) of a deep representation network z = g(x), where z is a learned representation of data x, and a smaller classifier y = h(z), predicting label y given representation z, both of which are shared across domains. Current “domain-invariance”-based methods in domain gen- eralization focus on either the marginal distribution align- ment (Muandet et al., 2013) or the conditional distribution alignment (Li et al., 2018b;c), which are still prone to distri- v:2102.05082v2 [cs.LG] 14 Feb 2021 1) Domain-invariant representation function이�존재하는가? Q 2) 가 Domain-invariant 한가? d,d' gθ(x) ( - =0 ) gθ f (x)
  • 4. Domain Invariant Representation Learning with Domain Density Transformations Theorem 1; Domain-invariant representation function이�존재하는가? Theorem 1. The invariance of p(y|d) across domains is the necessary and sufficient condition for the existence of a domain-invariant representation (that aligns both the marginal and conditional distribution). ‘p(y|d)의 도메인에 따른 불변함’과 ‘ Domain-invariant representation(function)이 존재하는 것’은 동치이다. p(y,z|d) p(y|z,d) p(z|d) = = = p(y|z,d') p(z|d') p(y,z|d') p(y|d) ∴ = p(y|d') marginalizing ∵ over z ‘ Domain-invariant representation(function)이 존재하는 것’ ‘p(y|d)의 도메인에 따른 불변함’ tion (Khosla et al., 2012; Muandet et al., 2013; Ghifary et al., 2015) and domain adaptation (Zhao et al., 2019; Zhang et al., 2019; Combes et al., 2020; Tanwani, 2020) is that, in do- main generalization, the learner does not have access to (even a small amount of) data of the target domain, making the problem much more challenging. One of the most common domain generalization approaches is to learn an invariant representation across domains, aim- ing at a good generalization performance on target domains. 1 University of Oxford 2 VinAI Research. Correspondence to: A. Tuan Nguyen <tuan.nguyen@cs.ox.ac.uk>. alignment refers to making the representation distribution p(z) to be the same across domains. This is essential since if p(z) for the target domain is different from that of source domains, the classification network h(z) would face out- of-distribution data because the representation z it receives as input at test time would be different from the ones it was trained with in source domains. Conditional alignment refers to aligning p(y|z), the conditional distribution of the label given the representation, since if this conditional for the target domain is different from that of the source domains, the classification network (trained on the source domains) would give inaccurate predictions at test time. The formal definition of the two alignments is discussed in Section 3. tion (Khosla et al., 2012; Muandet et al., 2013; Ghifary et al., 2015) and domain adaptation (Zhao et al., 2019; Zhang et al., 2019; Combes et al., 2020; Tanwani, 2020) is that, in do- main generalization, the learner does not have access to (even a small amount of) data of the target domain, making the problem much more challenging. One of the most common domain generalization approaches is to learn an invariant representation across domains, aim- ing at a good generalization performance on target domains. 1 University of Oxford 2 VinAI Research. Correspondence to: A. Tuan Nguyen <tuan.nguyen@cs.ox.ac.uk>. alignment refers to making the representation distribution p(z) to be the same across domains. This is essential since if p(z) for the target domain is different from that of source domains, the classification network h(z) would face out- of-distribution data because the representation z it receives as input at test time would be different from the ones it was trained with in source domains. Conditional alignment refers to aligning p(y|z), the conditional distribution of the label given the representation, since if this conditional for the target domain is different from that of the source domains, the classification network (trained on the source domains) would give inaccurate predictions at test time. The formal definition of the two alignments is discussed in Section 3. Domain Invariant Representation Learning with Domain Density Transformations ze this objective, while the discriminator D tries to ze it. 
n Classification Loss. For a given input image x 3.2. Training with Multiple Datasets An important advantage of StarGAN is that it simulta- neously incorporates multiple datasets containing different minimize this objective, while the discriminator D tries to maximize it. Domain Classification Loss. For a given input image x and a target domain label c, our goal is to translate x into an output image y, which is properly classified to the target 3.2. Training with Multiple Datasets An important advantage of StarGAN is neously incorporates multiple datasets conta types of labels, so that StarGAN can contro → → 1) A ⇔ B, A⇒B and B⇒A ‘ Domain-invariant representation(function)이 존재하는 것’ ‘p(y|d)의 도메인에 따른 불변함’ If is unchanged w.r.t. the domain d, then we can always find a domain invariant representation(This is trivial). p(y|d) For example, for the deterministic case(that maps all x to 0), or for the probabilistic case. p(z|x) (z|x) = δ0 p(z|x) (z;0,1) = N → 2) Domain-invariant representation(function)이 존재한다 Marginal and Conditional Distribution Alignment를 만족하는 표현 z 가 존재한다.
• 5. Domain Invariant Representation Learning with Domain Density Transformations

Theorem 2: is $g_\theta(x)$ domain-invariant? The invariance condition is $g_\theta(x) - g_\theta(f_{d,d'}(x)) = 0$ for every $x$ and every pair of domains $d, d'$.

Theorem 2. Given an invertible and differentiable function $f_{d,d'}$ that transforms the data density from domain $d$ to domain $d'$ (with the inverse $f_{d',d}$ that transforms the data density from $d'$ to $d$, as described above), and assuming that the representation $z$ satisfies $p(z|x) = p(z|f_{d,d'}(x))$ for all $x$ (Eq. 7 of the paper), then $z$ aligns both the marginal and the conditional distributions of the data for domains $d$ and $d'$.

Figure 3 (caption): Domain density transformation. If we know the function $f_{1,2}$ that transforms the data density from domain 1 to domain 2, we can learn a domain-invariant representation network $g_\theta(x)$ by enforcing it to be invariant under $f_{1,2}$, i.e., $g_\theta(x_1) = g_\theta(x_2)$ for any $x_2 = f_{1,2}(x_1)$.

Proof. Let $J_{f_{d',d}}(x')$ denote the Jacobian matrix of $f_{d',d}$ evaluated at $x'$. The density transformations give $p(x|d) = p(x'|d')\,|\det J_{f_{d',d}}(x')|^{-1}$ and $p(x|y,d) = p(x'|y,d')\,|\det J_{f_{d',d}}(x')|^{-1}$ for $x = f_{d',d}(x')$ (Eqs. 6 and 4 of the paper), and the invariance condition gives $p(z|f_{d',d}(x')) = p(z|x')$ (Eq. 7).

i) Marginal alignment: for all $z$,
$$p(z|d) = \int p(x|d)\,p(z|x)\,dx = \int p(f_{d',d}(x')|d)\,p(z|f_{d',d}(x'))\,\big|\det J_{f_{d',d}}(x')\big|\,dx'$$
(by the variable substitution $x = f_{d',d}(x')$ in the multiple integral)
$$= \int p(x'|d')\,\big|\det J_{f_{d',d}}(x')\big|^{-1}\,p(z|x')\,\big|\det J_{f_{d',d}}(x')\big|\,dx' = \int p(x'|d')\,p(z|x')\,dx' = p(z|d'). \quad (8)$$

ii) Conditional alignment: for all $z, y$,
$$p(z|y,d) = \int p(x|y,d)\,p(z|x)\,dx = \int p(f_{d',d}(x')|y,d)\,p(z|f_{d',d}(x'))\,\big|\det J_{f_{d',d}}(x')\big|\,dx' = \int p(x'|y,d')\,p(z|x')\,dx' = p(z|y,d'). \quad (9)$$

Note that
$$p(y|z,d) = \frac{p(y,z|d)}{p(z|d)} = \frac{p(y|d)\,p(z|y,d)}{p(z|d)}. \quad (10)$$
Since $p(y|d) = p(y) = p(y|d')$ (the label distribution is assumed invariant across domains), $p(z|y,d) = p(z|y,d')$ and $p(z|d) = p(z|d')$, we have
$$p(y|z,d) = \frac{p(y|d')\,p(z|y,d')}{p(z|d')} = p(y|z,d'). \quad (11)$$

This theorem indicates that, if we can find the functions $f$ that transform the data densities among the domains, we can learn a domain-invariant representation $z$ by encouraging the representation to be invariant under all of these transformations (the idea illustrated in Figure 3). A toy numerical check of the marginal-alignment claim follows.
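To make the marginal-alignment argument concrete, here is a minimal numerical sketch (not from the paper): two 1-D domains whose densities are mirror images of each other, the transformation $f_{1,2}(x) = -x$, and a representation $g(x) = |x|$ that is invariant under $f_{1,2}$. The domains, transformation, and representation are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Domain 1: x ~ N(+2, 1); Domain 2: x ~ N(-2, 1).
# f_{1,2}(x) = -x transforms the domain-1 density into the domain-2 density
# (|det J| = 1), and g(x) = |x| satisfies g(x) = g(f_{1,2}(x)) for every x.
x1 = rng.normal(loc=+2.0, scale=1.0, size=200_000)
x2 = rng.normal(loc=-2.0, scale=1.0, size=200_000)

f_12 = lambda x: -x          # domain density transformation
g = np.abs                   # representation invariant under f_12

# Invariance of g under the transformation (holds exactly by construction).
assert np.allclose(g(x1), g(f_12(x1)))

# Marginal alignment: the distribution of z = g(x) matches across domains.
z1, z2 = g(x1), g(x2)
quantiles = np.linspace(0.05, 0.95, 19)
print(np.max(np.abs(np.quantile(z1, quantiles) - np.quantile(z2, quantiles))))
# -> a small number (sampling noise only), i.e. p(z|d=1) is close to p(z|d=2).
```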
We therefore use the following learning objective to learn a domain-invariant representation $z = g_\theta(x)$:
$$\mathbb{E}_{d}\,\mathbb{E}_{p(x,y|d)}\Big[\, l(y, g_\theta(x)) + \mathbb{E}_{d'}\,\|g_\theta(x) - g_\theta(f_{d,d'}(x))\|_2^2 \,\Big], \quad (12)$$
where $l(y, g_\theta(x))$ is the prediction loss of a network that predicts $y$ given $z = g_\theta(x)$, and the second term enforces the invariance condition of Eq. 7. Assuming a set of $K$ source domains $D_s = \{d_1, d_2, ..., d_K\}$, the objective in Eq. 12 becomes
$$\mathbb{E}_{d,d' \in D_s,\, p(x,y|d)}\Big[\, l(y, g_\theta(x)) + \|g_\theta(x) - g_\theta(f_{d,d'}(x))\|_2^2 \,\Big] \quad (13)$$
(see the PyTorch sketch after this paragraph).

In practice, the functions $f$ that transform the data distributions between domains must themselves be learned, and one can use several generative modeling frameworks, e.g., normalizing flows (Grover et al., 2020) or GANs (Zhu et al., 2017; Choi et al., 2018; 2020). One advantage of normalizing flows is that the transformation is invertible by design of the neural network and the determinant of its Jacobian can be computed efficiently; however, since the Jacobian is no longer needed once the generative model has been trained, the paper uses GANs to inherit their richer network capacity, and in particular the StarGAN (Choi et al., 2018) model, which is designed for image domain transformations.
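A minimal PyTorch sketch of the Eq. 13 objective, assuming a representation network `g_theta`, a linear classifier head, and an already trained (and frozen) domain transformation `transform(x, d, d_prime)` are available; the names and the `lam` weight are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def dir_loss(g_theta, classifier, transform, x, y, d, d_prime, lam=1.0):
    """One training step of the Eq. 13 objective (sketch).

    g_theta    : representation network, x -> z
    classifier : linear head, z -> class logits
    transform  : frozen domain translator f_{d,d'}, (x, d, d') -> x'
    x, y       : a minibatch drawn from source domain d
    d, d_prime : source domain and a randomly chosen destination domain
    lam        : weight of the invariance penalty (assumed hyper-parameter;
                 Eq. 13 itself uses coefficient 1)
    """
    z = g_theta(x)
    cls_loss = F.cross_entropy(classifier(z), y)           # l(y, g_theta(x))

    with torch.no_grad():                                   # f is fixed after GAN training
        x_translated = transform(x, d, d_prime)             # f_{d,d'}(x)
    z_translated = g_theta(x_translated)

    inv_loss = ((z - z_translated) ** 2).sum(dim=1).mean()  # ||g(x) - g(f_{d,d'}(x))||_2^2
    return cls_loss + lam * inv_loss
```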
• 6. Domain Invariant Representation Learning with Domain Density Transformations

5. Domain Generalization with Generative Adversarial Networks (StarGAN; PR-152)

StarGAN recap. Earlier image-to-image translation models handle one pair of domains at a time: pix2pix learns the task in a supervised manner with conditional GANs, combining an adversarial loss with an L1 loss and therefore requiring paired samples, while CycleGAN and DiscoGAN remove the need for paired data and preserve key attributes between input and output through a cycle-consistency loss. All of these frameworks learn the relation between only two domains at a time, so a separate model must be trained for every pair of domains. StarGAN instead learns the mappings among multiple domains with a single generator G and a single discriminator D (a star topology over the domains, in contrast to cross-domain models built for every pair). The generator translates an input image x into an output image conditioned on a randomly sampled target domain label c, G(x, c) → y, and an auxiliary classifier on top of D allows one discriminator to control multiple domains.

(Figure: StarGAN overview. (a) D learns to distinguish real from fake images and to classify real images into their corresponding domains. (b) G takes as input the image together with the target domain label, which is spatially replicated and concatenated with the input image, and generates a fake image. (c) G tries to reconstruct the original image from the fake image given the original domain label. (d) G tries to generate images indistinguishable from real images and classifiable as the target domain by D.)

Adversarial loss. To make the generated images indistinguishable from real images,
$$\mathcal{L}_{adv} = \mathbb{E}_x[\log D_{src}(x)] + \mathbb{E}_{x,c}[\log(1 - D_{src}(G(x, c)))], \quad (1)$$
where $D_{src}(x)$ is the probability distribution over sources given by D; G tries to minimize this objective while D tries to maximize it.

Domain classification loss. For an input image x and a target domain label c, the goal is to translate x into an output image that is properly classified as the target domain c. An auxiliary classifier is added on top of D and the domain classification loss is imposed when optimizing both D and G, decomposed into a term on real images used to optimize D,
$$\mathcal{L}^r_{cls} = \mathbb{E}_{x,c'}[-\log D_{cls}(c'|x)], \quad (2)$$
where $c'$ is the original domain of the real image x (given by the training data), and a term on fake images used to optimize G,
$$\mathcal{L}^f_{cls} = \mathbb{E}_{x,c}[-\log D_{cls}(c|G(x, c))]. \quad (3)$$

Reconstruction loss. The adversarial and classification losses alone do not guarantee that the translated image preserves the content of its input while changing only the domain-related part, so a cycle-consistency loss is applied to the generator:
$$\mathcal{L}_{rec} = \mathbb{E}_{x,c,c'}[\|x - G(G(x, c), c')\|_1], \quad (4)$$
where the single generator is used twice, first to translate the original image into the target domain and then to reconstruct the original image from the translated one, with the L1 norm as the reconstruction loss.

Full objective. The objectives for D and G are
$$\mathcal{L}_D = -\mathcal{L}_{adv} + \lambda_{cls}\,\mathcal{L}^r_{cls}, \quad (5)$$
$$\mathcal{L}_G = \mathcal{L}_{adv} + \lambda_{cls}\,\mathcal{L}^f_{cls} + \lambda_{rec}\,\mathcal{L}_{rec}, \quad (6)$$
where $\lambda_{cls}$ and $\lambda_{rec}$ control the relative importance of the domain classification and reconstruction losses compared to the adversarial loss ($\lambda_{cls} = 1$ and $\lambda_{rec} = 10$ in the StarGAN experiments). A compact sketch of how these losses fit together is given after this subsection.

Training with multiple datasets. StarGAN can also jointly use multiple datasets with different label sets, in which case the label information is only partially known to each dataset: CelebA has attribute labels such as hair color and gender but no expression labels, and vice versa for RaFD. This is problematic because the complete label vector c is required when reconstructing the input image from the translated image (Eq. 4). A mask vector m, an n-dimensional one-hot vector with n the number of datasets, lets StarGAN ignore unspecified labels and focus on the labels explicitly provided by a particular dataset, through the unified label
$$\tilde{c} = [c_1, ..., c_n, m], \quad (7)$$
where [·] denotes concatenation and $c_i$ is the label vector of the i-th dataset (a binary vector for binary attributes or a one-hot vector for categorical attributes); the remaining unknown labels are set to zero vectors. Using $\tilde{c}$ as the generator input, G learns to ignore the unspecified (zero) labels, and the discriminator's auxiliary classifier is extended to produce probability distributions over the labels of all datasets while minimizing only the classification error of the known labels. By alternating between CelebA and RaFD, D learns the discriminative features of both datasets and G learns to control all the labels in both.
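A compact sketch of how the StarGAN objectives in Eqs. (1)-(6) fit together, assuming generator and discriminator modules with interfaces `G(x, c)` and `D(x) -> (src_logit, cls_logits)`; the non-saturating BCE form used below follows the vanilla adversarial loss written in Eq. (1), while actual StarGAN implementations often substitute a Wasserstein loss with gradient penalty.

```python
import torch
import torch.nn.functional as F

def stargan_losses(G, D, x_real, c_org, c_trg, lambda_cls=1.0, lambda_rec=10.0):
    """Sketch of the StarGAN objectives, Eqs. (1)-(6).

    G(x, c) -> fake image; D(x) -> (src_logit, cls_logits).
    c_org : original domain labels of x_real (class indices)
    c_trg : randomly sampled target domain labels (class indices)
    """
    # ----- discriminator side: Eq. (5) = -L_adv + lambda_cls * L_cls^r -----
    src_real, cls_real = D(x_real)
    x_fake = G(x_real, c_trg).detach()                       # no gradient into G here
    src_fake, _ = D(x_fake)
    loss_adv_d = F.binary_cross_entropy_with_logits(src_real, torch.ones_like(src_real)) \
               + F.binary_cross_entropy_with_logits(src_fake, torch.zeros_like(src_fake))
    loss_cls_real = F.cross_entropy(cls_real, c_org)         # Eq. (2)
    loss_d = loss_adv_d + lambda_cls * loss_cls_real         # Eq. (5)

    # ----- generator side: Eq. (6) = L_adv + lambda_cls * L_cls^f + lambda_rec * L_rec -----
    x_fake = G(x_real, c_trg)
    src_fake, cls_fake = D(x_fake)
    loss_adv_g = F.binary_cross_entropy_with_logits(src_fake, torch.ones_like(src_fake))
    loss_cls_fake = F.cross_entropy(cls_fake, c_trg)         # Eq. (3)
    x_rec = G(x_fake, c_org)                                  # translate back to the original domain
    loss_rec = (x_real - x_rec).abs().mean()                  # Eq. (4), L1 reconstruction
    loss_g = loss_adv_g + lambda_cls * loss_cls_fake + lambda_rec * loss_rec  # Eq. (6)

    return loss_d, loss_g
```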
Using StarGAN to learn the f's. The goal here is to learn a unified network G that transforms the data density among multiple domains. In particular, the network G(x, d, d') (i.e., G conditioned on the image x and the two domains d, d') transforms an image x from domain d to domain d'. Different from the original StarGAN model, which takes only the image x and the desired destination domain d' as its input, in this implementation both the original domain d and the desired destination domain d' are fed to the generator together with the original image x. The generator's goal is to fool a discriminator D into thinking that the transformed image belongs to the destination domain d'; in other words, the equilibrium state of StarGAN, in which G completely fools D, is reached when G successfully transforms the data density of the original domain into that of the destination domain. After training, G(·, d, d') is used as the function f_{d,d'}(·) described in the previous section, and the representation is learned via the objective in Eq. 13.

Three important loss functions of this StarGAN architecture are:
• The domain classification loss $\mathcal{L}_{cls}$, which encourages the generator G to generate images that correctly belong to the desired destination domain d'.
• The adversarial loss $\mathcal{L}_{adv}$, the classification loss of a discriminator D that tries to distinguish real images from the fake images generated by G. The equilibrium state of StarGAN is when G completely fools D, which means the distribution of the generated images (via G(x, d, d'), x ~ p(x|d)) becomes the distribution of the real images of the destination domain p(x'|d'). This is exactly the objective here, i.e., to learn a function that transforms the domains' densities.
• The reconstruction loss $\mathcal{L}_{rec} = \mathbb{E}_{x,d,d'}[\|x - G(x', d', d)\|_1]$, where $x' = G(x, d, d')$, which ensures that the transformations preserve the image's content. This also aligns with our interest, since we want G(·, d', d) to be the inverse of G(·, d, d'), which drives $\mathcal{L}_{rec}$ to zero.

One can further enforce the generator G to transform the data distribution within each class y (e.g., p(x|y, d) to p(x'|y, d') for all y) by sampling each minibatch with data from the same class y, so that the discriminator distinguishes the transformed images from real images of class y and domain d'. In practice this constraint can be relaxed, and the generator almost always transforms the image within its original class y. After training the StarGAN model, the generator G(·, d, d') is used as the f_{d,d'}(·) function and a domain-invariant representation is learned via the objective in Eq. 13. This implementation of the method is named DIR-GAN (domain-invariant representation learning with generative adversarial networks). A sketch of the (d, d')-conditioned generator input follows below.
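The paper does not spell out the conditioning mechanism beyond feeding both domains to G; the sketch below simply follows the StarGAN convention quoted above (the domain label is spatially replicated and concatenated with the input image), extended to two one-hot labels. The helper name and interface are assumptions for illustration.

```python
import torch

def make_generator_input(x, d, d_prime, num_domains):
    """Condition G on both the source and the destination domain (sketch).

    x       : batch of images, shape (B, C, H, W)
    d       : source domain indices, shape (B,)
    d_prime : destination domain indices, shape (B,)
    Each one-hot domain label is spatially replicated to H x W and
    concatenated with the image along the channel axis, as in StarGAN.
    """
    B, _, H, W = x.shape
    d_onehot = torch.nn.functional.one_hot(d, num_domains).float()
    d_prime_onehot = torch.nn.functional.one_hot(d_prime, num_domains).float()
    labels = torch.cat([d_onehot, d_prime_onehot], dim=1)            # (B, 2 * num_domains)
    label_maps = labels[:, :, None, None].expand(B, 2 * num_domains, H, W)
    return torch.cat([x, label_maps], dim=1)                          # (B, C + 2K, H, W)

# Usage (illustrative): x_translated = G(make_generator_input(x, d, d_prime, K))
```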
• 7. Domain Invariant Representation Learning with Domain Density Transformations

6. Experiments / Results

1) Datasets. The method is evaluated on three datasets commonly used in the domain generalization literature.

Rotated MNIST. In this dataset by Ghifary et al. (2015), 1,000 MNIST images (100 per class) (LeCun & Cortes, 2010) are chosen to form the first domain (denoted M0); rotations of 15°, 30°, 45°, 60° and 75° are then applied to create five additional domains, denoted M15, M30, M45, M60 and M75. The task is classification with ten classes (digits 0 to 9). A small sketch of this domain construction is given after the dataset descriptions.

PACS (Li et al., 2017) contains 9,991 images from four different domains: art painting, cartoon, photo, sketch. The task is classification with seven classes.

OfficeHome (Venkateswara et al., 2017) has 15,500 images of daily objects from four domains: art, clipart, product and real. There are 65 classes in this classification dataset.
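A minimal sketch of how the six Rotated MNIST domains described above can be constructed with torchvision; the subsampling rule and interpolation settings are assumptions, not the authors' released code.

```python
import torch
from torchvision import datasets
from torchvision.transforms import functional as TF

def build_rotated_mnist(root="./data", per_class=100, angles=(0, 15, 30, 45, 60, 75)):
    """Return {angle: (images, labels)} for the six Rotated MNIST domains (sketch)."""
    mnist = datasets.MNIST(root, train=True, download=True)
    images = mnist.data.float().unsqueeze(1) / 255.0       # (N, 1, 28, 28)
    labels = mnist.targets

    # Pick the first `per_class` images of each digit to form the base set M0.
    idx = torch.cat([torch.where(labels == c)[0][:per_class] for c in range(10)])
    base_x, base_y = images[idx], labels[idx]

    domains = {}
    for angle in angles:                                    # M0, M15, ..., M75
        rotated = torch.stack([TF.rotate(img, float(angle)) for img in base_x])
        domains[angle] = (rotated, base_y)
    return domains
```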
2) Results. Reported numbers are mean accuracy and standard deviation among 5 runs in the leave-one-domain-out setting.

Table 1. Rotated MNIST leave-one-domain-out experiment.
Model                               M0          M15         M30         M45         M60         M75         Average
HIR (Wang et al., 2020)             90.34       99.75       99.40       96.17       99.25       91.26       96.03
DIVA (Ilse et al., 2020)            93.5        99.3        99.1        99.2        99.3        93.0        97.2
DGER (Zhao et al., 2020)            90.09       99.24       99.27       99.31       99.45       90.81       96.36
DA (Ganin et al., 2016)             86.7        98.0        97.8        97.4        96.9        89.1        94.3
LG (Shankar et al., 2018)           89.7        97.8        98.0        97.1        96.6        92.1        95.3
HEX (Wang et al., 2019)             90.1        98.9        98.9        98.8        98.3        90.0        95.8
ADV (Wang et al., 2019)             89.9        98.6        98.8        98.7        98.6        90.4        95.2
DIR-GAN (ours)                      97.2(±0.3)  99.4(±0.1)  99.3(±0.1)  99.3(±0.1)  99.2(±0.1)  97.1(±0.3)  98.6

Table 2. PACS leave-one-domain-out experiment.
Model                               Backbone   Art Painting  Cartoon      Photo        Sketch       Average
DGER (Zhao et al., 2020)            Resnet18   80.70         76.40        96.65        71.77        81.38
JiGen (Carlucci et al., 2019)       Resnet18   79.42         75.25        96.03        71.35        79.14
MLDG (Li et al., 2018a)             Resnet18   79.50         77.30        94.30        71.50        80.70
MetaReg (Balaji et al., 2018)       Resnet18   83.70         77.20        95.50        70.40        81.70
CSD (Piratla et al., 2020)          Resnet18   78.90         75.80        94.10        76.70        81.40
DMG (Chattopadhyay et al., 2020)    Resnet18   76.90         80.38        93.35        75.21        81.46
DIR-GAN (ours)                      Resnet18   82.56(±0.4)   76.37(±0.3)  95.65(±0.5)  79.89(±0.2)  83.62

Table 3. OfficeHome leave-one-domain-out experiment.
Model                               Backbone   Art          ClipArt      Product      Real         Average
D-SAM (D'Innocente & Caputo, 2018)  Resnet18   58.03        44.37        69.22        71.45        60.77
JiGen (Carlucci et al., 2019)       Resnet18   53.04        47.51        71.47        72.79        61.20
DIR-GAN (ours)                      Resnet18   56.69(±0.4)  50.49(±0.2)  71.32(±0.4)  74.23(±0.5)  63.18
Experimental setting. For all datasets, a "leave-one-domain-out" experiment is performed: one domain is chosen as the target domain, the model is trained on all remaining domains and evaluated on the chosen domain. Following standard practice, 90% of the available data is used for training and 10% for validation, except for the Rotated MNIST experiment, which uses no validation set and simply reports the performance on the test domain after the last epoch.

For the Rotated MNIST dataset, the representation network $g_\theta$ consists of two 3x3 convolutional layers and a fully connected layer producing a 64-dimensional representation z, with a single linear layer mapping z to the ten output classes (the deterministic version of the network used by Ilse et al. (2020)). The network is trained for 500 epochs with the Adam optimizer (Kingma & Ba, 2014) using learning rate 0.001 and minibatch size 64. A sketch of this architecture follows below.
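A sketch of the Rotated MNIST representation network just described (two 3x3 convolutions, a fully connected layer to a 64-dimensional z, and a linear classification head); pooling, channel widths and activations are not specified in the text and are assumptions here.

```python
import torch.nn as nn

class MnistRepresentation(nn.Module):
    """Two 3x3 conv layers + FC -> 64-dim z, plus a linear head to 10 classes (sketch)."""
    def __init__(self, channels=32, z_dim=64, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(channels, channels, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
        )
        self.to_z = nn.LazyLinear(z_dim)           # g_theta(x): 64-dimensional representation z
        self.head = nn.Linear(z_dim, num_classes)  # single linear layer to the 10 digit classes

    def forward(self, x):
        z = self.to_z(self.features(x))
        return self.head(z), z
```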
The StarGAN (Choi et al., 2018) model implementation is taken from the authors' original source code with no significant modifications. For each set of source domains, we train the StarGAN model for 100,000 iterations with a minibatch of 16 images per iteration. The code for all of our experiments will be released for reproducibility; please also refer to the source code for any other architecture and implementation details.

From the StarGAN paper (Choi et al., 2018), the losses used by the transformation network are:

Reconstruction loss. Minimizing the adversarial and domain classification losses trains G to generate images that are realistic and classified to the correct target domain, but this alone does not guarantee that translated images preserve the content of the input while changing only the domain-related part. To alleviate this, a cycle-consistency loss is applied to the generator:

$\mathcal{L}_{rec} = \mathbb{E}_{x, c, c'}\left[\lVert x - G(G(x, c), c') \rVert_1\right]$,   (4)

where G takes in the translated image G(x, c) and the original domain label c' and tries to reconstruct the original image x, using the L1 norm as the reconstruction loss. A single generator is used twice: first to translate the original image into the target domain, and then to reconstruct the original image from the translated one.

Full objective. The objectives to optimize D and G are written, respectively, as

$\mathcal{L}_D = -\mathcal{L}_{adv} + \lambda_{cls}\,\mathcal{L}^{r}_{cls}$,   (5)
$\mathcal{L}_G = \mathcal{L}_{adv} + \lambda_{cls}\,\mathcal{L}^{f}_{cls} + \lambda_{rec}\,\mathcal{L}_{rec}$,   (6)

where λ_cls and λ_rec are hyper-parameters that control the relative importance of the domain classification and reconstruction losses compared to the adversarial loss; λ_cls = 1 and λ_rec = 10 are used in all experiments.
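As a rough illustration of Eqs. (4)-(6), the sketch below assumes a PyTorch StarGAN-style interface in which G(x, c) translates x to domain label c and D(x) returns a pair (real/fake logits, domain logits); the cross-entropy adversarial term is a simplification of the adversarial objective used in the actual StarGAN code, so this is a hedged sketch rather than a reference implementation.

```python
import torch
import torch.nn.functional as F

def stargan_objectives(G, D, x, c_org, c_trg, lambda_cls=1.0, lambda_rec=10.0):
    x_fake = G(x, c_trg)          # translate x into the target domain c_trg
    x_rec = G(x_fake, c_org)      # translate back using the original domain label c_org

    # Discriminator side: real vs. fake, plus domain classification on real images.
    real_src, real_cls = D(x)
    fake_src_d, _ = D(x_fake.detach())     # detach so D's loss does not update G
    d_adv = F.binary_cross_entropy_with_logits(real_src, torch.ones_like(real_src)) \
          + F.binary_cross_entropy_with_logits(fake_src_d, torch.zeros_like(fake_src_d))
    d_cls = F.binary_cross_entropy_with_logits(real_cls, c_org)       # L^r_cls
    loss_D = d_adv + lambda_cls * d_cls                               # cf. Eq. (5)

    # Generator side: fool D, hit the target domain, and reconstruct the input.
    fake_src_g, fake_cls_g = D(x_fake)
    g_adv = F.binary_cross_entropy_with_logits(fake_src_g, torch.ones_like(fake_src_g))
    g_cls = F.binary_cross_entropy_with_logits(fake_cls_g, c_trg)     # L^f_cls
    g_rec = torch.mean(torch.abs(x - x_rec))                          # Eq. (4), L1 reconstruction
    loss_G = g_adv + lambda_cls * g_cls + lambda_rec * g_rec          # cf. Eq. (6)
    return loss_D, loss_G
```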
Mask vector. To allow joint training on multiple datasets whose label sets only partially overlap, StarGAN uses a unified label

$\tilde{c} = [c_1, \ldots, c_n, m]$,   (7)

where [·] refers to concatenation, c_i represents a vector of the labels of the i-th dataset, and m is a mask vector indicating which dataset the known label comes from. The known label c_i can be represented either as a binary vector (for binary attributes) or as a one-hot vector (for categorical attributes); the remaining n − 1 unknown labels are simply set to zero. In the StarGAN experiments the CelebA and RaFD datasets are used, so n is two.

Training strategy. When training StarGAN with multiple datasets, the domain label c̃ defined in Eq. (7) is given as input to the generator. The generator thereby learns to ignore the unspecified (zero-vector) labels and to focus on the explicitly given label; its structure is exactly the same as when training with a single dataset, except for the dimension of the input label c̃. The auxiliary classifier of the discriminator is extended to produce probability distributions over the labels of all datasets, and the model is trained in a multi-task learning setting in which the discriminator minimizes only the classification error associated with the known label. For example, when training with CelebA images, the discriminator minimizes only the classification errors for CelebA attributes and not those for RaFD facial expressions; by alternating between the datasets, the discriminator learns discriminative features for both, and the generator learns to control all the labels in both datasets.
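A small sketch of how the unified label of Eq. (7) could be assembled, assuming two datasets; the per-dataset label widths (5 and 8) are placeholder values and not quantities taken from this document.

```python
import torch

def unified_label(known_label, dataset_idx, dims=(5, 8)):
    """Concatenate per-dataset label vectors (zeros for unknown datasets) and a mask."""
    parts = []
    for i, dim in enumerate(dims):
        if i == dataset_idx:
            parts.append(known_label)        # the explicitly given label c_i
        else:
            parts.append(torch.zeros(dim))   # unknown labels are set to zero
    mask = torch.zeros(len(dims))
    mask[dataset_idx] = 1.0                  # mask vector m: one-hot over datasets
    return torch.cat(parts + [mask])

# Example: a 5-dim binary attribute vector from the first dataset.
c_tilde = unified_label(torch.tensor([1., 0., 1., 0., 0.]), dataset_idx=0)
```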
• 8. Domain Invariant Representation Learning with Domain Density Transformations
3) Visualization of Representation

StarGAN (Choi et al., 2018) is a scalable approach for learning mappings among multiple domains with a single generator. Instead of learning a fixed translation (e.g., black-to-blond hair), the generator takes both an image and domain information as input and learns to flexibly translate the image into the corresponding domain, where the domain information is represented as a label (e.g., a binary or one-hot vector). During training, a target domain label is randomly sampled and the model is trained to translate the input image into that domain, so that at test time the image can be translated into any desired domain.

In particular, the network G(x, d, d') (i.e., G is conditioned on the image x and the two different domains d, d') transforms an image x from domain d to domain d'. Different from the original StarGAN model, which only takes the image x and the desired destination domain d' as its input, in our implementation we feed both the original domain d and the desired destination domain d' together with the image x to the generator G. The generator's goal is to fool a discriminator into thinking that the transformed image belongs to the destination domain d'. In other words, the equilibrium state, in which G completely fools D, is when G transforms the data density of the original domain into that of the destination domain. After training, we use G as the function f_{d,d'}(·) described in the previous section and perform the representation learning via the loss function in Eq. 13 (see the sketch below). Three important loss functions of the StarGAN are the adversarial loss, the domain classification loss L_cls, which encourages the generator G to generate images that correspond to the desired destination domain d', and the reconstruction loss shown above.

Figure 4. Visualization of the representation space. Each point indicates a representation z of an image x in the two-dimensional space, and its color indicates the label y. The two left figures are for our method DIR-GAN and the two right figures are for the naive model DeepAll.

Compared with the naive model, the representations learned by DIR-GAN are visibly better aligned across domains, both in the marginal distribution (the general distribution of the points) and in the conditional distribution (for example, the distributions of the blue points and the green points).
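The following is a minimal sketch of the representation-learning step described above, assuming a PyTorch setup: g_theta is the representation network, classifier the linear head, and G a pre-trained generator whose call signature G(x, d_src, d_dst) matches the conditioning described in the text; the weight lam of the invariance term is an assumed hyper-parameter, and the loss layout is only an approximation of the paper's Eq. 13.

```python
import torch
import torch.nn.functional as F

def dir_gan_step(g_theta, classifier, G, x, y, d_src, d_dst, lam=1.0):
    """One hedged sketch of the invariance objective: task loss plus a penalty that
    keeps g_theta(x) close to g_theta(f_{d,d'}(x))."""
    z = g_theta(x)
    logits = classifier(z)
    task_loss = F.cross_entropy(logits, y)   # l(y, g_theta(x)) with a linear head

    with torch.no_grad():
        x_translated = G(x, d_src, d_dst)    # f_{d,d'}(x): x moved to another source domain

    z_translated = g_theta(x_translated)

    # Encourage invariance under the domain transformation:
    # || g_theta(x) - g_theta(f_{d,d'}(x)) ||_2^2 (mean-squared version)
    invariance_loss = F.mse_loss(z_translated, z)

    return task_loss + lam * invariance_loss
```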
PACS and OfficeHome. To the best of our knowledge, domain invariant representation learning methods have not been applied widely and successfully for real-world computer vision datasets (e.g., PACS and OfficeHome) with very deep neural networks such as Resnet, so the only rel-