Domain Invariant Representation Learning with Domain Density Transformations
A. Tuan Nguyen, Toan Tran, Yarin Gal, Atılım Güneş Baydin, arXiv:2102.05082
PR-320, Presented by Eddie
1. Domain Generalization
2. Domain Generalization and Domain Adaptation
Domain generalization refers to training methods that aim to build a model that is not tied to any particular domain, so that it can cope with previously unseen (out-of-distribution) domains.
The difference is that domain adaptation can obtain information from unlabeled data of the target domain, whereas domain generalization cannot.
Example: train on painting data from the Baroque period and test on painting data from the Modern period. The trained model recognizes "Caravaggio" on the source domain, but what should it predict on the unseen target domain?
(Thanh-Dat Truong, et al., Recognition in Unseen Domains: Domain Generalization via Universal Non-volume Preserving Models)
"Domain generalization is a difficult problem because predictions must be made without any information about the target domain."
3. [Domain Invariance] Marginal and Conditional Alignment
Definition 1 (Marginal Distribution Alignment). The representation z is said to satisfy the marginal distribution alignment condition if p(z|d) is invariant w.r.t. d.
Definition 2 (Conditional Distribution Alignment). The representation z is said to satisfy the conditional distribution alignment condition if p(y|z,d) is invariant w.r.t. d.
4. Proposed Method
The proposed learning objective: given the set of source domains Ds = {d1, d2, ..., dK}, minimize

Ed,d'∈Ds, p(x,y|d) [ l(y, gθ(x)) + ||gθ(x) − gθ(fd,d'(x))||₂² ]

where
- D: the set of possible domains, X: the data space, Y: the label space, Z: the representation space, with d, d' ∈ D, x ∈ X, y ∈ Y, z ∈ Z
- gθ : X → Z: the domain representation function that maps an input x to a representation z
- l(y, gθ(x)): the loss between the prediction (computed from gθ(x)) and the label y
- d and d': a source domain and another (different) source domain
- fd,d' : X → X: the density transformation function that maps an input x of domain d into domain d'
- ||gθ(x) − gθ(fd,d'(x))||₂²: the squared distance between the representation of x and the representation of x after the domain transformation
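To make the objective concrete, below is a minimal PyTorch-style sketch of the per-batch loss. The names (rep_net for gθ, classifier for the prediction head, transform_fn for fd,d', and the weight lam) are illustrative assumptions, not the authors' code.

import torch
import torch.nn.functional as F

def dir_loss(rep_net, classifier, transform_fn, x, y, lam=1.0):
    """Prediction loss plus the invariance penalty of the objective above (sketch).

    rep_net      : g_theta, maps inputs x to representations z
    classifier   : h, predicts class logits from z
    transform_fn : f_{d,d'}, maps x from its source domain d to another domain d'
    lam          : weight on the invariance term (assumed 1.0 by default)
    """
    z = rep_net(x)                               # z = g_theta(x)
    z_t = rep_net(transform_fn(x))               # g_theta(f_{d,d'}(x))
    pred_loss = F.cross_entropy(classifier(z), y)        # l(y, g_theta(x))
    inv_loss = ((z - z_t) ** 2).sum(dim=1).mean()        # ||g_theta(x) - g_theta(f_{d,d'}(x))||_2^2
    return pred_loss + lam * inv_loss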
A. Tuan Nguyen¹, Toan Tran², Yarin Gal¹, Atılım Güneş Baydin¹ (¹University of Oxford, ²VinAI Research)
Abstract
Domain generalization refers to the problem where we aim to train a model on data from a set of source domains so that the model can generalize to unseen target domains. Naively training a model on the aggregate set of data (pooled from all source domains) has been shown to perform suboptimally, since the information learned by that model might be domain-specific and generalize imperfectly to target domains. To tackle this problem, a predominant approach is to find and learn some domain-invariant information in order to use it for the prediction task. In this paper, we propose a theoretically grounded method to learn a domain-invariant representation by enforcing the representation network to be invariant under all transformation functions among domains. We also show how to use generative adversarial networks to learn such domain transformations to implement our method in practice. We demonstrate the effectiveness of our method on several widely used datasets for the domain generalization problem, on all of which we achieve competitive results with state-of-the-art models.
1. Introduction
Domain generalization refers to the machine learning scenario in which we aim to train a model on data from a set of source domains so that it can generalize to unseen target domains.
Figure 1. An example of two domains. For each domain, x is uniformly distributed on the outer circle (radius 2 for domain 1 and radius 3 for domain 2), with the color indicating class label y. After the transformation z = x/||x||2, the marginal of z is aligned (uniformly distributed on the unit circle for both domains), but the conditional p(y|z) is not aligned. Thus, using this representation for predicting y would not generalize well across domains.
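The Figure 1 setup can be reproduced numerically. The sketch below is my own illustration (the exact class layout of the two domains is an assumption chosen to mimic the figure): after z = x/||x||₂ the marginals match, but the label distribution given z flips between domains.

import numpy as np

rng = np.random.default_rng(0)
n = 5000
theta1 = rng.uniform(0, 2 * np.pi, n)
theta2 = rng.uniform(0, 2 * np.pi, n)

# Domain 1: radius 2, label = 1 on the upper half of the circle.
x1 = 2.0 * np.stack([np.cos(theta1), np.sin(theta1)], axis=1)
y1 = (np.sin(theta1) > 0).astype(int)

# Domain 2: radius 3, label flipped on the upper half (assumed layout to mimic Figure 1).
x2 = 3.0 * np.stack([np.cos(theta2), np.sin(theta2)], axis=1)
y2 = (np.sin(theta2) < 0).astype(int)

# Representation z = x / ||x||_2: both domains land on the unit circle,
# so the marginal p(z|d) is aligned ...
z1 = x1 / np.linalg.norm(x1, axis=1, keepdims=True)
z2 = x2 / np.linalg.norm(x2, axis=1, keepdims=True)
print(np.allclose(np.linalg.norm(z1, axis=1), 1.0), np.allclose(np.linalg.norm(z2, axis=1), 1.0))

# ... but the conditional p(y|z) is not: for z in the upper half plane,
# y is almost surely 1 in domain 1 and almost surely 0 in domain 2.
upper1, upper2 = z1[:, 1] > 0, z2[:, 1] > 0
print(y1[upper1].mean(), y2[upper2].mean())   # ~1.0 vs ~0.0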
In the representation learning framework, the prediction y = f(x), where x is data and y is a label, is obtained as a composition y = h ◦ g(x) of a deep representation network z = g(x), where z is a learned representation of the data x, and a smaller classifier y = h(z), predicting the label y given the representation z, both of which are shared across domains. Current "domain-invariance"-based methods in domain generalization focus on either the marginal distribution alignment (Muandet et al., 2013) or the conditional distribution alignment (Li et al., 2018b;c), which are still prone to …
Q1) Does a domain-invariant representation function exist?
Q2) Is gθ(x) domain-invariant, i.e., does gθ(x) − gθ(fd,d'(x)) = 0 hold for all d, d'?
Theorem 1: Does a domain-invariant representation function exist?
Theorem 1. The invariance of p(y|d) across domains is the necessary and sufficient condition for the existence of a domain-invariant representation (that aligns both the marginal and conditional distribution).
In other words, 'p(y|d) being invariant across domains' and 'a domain-invariant representation (function) existing' are equivalent.
(⇒) 'A domain-invariant representation (function) exists' ⇒ 'p(y|d) is invariant across domains':
p(y,z|d) = p(y|z,d) p(z|d) = p(y|z,d') p(z|d') = p(y,z|d')
∴ p(y|d) = p(y|d') (∵ marginalizing over z)
The main difference between domain generalization (Khosla et al., 2012; Muandet et al., 2013; Ghifary et al., 2015) and domain adaptation (Zhao et al., 2019; Zhang et al., 2019; Combes et al., 2020; Tanwani, 2020) is that, in domain generalization, the learner does not have access to (even a small amount of) data of the target domain, making the problem much more challenging.
One of the most common domain generalization approaches is to learn an invariant representation across domains, aiming at a good generalization performance on target domains. Marginal alignment refers to making the representation distribution p(z) the same across domains. This is essential since, if p(z) for the target domain is different from that of the source domains, the classification network h(z) would face out-of-distribution data, because the representation z it receives as input at test time would be different from the ones it was trained with in the source domains. Conditional alignment refers to aligning p(y|z), the conditional distribution of the label given the representation, since if this conditional for the target domain is different from that of the source domains, the classification network (trained on the source domains) would give inaccurate predictions at test time. The formal definition of the two alignments is discussed in Section 3.
1) A ⇔ B means A ⇒ B and B ⇒ A.
(⇐) 'p(y|d) is invariant across domains' ⇒ 'a domain-invariant representation (function) exists':
If p(y|d) is unchanged w.r.t. the domain d, then we can always find a domain-invariant representation (this is trivial). For example, p(z|x) = δ0(z|x) for the deterministic case (that maps all x to 0), or p(z|x) = N(z; 0, 1) for the probabilistic case.
2) (⇒) 'A domain-invariant representation (function) exists' means there exists a representation z satisfying both the marginal and conditional distribution alignment; as shown above, this implies that p(y|d) is invariant across domains.
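For intuition, a two-line check (my own toy illustration) that the constant representation mentioned above trivially aligns the distributions while carrying no information about y, which is why the prediction loss l(y, gθ(x)) appears alongside the invariance term in the learning objective later:

import numpy as np

rng = np.random.default_rng(0)
x_d1, x_d2 = rng.normal(0, 1, 1000), rng.normal(5, 2, 1000)   # two very different domains

g = lambda x: np.zeros_like(x)        # the trivial representation p(z|x) = delta_0
z1, z2 = g(x_d1), g(x_d2)

# Marginal alignment holds trivially: p(z|d1) = p(z|d2) = delta_0,
# and conditional alignment also holds whenever p(y|d) is domain-invariant
# (the condition of Theorem 1), but z carries no information about y.
print(np.array_equal(np.unique(z1), np.unique(z2)))   # True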
Theorem 2: Is gθ(x) domain-invariant?
The invariance condition: gθ(x) − gθ(fd,d'(x)) = 0, i.e., gθ(x) = gθ(fd,d'(x)) for all x.
Theorem 2. Given an invertible and differentiable function fd,d' that transforms the data density from domain d to d' (with the inverse fd',d that transforms the data density from d' to d, as described above). Assuming that the representation z = gθ(x) satisfies the invariance condition above, then it aligns both the marginal and the conditional of the data distribution for domains d and d' (Marginal Alignment and Conditional Alignment).
Figure 3. Domain density transformation. If we know the function f1,2 that transforms the data density from domain 1 to domain 2, we can learn a domain-invariant representation network gθ(x) by enforcing it to be invariant under f1,2, i.e., gθ(x1) = gθ(x2) for any x2 = f1,2(x1).

Proof of Theorem 2. Write x = fd',d(x') for the change of variables from domain d' to domain d, and let J(x') denote the Jacobian matrix of fd',d evaluated at x'. Because fd',d transforms the data density of domain d' into that of domain d,
p(fd',d(x') | d) = p(x' | d') |det J(x')|⁻¹  and  p(fd',d(x') | y, d) = p(x' | y, d') |det J(x')|⁻¹,
and the invariance condition gθ(x) = gθ(fd,d'(x)) implies p(z | fd',d(x')) = p(z | x') for all x'.

i) Marginal alignment: ∀z we have
p(z|d) = ∫ p(x|d) p(z|x) dx
= ∫ p(fd',d(x')|d) p(z|fd',d(x')) |det J(x')| dx'    (substituting x = fd',d(x'))
= ∫ p(x'|d') |det J(x')|⁻¹ p(z|x') |det J(x')| dx'
= ∫ p(x'|d') p(z|x') dx' = p(z|d')    (8)

ii) Conditional alignment: ∀z, y we have
p(z|y,d) = ∫ p(x|y,d) p(z|x) dx
= ∫ p(fd',d(x')|y,d) p(z|fd',d(x')) |det J(x')| dx'
= ∫ p(x'|y,d') |det J(x')|⁻¹ p(z|x') |det J(x')| dx'
= ∫ p(x'|y,d') p(z|x') dx' = p(z|y,d')    (9)

Note that
p(y|z,d) = p(y,z|d) / p(z|d) = p(y|d) p(z|y,d) / p(z|d)    (10)
Since p(y|d) = p(y) = p(y|d') (the condition of Theorem 1), p(z|y,d) = p(z|y,d') and p(z|d) = p(z|d'), we have
p(y|z,d) = p(y|d') p(z|y,d') / p(z|d') = p(y|z,d')    (11)

This theorem indicates that, if we can find the functions f's that transform the data densities among the domains, we can learn a domain-invariant representation z by encouraging the representation to be invariant under all the transformations f's. This idea is illustrated in Figure 3. We therefore can use the following learning objective to learn a domain-invariant representation z = gθ(x):

Ed [ Ep(x,y|d) [ l(y, gθ(x)) + Ed' [ ||gθ(x) − gθ(fd,d'(x))||₂² ] ] ]    (12)

where l(y, gθ(x)) is the prediction loss of a network that predicts y given z = gθ(x), and the second term is to enforce the invariance condition in Eq 7.
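As a quick numerical sanity check of the marginal-alignment step (a toy example of mine, not from the paper): take two 1-D domains related by the shift fd,d'(x) = x + 3 and a representation gθ with period 3, hence invariant under the shift; Eq. 8 then predicts that the distribution of z is the same in both domains.

import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Domain d: x ~ N(0, 1); domain d': x ~ N(3, 1), i.e., the pushforward of d under f(x) = x + 3.
x_d = rng.normal(0.0, 1.0, n)
x_dp = rng.normal(3.0, 1.0, n)

# A representation invariant under the shift by 3 (any period-3 function works).
g = lambda x: np.sin(2 * np.pi * x / 3.0)
z_d, z_dp = g(x_d), g(x_dp)

# Compare the two marginals p(z|d) and p(z|d') via moments and histograms.
print(z_d.mean(), z_dp.mean())     # nearly identical
print(z_d.std(), z_dp.std())       # nearly identical
hist_d, edges = np.histogram(z_d, bins=50, range=(-1, 1), density=True)
hist_dp, _ = np.histogram(z_dp, bins=edges, density=True)
print(np.max(np.abs(hist_d - hist_dp)))   # small (sampling noise only)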
Assume that we have a set of K source domains Ds = {d1, d2, ..., dK}; the objective function in Eq. 12 becomes:

Ed,d'∈Ds, p(x,y|d) [ l(y, gθ(x)) + ||gθ(x) − gθ(fd,d'(x))||₂² ]    (13)

In the next section, we show how one can incorporate this idea into real-world domain generalization problems with generative adversarial networks.

4. Domain Generalization with Generative Adversarial Networks
In practice, we will learn the functions f's that transform the data distributions between domains, and one can use several generative modeling frameworks, e.g., normalizing flows (Grover et al., 2020) or GANs (Zhu et al., 2017; Choi et al., 2018; 2020), to learn such functions. One advantage of normalizing flows is that this transformation is naturally invertible by design of the neural network. In addition, the determinant of the Jacobian of that transformation can be efficiently computed. However, since we do not need access to the Jacobian once the training of the generative model is completed, we propose the use of GANs to inherit their rich network capacity. In particular, we use the StarGAN (Choi et al., 2018) model, which is designed for image domain transformations.

The goal of StarGAN is to learn a unified network G that transforms the data density among multiple domains. In particular, the network G(x, d, d') (i.e., G is conditioned on the image x and the two different domains d, d') transforms an image x from domain d to domain d'. Different from the original StarGAN model that only takes the image x and the desired destination domain d' as its input, in our implementation we feed both the original domain d and the desired destination domain d', together with the original image x, to the generator G.

The generator's goal is to fool a discriminator D into thinking that the transformed image belongs to the destination domain d'. In other words, the equilibrium state of StarGAN, in which G completely fools D, is when G successfully transforms the data density of the original domain to that of the destination domain. After training, we use G(., d, d') as the function fd,d'(.) described in the previous section and perform the representation learning via the objective function in Eq 13.

Three important loss functions of the StarGAN architecture are:
• The domain classification loss Lcls that encourages the generator G to generate images that correctly belong to the desired destination domain d'.
• The adversarial loss Ladv, which is the classification loss of a discriminator D that tries to distinguish between real images and the fake images generated by G. The equilibrium state of StarGAN is when G completely fools D, which means the distribution of the generated images (via G(x, d, d'), x ∼ p(x|d)) becomes the distribution of the real images of the destination domain p(x'|d'). This is our objective, i.e., to learn a function that transforms domains' densities.
• The reconstruction loss Lrec = Ex,d,d' [||x − G(x', d', d)||1], where x' = G(x, d, d'), to ensure that the transformations preserve the image's content. Note that this also aligns with our interest, since we want G(., d', d) to be the inverse of G(., d, d'), which will minimize Lrec to zero.

We can enforce the generator G to transform the data distribution within the class y (e.g., p(x|y, d) to p(x'|y, d') ∀y) by sampling each minibatch with data from the same class y, so that the discriminator will distinguish the transformed images from the real images of class y and domain d'. However, we found that this constraint can be relaxed in practice, and the generator almost always transforms the image within the original class y.

As mentioned earlier, after training the StarGAN model, we can use the generator G(., d, d') as our fd,d'(.) function and learn a domain-invariant representation via the learning objective in Eq 13. We name this implementation of our method DIR-GAN (domain-invariant representation learning with generative adversarial networks).
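A hedged sketch of one DIR-GAN training step under the description above; the module names and the assumption that the pre-trained generator is called as G(x, d, d') are illustrative, not the released implementation.

import torch
import torch.nn.functional as F

def dir_gan_step(rep_net, classifier, G, optimizer, x, y, d, num_domains):
    """One training step of the Eq. 13 objective with a frozen generator G (sketch).

    x : batch of images, y : labels, d : integer source-domain ids.
    G(x, d, d_prime) is assumed to translate x from domain d to domain d_prime.
    """
    # Sample a destination domain d' for every example (d' may occasionally equal d).
    d_prime = torch.randint(0, num_domains, d.shape, device=d.device)
    with torch.no_grad():                      # G is pre-trained and kept fixed
        x_t = G(x, d, d_prime)                 # f_{d,d'}(x)

    z, z_t = rep_net(x), rep_net(x_t)
    loss = F.cross_entropy(classifier(z), y) + ((z - z_t) ** 2).sum(dim=1).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()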
5. Domain Generalization with Generative Adversarial Networks (StarGAN; PR-152)
Figure 2 (StarGAN). Comparison between cross-domain models and StarGAN. (a) To handle multiple domains, cross-domain models should be built for every pair of image domains. (b) StarGAN is capable of learning mappings among multiple domains using a single generator; the figure represents a star topology connecting multi-domains.
Figure 3 (StarGAN). Overview of StarGAN, consisting of two modules, a discriminator D and a generator G. (a) D learns to distinguish between real and fake images and classify the real images to their corresponding domain. (b) G takes in as input both the image and the target domain label and generates a fake image; the target domain label is spatially replicated and concatenated with the input image. (c) G tries to reconstruct the original image from the fake image given the original domain label. (d) G tries to generate images indistinguishable from real images and classifiable as the target domain by D.
Conditional GANs. GAN-based conditional image generation has been actively studied. Prior studies have provided both the discriminator and generator with class information in order to generate samples conditioned on the class [20, 21, 22]. Other recent approaches focused on generating particular images highly relevant to a given text description [25, 30]. The idea of conditional image generation has also been successfully applied to domain transfer [9, 28], super-resolution imaging [14], and photo editing [2, 27]. In this paper, we propose a scalable GAN framework that can flexibly steer the image translation to various target domains, by providing conditional domain information.
Image-to-Image Translation. Recent work has achieved impressive results in image-to-image translation [7, 9, 17, 33]. For instance, pix2pix [7] learns this task in a supervised manner using cGANs [20]. It combines an adversarial loss with an L1 loss, thus requires paired data samples. To alleviate the problem of obtaining data pairs, unpaired frameworks have been proposed.
3. Star Generative Adversarial Networks
We first describe our proposed StarGAN, a framework to address multi-domain image-to-image translation within a single dataset. Then, we discuss how StarGAN incorporates multiple datasets containing different label sets to flexibly perform image translations using any of these labels.
3.1. Multi-Domain Image-to-Image Translation
Our goal is to train a single generator G that learns mappings among multiple domains. To achieve this, we train G to translate an input image x into an output image y conditioned on the target domain label c, G(x, c) → y. We randomly generate the target domain label c so that G learns to flexibly translate the input image. We also introduce an auxiliary classifier [22] that allows a single discriminator to control multiple domains. That is, our discriminator produces probability distributions over both sources and domain labels.
Training Strategy. When training StarGAN with multiple datasets, we use the domain label c̃ defined in Eq. (7) as input to the generator. By doing so, the generator learns to ignore the unspecified labels, which are zero vectors, and focus on the explicitly given label. The structure of the generator is exactly the same as in training with a single dataset, except for the dimension of the input label c̃. On the other hand, we extend the auxiliary classifier of the discriminator to generate probability distributions over labels for all datasets. Then, we train the model in a multi-task learning setting, where the discriminator tries to minimize only the classification error associated to the known label. For example, when training with images in CelebA, the discriminator minimizes only classification errors for labels related to CelebA attributes, and not facial expressions related to RaFD. Under these settings, by alternating between CelebA and RaFD, the discriminator learns all of the discriminative features for both datasets, and the generator learns to control all the labels in both datasets.
Reconstruction Loss. By minimizing the adversarial and classification losses, G is trained to generate images that are realistic and classified to the correct target domain. However, minimizing these losses (Eqs. (1) and (3)) does not guarantee that translated images preserve the content of the input images while changing only the domain-related part. To alleviate this problem, we apply a cycle consistency loss [9, 33] to the generator, defined as
Lrec = Ex,c,c' [||x − G(G(x, c), c')||1],    (4)
where G takes in the translated image G(x, c) and the original domain label c' as input and tries to reconstruct the original image x. We adopt the L1 norm as our reconstruction loss. Note that we use a single generator twice, first to translate an original image into an image in the target domain and then to reconstruct the original image from the translated image.
Full Objective. Finally, the objective functions to optimize G and D are written, respectively, as
LD = −Ladv + λcls Lr_cls,    (5)
LG = Ladv + λcls Lf_cls + λrec Lrec,    (6)
where λcls and λrec are hyper-parameters that control the relative importance of domain classification and reconstruction losses, respectively, compared to the adversarial loss. We use λcls = 1 and λrec = 10 in all of our experiments.
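For reference, a compact sketch of how the StarGAN objectives combine, using the quoted hyper-parameters λcls = 1 and λrec = 10. The module interfaces and the binary-cross-entropy form of the adversarial term are assumptions for illustration, not the official StarGAN code.

import torch
import torch.nn.functional as F

lambda_cls, lambda_rec = 1.0, 10.0

def stargan_losses(G, D, x, c_orig, c_trg):
    """Discriminator and generator losses in the spirit of Eqs. (1)-(6) (sketch).

    D(x) is assumed to return (src_logit, cls_logits): a real/fake score and
    domain-classification logits; c_orig and c_trg are integer domain labels.
    """
    x_fake = G(x, c_trg)

    # Discriminator: adversarial term plus classification of real images (Eqs. 1, 2, 5).
    src_real, cls_real = D(x)
    src_fake, _ = D(x_fake.detach())
    d_adv = F.binary_cross_entropy_with_logits(src_real, torch.ones_like(src_real)) + \
            F.binary_cross_entropy_with_logits(src_fake, torch.zeros_like(src_fake))
    d_cls = F.cross_entropy(cls_real, c_orig)
    loss_D = d_adv + lambda_cls * d_cls

    # Generator: fool D, classify fakes as the target domain, cycle reconstruction (Eqs. 1, 3, 4, 6).
    src_fake, cls_fake = D(x_fake)
    g_adv = F.binary_cross_entropy_with_logits(src_fake, torch.ones_like(src_fake))
    g_cls = F.cross_entropy(cls_fake, c_trg)
    g_rec = (x - G(x_fake, c_orig)).abs().mean()     # L1 cycle-consistency loss
    loss_G = g_adv + lambda_cls * g_cls + lambda_rec * g_rec

    return loss_D, loss_G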
5. Experiments
5.1. Datasets
To evaluate our method, we perform experiments on three datasets that are commonly used in the literature for domain generalization.
Rotated MNIST. In this dataset by Ghifary et al. (2015), 1,000 MNIST images (100 per class) (LeCun & Cortes, 2010) are chosen to form the first domain (denoted M0); then rotations of 15°, 30°, 45°, 60° and 75° are applied to create five additional domains, denoted M15, M30, M45, M60 and M75. The task is classification with ten classes (digits 0 to 9).
PACS (Li et al., 2017) contains 9,991 images from four different domains: art painting, cartoon, photo, sketch. The task is classification with seven classes.
OfficeHome (Venkateswara et al., 2017) has 15,500 images of daily objects from four domains: art, clipart, product and real. There are 65 classes in this classification dataset.
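The Rotated MNIST domains described above could be constructed roughly as follows (a sketch assuming torchvision and SciPy are available; the subsampling seed and preprocessing are illustrative, not the authors' exact pipeline):

import numpy as np
from torchvision import datasets
from scipy.ndimage import rotate

mnist = datasets.MNIST(root="./data", train=True, download=True)
images = mnist.data.numpy()          # (60000, 28, 28)
labels = mnist.targets.numpy()

# 100 images per class -> 1,000 images for the base domain M0.
rng = np.random.default_rng(0)
idx = np.concatenate([rng.choice(np.where(labels == c)[0], 100, replace=False) for c in range(10)])
base = images[idx]

# Each additional domain is the same 1,000 images rotated by a fixed angle.
domains = {f"M{a}": np.stack([rotate(img, a, reshape=False) for img in base]) for a in (0, 15, 30, 45, 60, 75)}
print({k: v.shape for k, v in domains.items()})   # each (1000, 28, 28)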
Table 1. Rotated MNIST leave-one-domain-out experiment. Reported numbers are mean accuracy and standard deviation among 5 runs.
Model | M0 | M15 | M30 | M45 | M60 | M75 | Average
Domain Classification Loss. For a given input image x and a target domain label c, our goal is to translate x into an output image y, which is properly classified to the target domain c. To achieve this condition, we add an auxiliary classifier on top of D and impose the domain classification loss when optimizing both D and G. That is, we decompose the objective into two terms: a domain classification loss of real images used to optimize D, and a domain classification loss of fake images used to optimize G. In detail, the former is defined as
Lr_cls = Ex,c' [− log Dcls(c'|x)],    (2)
where the term Dcls(c'|x) represents a probability distribution over domain labels computed by D. By minimizing this objective, D learns to classify a real image x to its corresponding original domain c'. We assume that the input image and domain label pair (x, c') is given by the training data. On the other hand, the loss function for the domain classification of fake images is defined as
Lf_cls = Ex,c[− log Dcls(c|G(x, c))].    (3)
In other words, G tries to minimize this objective to generate images that can be classified as the target domain c.
3.2. Training with Multiple Datasets
An important advantage of StarGAN is that it simultaneously incorporates multiple datasets containing different types of labels, so that StarGAN can control all the labels at the test phase. An issue when learning from multiple datasets, however, is that the label information is only partially known to each dataset. In the case of CelebA [19] and RaFD [13], while the former contains labels for attributes such as hair color and gender, it does not have any labels for facial expressions such as 'happy' and 'angry', and vice versa for the latter. This is problematic because the complete information on the label vector c' is required when reconstructing the input image x from the translated image G(x, c) (see Eq. (4)).
Mask Vector. To alleviate this problem, we introduce a mask vector m that allows StarGAN to ignore unspecified labels and focus on the explicitly known label provided by a particular dataset. In StarGAN, we use an n-dimensional one-hot vector to represent m, with n being the number of datasets. In addition, we define a unified version of the label as a vector
c̃ = [c1, ..., cn, m],    (7)
where [·] refers to concatenation, and ci represents a vector for the labels of the i-th dataset. The vector of the known label ci can be represented as either a binary vector for binary attributes or a one-hot vector for categorical attributes. For the remaining n−1 unknown labels we simply assign zero values.
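A tiny sketch of the unified label vector c̃ = [c1, ..., cn, m] from Eq. (7) for the two-dataset (CelebA + RaFD) case described above; the attribute and expression dimensions are made-up placeholders:

import numpy as np

def unified_label(celeba_attrs=None, rafd_expr=None):
    """Build c~ = [c_CelebA, c_RaFD, m] with a one-hot mask over the two datasets.

    celeba_attrs : binary vector of CelebA attributes (assumed length 5), or None
    rafd_expr    : one-hot vector of RaFD expressions (assumed length 8), or None
    Unknown labels are filled with zeros, matching the training strategy above.
    """
    c_celeba = np.zeros(5) if celeba_attrs is None else np.asarray(celeba_attrs, float)
    c_rafd = np.zeros(8) if rafd_expr is None else np.asarray(rafd_expr, float)
    m = np.array([1.0, 0.0]) if rafd_expr is None else np.array([0.0, 1.0])
    return np.concatenate([c_celeba, c_rafd, m])

print(unified_label(celeba_attrs=[1, 0, 0, 1, 1]))   # CelebA sample: RaFD part is zeroed
print(unified_label(rafd_expr=np.eye(8)[2]))         # RaFD sample: CelebA part is zeroed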
Generative Adversarial Networks. Generative adversarial networks (GANs) [3] have shown remarkable results in computer vision tasks such as image generation, image translation [7, 9, 33], super-resolution imaging [14], and face image synthesis [10, 16, 26, 31]. A GAN model consists of two modules: a discriminator and a generator. The discriminator learns to distinguish between real and fake samples, while the generator learns to generate fake samples that are indistinguishable from real samples. Our approach also leverages the adversarial loss to make the generated images as realistic as possible.
CycleGAN [33] and DiscoGAN [9] preserve key attributes between the input and the translated image by utilizing a cycle consistency loss. However, all these frameworks are only capable of learning the relations between two different domains at a time. Their approaches have limited scalability in handling multiple domains, since different models should be trained for each pair of domains. Unlike the aforementioned approaches, our framework can learn the relations among multiple domains using only a single model.
Adversarial Loss. To make the generated images indistinguishable from real images, we adopt an adversarial loss
Ladv = Ex [log Dsrc(x)] + Ex,c[log (1 − Dsrc(G(x, c)))],    (1)
where G generates an image G(x, c) conditioned on both the input image x and the target domain label c, while D tries to distinguish between real and fake images. In this paper, we refer to the term Dsrc(x) as a probability distribution over sources given by D. The generator G tries to minimize this objective, while the discriminator D tries to maximize it.
In practice, we will learn the functions f’s that transform
the data distributions between domains and one can use
several generative modeling frameworks, e.g., normalizing
flows (Grover et al., 2020) or GANs (Zhu et al., 2017; Choi
et al., 2018; 2020) to learn such functions. One advantage
of normalizing flows is that this transformation is naturally
invertible by design of the neural network. In addition, the
determinant of the Jacobian of that transformation can be
efficiently computed. However, due to the fact that we do
not need access to the Jacobian when the training process
of the generative model is completed, we propose the use
of GANs to inherit its rich network capacity. In particular,
we use the StarGAN (Choi et al., 2018) model, which is
designed for image domain transformations.
The goal of StarGAN is to learn a unified network G that transforms the data density among multiple domains. In particular, the network G(x, d, d′) (i.e., G is conditioned on the image x and the two different domains d, d′) transforms an image x from domain d to domain d′. Different from the original StarGAN model, which only takes the image x and the desired destination domain d′ as its input, in our implementation we feed both the original domain d and the desired destination domain d′ together with the original image x to the generator G.

The generator's goal is to fool a discriminator D into thinking that the transformed image belongs to the destination domain d′. In other words, the equilibrium state of StarGAN, in which G completely fools D, is when G successfully transforms the data density of the original domain to that of the destination domain. After training, we use G(·, d, d′) as the function f_{d,d′}(·) described in the previous section and perform the representation learning via the objective function in Eq. 13.

Three important loss functions of the StarGAN architecture are:

• Adversarial loss L_adv (Eq. 1) that pushes the translated images to be indistinguishable from real images of the destination domain.
• Domain classification loss L_cls that encourages the generator G to generate images that correctly belong to the desired destination domain d′.
• Reconstruction (cycle consistency) loss L_rec that encourages G to preserve the content of its input; at the equilibrium, we want G(·, d′, d) to be the inverse of G(·, d, d′), which will minimize L_rec to zero.

We can enforce the generator G to transform the data distribution within the class y (e.g., p(x|y, d) to p(x′|y, d′) for all y) by sampling each minibatch with data from the same class y, so that the discriminator distinguishes the transformed images from the real images of class y and domain d′. However, we found that this constraint can be relaxed in practice, and the generator almost always transforms the image within the original class y.

As mentioned earlier, after training the StarGAN model, we can use the generator G(·, d, d′) as our f_{d,d′}(·) function and learn a domain-invariant representation via the learning objective in Eq. 13. We name this implementation of our method DIR-GAN (domain-invariant representation learning with generative adversarial networks).
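Because the objective in Eq. 13 only needs the frozen generator G and the representation network g_θ, it is straightforward to write down. Below is a minimal sketch (not the authors' released code), assuming hypothetical modules g_theta (representation network), h (linear classifier), and a pretrained StarGAN-style generator G(x, d, d_prime); the trade-off weight lam is an illustrative addition not present in Eq. 13.

```python
# Minimal sketch of the DIR-GAN objective of Eq. 13 for a single minibatch.
import torch
import torch.nn.functional as F

def dir_gan_loss(g_theta, h, G, x, y, d, d_prime, lam=1.0):
    z = g_theta(x)                        # representation of the original image
    with torch.no_grad():                 # G is kept fixed after StarGAN training
        x_t = G(x, d, d_prime)            # f_{d,d'}(x): translate x to domain d'
    z_t = g_theta(x_t)                    # representation of the translated image
    cls_loss = F.cross_entropy(h(z), y)   # l(y, g_theta(x)) with classifier h
    inv_loss = (z - z_t).pow(2).sum(dim=1).mean()   # ||g(x) - g(f_{d,d'}(x))||_2^2
    return cls_loss + lam * inv_loss      # lam is a hypothetical trade-off weight
```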
6. Experiments / Results
1) Dataset
2) Results
5. Experiments
5.1. Datasets
To evaluate our method, we perform experiments in three
datasets that are commonly used in the literature for domain
generalization.
Rotated MNIST. In this dataset by Ghifary et al. (2015), 1,000 MNIST images (100 per class) (LeCun & Cortes, 2010) are chosen to form the first domain (denoted M0); then rotations of 15°, 30°, 45°, 60° and 75° are applied to create five additional domains, denoted M15, M30, M45, M60 and M75. The task is classification with ten classes (digits 0 to 9).
PACS (Li et al., 2017) contains 9,991 images from four
different domains: art painting, cartoon, photo, sketch. The
task is classification with seven classes.
OfficeHome (Venkateswara et al., 2017) has 15,500 im-
ages of daily objects from four domains: art, clipart, product
and real. There are 65 classes in this classification dataset.
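For concreteness, the rotated domains can be generated directly from a fixed MNIST subset. The sketch below is an illustration only; how the 1,000 images are sampled and batched is an assumption, not the original data pipeline.

```python
# Minimal sketch of building the six Rotated MNIST domains M0 ... M75.
import torch
import torchvision.transforms.functional as TF

ANGLES = [0, 15, 30, 45, 60, 75]          # domains M0, M15, ..., M75

def make_rotated_domains(images):
    """images: float tensor of shape (N, 1, 28, 28) holding the chosen digits."""
    return {f"M{a}": torch.stack([TF.rotate(img, float(a)) for img in images])
            for a in ANGLES}
```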
Table 1. Rotated MNIST leave-one-domain-out experiment. Reported numbers are mean accuracy and standard deviation among 5 runs.
Domains
Model M0 M15 M30 M45 M60 M75 Average
HIR (Wang et al., 2020) 90.34 99.75 99.40 96.17 99.25 91.26 96.03
DIVA (Ilse et al., 2020) 93.5 99.3 99.1 99.2 99.3 93.0 97.2
DGER (Zhao et al., 2020) 90.09 99.24 99.27 99.31 99.45 90.81 96.36
DA (Ganin et al., 2016) 86.7 98.0 97.8 97.4 96.9 89.1 94.3
LG (Shankar et al., 2018) 89.7 97.8 98.0 97.1 96.6 92.1 95.3
HEX (Wang et al., 2019) 90.1 98.9 98.9 98.8 98.3 90.0 95.8
ADV (Wang et al., 2019) 89.9 98.6 98.8 98.7 98.6 90.4 95.2
DIR-GAN (ours) 97.2(±0.3) 99.4(±0.1) 99.3(±0.1) 99.3(±0.1) 99.2(±0.1) 97.1(±0.3) 98.6
Table 2. PACS leave-one-domain-out experiment. Reported numbers are mean accuracy and standard deviation among 5 runs
PACS
Model Backbone Art Painting Cartoon Photo Sketch Average
DGER (Zhao et al., 2020) Resnet18 80.70 76.40 96.65 71.77 81.38
JiGen (Carlucci et al., 2019) Resnet18 79.42 75.25 96.03 71.35 79.14
MLDG (Li et al., 2018a) Resnet18 79.50 77.30 94.30 71.50 80.70
MetaReg (Balaji et al., 2018) Resnet18 83.70 77.20 95.50 70.40 81.70
CSD (Piratla et al., 2020) Resnet18 78.90 75.80 94.10 76.70 81.40
DMG (Chattopadhyay et al., 2020) Resnet18 76.90 80.38 93.35 75.21 81.46
DIR-GAN (ours) Resnet18 82.56(± 0.4) 76.37(± 0.3) 95.65(± 0.5) 79.89(± 0.2) 83.62
Table 3. OfficeHome leave-one-domain-out experiment. Reported numbers are mean accuracy and standard deviation among 5 runs
OfficeHome
Model Backbone Art ClipArt Product Real Average
D-SAM (D’Innocente & Caputo, 2018) Resnet18 58.03 44.37 69.22 71.45 60.77
JiGen (Carlucci et al., 2019) Resnet18 53.04 47.51 71.47 72.79 61.20
DIR-GAN (ours) Resnet18 56.69(±0.4) 50.49(±0.2) 71.32(±0.4) 74.23(±0.5) 63.18
5.2. Experimental Setting
For all datasets, we perform “leave-one-domain-out” exper-
iments, where we choose one domain as the target domain,
train the model on all remaining domains and evaluate it
on the chosen domain. Following standard practice, we use
90% of available data as training data and 10% as validation
data, except for the Rotated MNIST experiment where we
do not use a validation set and just report the performance
of the last epoch.
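The leave-one-domain-out protocol above can be summarized in a few lines. The following sketch is illustrative only, assuming a hypothetical domain_datasets mapping from domain name to a list of samples; the 90%/10% train/validation split follows the text (Rotated MNIST skips validation).

```python
# Minimal sketch of the leave-one-domain-out split described above.
import random

def leave_one_domain_out(domain_datasets, target_domain, val_fraction=0.1, seed=0):
    train, val = [], []
    rng = random.Random(seed)
    for name, samples in domain_datasets.items():
        if name == target_domain:
            continue                       # the held-out target domain is never seen
        samples = list(samples)
        rng.shuffle(samples)
        n_val = int(len(samples) * val_fraction)
        val.extend(samples[:n_val])
        train.extend(samples[n_val:])
    return train, val, domain_datasets[target_domain]
```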
For the Rotated MNIST dataset, we use a network of two
3x3 convolutional layers and a fully connected layer as the
representation network gθ to get a representation z of 64
dimensions. A single linear layer is then used to map the
representation z to the ten output classes. This architecture
is the deterministic version of the network used by Ilse et al.
(2020). We train our network for 500 epochs with the Adam
optimizer (Kingma & Ba, 2014), using learning rate 0.001 and minibatch size 64, and report performance on the test domain after the last epoch.
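A minimal sketch of this representation network is given below. The channel widths and pooling are assumptions, since the text only specifies two 3x3 convolutional layers, a fully connected layer producing a 64-dimensional z, and a single linear classifier.

```python
# Minimal sketch of the Rotated MNIST architecture described above.
import torch.nn as nn

g_theta = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 7 * 7, 64),             # 64-dimensional representation z
)
classifier = nn.Linear(64, 10)              # maps z to the ten digit classes
```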
For the PACS and OfficeHome datasets, we use a Resnet18
(He et al., 2016) network as the representation network gθ.
As a standard practice, the Resnet18 backbone is pre-trained
on ImageNet. We replace the last fully connected layer of
the Resnet with a linear layer of dimensions (512, 256) so
that our representation has 256 dimensions. As with the
Rotated MNIST experiment, we use a single layer to map
from the representation z to the output. We train the network
for 100 epochs with plain stochastic gradient descent (SGD)
using learning rate 0.001, momentum 0.9, minibatch size
64, and weight decay 0.001. Data augmentation is also
standard practice for real-world computer vision datasets
like PACS and OfficeHome, and during the training we
augment our data as follows: crops of random size and
aspect ratio, resizing to 224 × 224 pixels, random horizontal
flips, random color jitter, randomly converting the image
tile to grayscale with 10% probability, and normalization using the ImageNet channel means and standard deviations.
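The backbone and augmentation pipeline described above can be sketched as follows. The exact color-jitter strengths are assumptions (the text only says "random color jitter"), and torchvision transform names are used for illustration.

```python
# Minimal sketch of the PACS/OfficeHome setup described above.
import torch.nn as nn
from torchvision import models, transforms

g_theta = models.resnet18(pretrained=True)   # ImageNet-pretrained backbone
g_theta.fc = nn.Linear(512, 256)             # 256-dimensional representation z
classifier = nn.Linear(256, 7)               # 7 classes for PACS (65 for OfficeHome)

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),                       # random size/aspect-ratio crop
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),              # assumed jitter strengths
    transforms.RandomGrayscale(p=0.1),                       # grayscale with 10% probability
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
```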
Returning to the StarGAN generator that implements the domain transformations f_{d,d′}: besides the adversarial loss of Eq. (1), its training uses a domain classification loss and a reconstruction loss.

Domain Classification Loss. For a given input image x and a target domain label c, the goal is to translate x into an output image that is properly classified to the target domain c. To this end, an auxiliary classifier is added on top of D and a domain classification loss is imposed when optimizing both D and G. The objective is decomposed into two terms: a domain classification loss of real images used to optimize D,

L^r_cls = E_{x,c′}[− log D_cls(c′ | x)],   (2)

where D_cls(c′ | x) represents a probability distribution over domain labels computed by D, and a domain classification loss of fake images used to optimize G,

L^f_cls = E_{x,c}[− log D_cls(c | G(x, c))].   (3)

By minimizing Eq. (2), D learns to classify a real image x to its corresponding original domain c′; by minimizing Eq. (3), G tries to generate images that can be classified as the target domain c.

Reconstruction Loss. Minimizing the adversarial and classification losses trains G to generate images that are realistic and classified to the correct target domain. However, minimizing the losses in Eqs. (1) and (3) does not guarantee that translated images preserve the content of their inputs while changing only the domain-related part. To alleviate this problem, we apply a cycle consistency loss [9, 33] to the generator, defined as

L_rec = E_{x,c,c′}[ ||x − G(G(x, c), c′)||_1 ],   (4)

where G takes in the translated image G(x, c) and the original domain label c′ as input and tries to reconstruct the original image x. We adopt the L1 norm as our reconstruction loss. Note that we use a single generator twice: first to translate an original image into an image in the target domain, and then to reconstruct the original image from the translated image.

Full Objective. Finally, the objective functions to optimize G and D are written, respectively, as

L_D = −L_adv + λ_cls L^r_cls,   (5)
L_G = L_adv + λ_cls L^f_cls + λ_rec L_rec,   (6)

where λ_cls and λ_rec are hyper-parameters that control the relative importance of the domain classification and reconstruction losses, respectively, compared to the adversarial loss. We use λ_cls = 1 and λ_rec = 10 in all of our experiments.
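A compact sketch of how these terms combine into the objectives of Eqs. (5) and (6) is given below. It is illustrative only, not the official StarGAN code: d_cls_real and d_cls_fake stand for the auxiliary-classifier logits of D for real and translated images, and l_adv is the adversarial term of Eq. (1), e.g., computed as in the earlier sketch.

```python
# Minimal sketch of the StarGAN objectives in Eqs. (2)-(6); not the official code.
import torch.nn.functional as F

def stargan_objectives(G, x, x_fake, c_org, c_trg, d_cls_real, d_cls_fake,
                       l_adv, lambda_cls=1.0, lambda_rec=10.0):
    l_cls_real = F.cross_entropy(d_cls_real, c_org)  # Eq. (2): classify real x as its domain c'
    l_cls_fake = F.cross_entropy(d_cls_fake, c_trg)  # Eq. (3): classify G(x, c) as the target c
    x_rec = G(x_fake, c_org)                         # translate back with the original label
    l_rec = (x - x_rec).abs().mean()                 # Eq. (4): L1 cycle-consistency loss
    loss_D = -l_adv + lambda_cls * l_cls_real        # Eq. (5)
    loss_G = l_adv + lambda_cls * l_cls_fake + lambda_rec * l_rec   # Eq. (6)
    return loss_D, loss_G                            # in practice D and G are updated alternately
```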
Mask Vector. An issue when training StarGAN on multiple datasets (in the original paper, CelebA and RaFD) is that the label information is only partially known to each dataset: CelebA has attribute labels such as hair color and gender but no facial-expression labels, and vice versa for RaFD. This is problematic because the complete label vector c is required when reconstructing the input image x from the translated image G(x, c) (see Eq. (4)). To alleviate this problem, a mask vector m is introduced that allows StarGAN to ignore unspecified labels and focus on the explicitly known label provided by a particular dataset. An n-dimensional one-hot vector represents m, with n being the number of datasets, and a unified version of the label is defined as a vector

c̃ = [c_1, ..., c_n, m],   (7)

where [·] refers to concatenation, and c_i represents a vector for the labels of the i-th dataset. The vector of the known label c_i can be represented as either a binary vector for binary attributes or a one-hot vector for categorical attributes. For the remaining n − 1 unknown labels we simply assign zero values. In the experiments, the CelebA and RaFD datasets are used, where n is two.

Training Strategy. When training StarGAN with multiple datasets, we use the domain label c̃ defined in Eq. (7) as input to the generator. By doing so, the generator learns to ignore the unspecified labels, which are zero vectors, and to focus on the explicitly given label. The structure of the generator is exactly the same as in training with a single dataset, except for the dimension of the input label c̃. On the other hand, we extend the auxiliary classifier of the discriminator to generate probability distributions over labels for all datasets. Then, we train the model in a multi-task learning setting, where the discriminator tries to minimize only the classification error associated to the known label. For example, when training with images in CelebA, the discriminator minimizes only the classification errors related to CelebA attributes, and not the facial expressions related to RaFD. Under these settings, by alternating between CelebA and RaFD, the discriminator learns all of the discriminative features for both datasets, and the generator learns to control all the labels in both datasets.
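The unified label of Eq. (7) can be sketched as follows for n = 2 (CelebA attributes and RaFD expressions). The label dimensions are assumptions chosen for illustration, not values from the text.

```python
# Minimal sketch of the unified label c_tilde = [c1, c2, m] from Eq. (7).
import torch

def unified_label(celeba_attrs=None, rafd_onehot=None, n_celeba=5, n_rafd=8):
    """Exactly one of celeba_attrs / rafd_onehot is expected to be given."""
    c1 = celeba_attrs if celeba_attrs is not None else torch.zeros(n_celeba)
    c2 = rafd_onehot if rafd_onehot is not None else torch.zeros(n_rafd)
    # mask m is a one-hot over the n = 2 datasets, marking which label is known
    m = torch.tensor([1.0, 0.0]) if celeba_attrs is not None else torch.tensor([0.0, 1.0])
    return torch.cat([c1, c2, m])
```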
3) Visualization of Representation
Figure 4. Visualization of the representation space. Each point indicates a representation z of an image x in the two dimensional space
and its color indicates the label y. Two left figures are for our method DIR-GAN and two right figures are for the naive model DeepAll.
The StarGAN (Choi et al., 2018) model implementation
is taken from the authors’ original source code with no
significant modifications. For each set of source domains,
we train the StarGAN model for 100,000 iterations with a
minibatch of 16 images per iteration.
The code for all of our experiments will be released for
reproducibility. Please also refer to the source code for any
other architecture and implementation details.
As shown in Figure 4, the representation learned by our method aligns both the marginal distribution (i.e., the general distribution of the points) and the conditional distribution (for example, the distributions of blue points and green points).
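A Figure-4-style plot can be produced with a simple scatter of the representations. The sketch below assumes the two plotted coordinates are already available (either a two-dimensional z or a 2-D projection such as t-SNE); that choice is an assumption, not a detail given in the text.

```python
# Minimal sketch of a representation-space scatter plot colored by label y.
import matplotlib.pyplot as plt

def plot_representation(z_2d, y):
    """z_2d: array of shape (N, 2); y: array of N integer class labels."""
    plt.scatter(z_2d[:, 0], z_2d[:, 1], c=y, cmap="tab10", s=5)
    plt.xlabel("z[0]")
    plt.ylabel("z[1]")
    plt.title("Representation space colored by label y")
    plt.show()
```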
PACS and OfficeHome. To the best of our knowledge,
domain invariant representation learning methods have not
been applied widely and successfully for real-world com-
puter vision datasets (e.g., PACS and OfficeHome) with
very deep neural networks such as Resnet.
Tuan Nguyen 1 Toan Tran 2 Yarin Gal 1 Atılım Güneş Baydin 1 Abstract Domain generalization refers to the problem where we aim to train a model on data from a set of source domains so that the model can gen- eralize to unseen target domains. Naively training a model on the aggregate set of data (pooled from all source domains) has been shown to perform suboptimally, since the information learned by that model might be domain-specific and general- ize imperfectly to target domains. To tackle this problem, a predominant approach is to find and learn some domain-invariant information in order to use it for the prediction task. In this paper, we propose a theoretically grounded method to learn a domain-invariant representation by enforcing the representation network to be invariant under all transformation functions among domains. We also show how to use generative adversarial net- works to learn such domain transformations to implement our method in practice. We demon- strate the effectiveness of our method on several widely used datasets for the domain generaliza- tion problem, on all of which we achieve compet- itive results with state-of-the-art models. 1. Introduction Domain generalization refers to the machine learning sce- !"#$ !"%$ !"#$ !"%$ !"#$%&'()'% & #' # ( *%+,'-"."/'%&0%-$+%&1'2 !"#$%&'3)'% & #' # ( *%+,'-"."/'%&0%-$+%&1'2 Figure 1. An example of two domains. For each domain, x is uniformly distributed on the outer circle (radius 2 for domain 1 and radius 3 for domain 2), with the color indicating class label y. After the transformation z = x/||x||2, the marginal of z is aligned (uniformly distributed on the unit circle for both domains), but the conditional p(y|z) is not aligned. Thus, using this representation for predicting y would not generalize well across domains. In the representation learning framework, the prediction y = f(x), where x is data and y is a label, is obtained as a composition y = h ◦ g(x) of a deep representation network z = g(x), where z is a learned representation of data x, and a smaller classifier y = h(z), predicting label y given representation z, both of which are shared across domains. Current “domain-invariance”-based methods in domain gen- eralization focus on either the marginal distribution align- ment (Muandet et al., 2013) or the conditional distribution alignment (Li et al., 2018b;c), which are still prone to distri- v:2102.05082v2 [cs.LG] 14 Feb 2021 1) Domain-invariant representation function이�존재하는가? Q 2) 가 Domain-invariant 한가? d,d' gθ(x) ( - =0 ) gθ f (x)
  • 4. Domain Invariant Representation Learning with Domain Density Transformations Theorem 1; Domain-invariant representation function이�존재하는가? Theorem 1. The invariance of p(y|d) across domains is the necessary and sufficient condition for the existence of a domain-invariant representation (that aligns both the marginal and conditional distribution). ‘p(y|d)의 도메인에 따른 불변함’과 ‘ Domain-invariant representation(function)이 존재하는 것’은 동치이다. p(y,z|d) p(y|z,d) p(z|d) = = = p(y|z,d') p(z|d') p(y,z|d') p(y|d) ∴ = p(y|d') marginalizing ∵ over z ‘ Domain-invariant representation(function)이 존재하는 것’ ‘p(y|d)의 도메인에 따른 불변함’ tion (Khosla et al., 2012; Muandet et al., 2013; Ghifary et al., 2015) and domain adaptation (Zhao et al., 2019; Zhang et al., 2019; Combes et al., 2020; Tanwani, 2020) is that, in do- main generalization, the learner does not have access to (even a small amount of) data of the target domain, making the problem much more challenging. One of the most common domain generalization approaches is to learn an invariant representation across domains, aim- ing at a good generalization performance on target domains. 1 University of Oxford 2 VinAI Research. Correspondence to: A. Tuan Nguyen <tuan.nguyen@cs.ox.ac.uk>. alignment refers to making the representation distribution p(z) to be the same across domains. This is essential since if p(z) for the target domain is different from that of source domains, the classification network h(z) would face out- of-distribution data because the representation z it receives as input at test time would be different from the ones it was trained with in source domains. Conditional alignment refers to aligning p(y|z), the conditional distribution of the label given the representation, since if this conditional for the target domain is different from that of the source domains, the classification network (trained on the source domains) would give inaccurate predictions at test time. The formal definition of the two alignments is discussed in Section 3. tion (Khosla et al., 2012; Muandet et al., 2013; Ghifary et al., 2015) and domain adaptation (Zhao et al., 2019; Zhang et al., 2019; Combes et al., 2020; Tanwani, 2020) is that, in do- main generalization, the learner does not have access to (even a small amount of) data of the target domain, making the problem much more challenging. One of the most common domain generalization approaches is to learn an invariant representation across domains, aim- ing at a good generalization performance on target domains. 1 University of Oxford 2 VinAI Research. Correspondence to: A. Tuan Nguyen <tuan.nguyen@cs.ox.ac.uk>. alignment refers to making the representation distribution p(z) to be the same across domains. This is essential since if p(z) for the target domain is different from that of source domains, the classification network h(z) would face out- of-distribution data because the representation z it receives as input at test time would be different from the ones it was trained with in source domains. Conditional alignment refers to aligning p(y|z), the conditional distribution of the label given the representation, since if this conditional for the target domain is different from that of the source domains, the classification network (trained on the source domains) would give inaccurate predictions at test time. The formal definition of the two alignments is discussed in Section 3. Domain Invariant Representation Learning with Domain Density Transformations ze this objective, while the discriminator D tries to ze it. 
n Classification Loss. For a given input image x 3.2. Training with Multiple Datasets An important advantage of StarGAN is that it simulta- neously incorporates multiple datasets containing different minimize this objective, while the discriminator D tries to maximize it. Domain Classification Loss. For a given input image x and a target domain label c, our goal is to translate x into an output image y, which is properly classified to the target 3.2. Training with Multiple Datasets An important advantage of StarGAN is neously incorporates multiple datasets conta types of labels, so that StarGAN can contro → → 1) A ⇔ B, A⇒B and B⇒A ‘ Domain-invariant representation(function)이 존재하는 것’ ‘p(y|d)의 도메인에 따른 불변함’ If is unchanged w.r.t. the domain d, then we can always find a domain invariant representation(This is trivial). p(y|d) For example, for the deterministic case(that maps all x to 0), or for the probabilistic case. p(z|x) (z|x) = δ0 p(z|x) (z;0,1) = N → 2) Domain-invariant representation(function)이 존재한다 Marginal and Conditional Distribution Alignment를 만족하는 표현 z 가 존재한다.
• 5. Domain Invariant Representation Learning with Domain Density Transformations

Theorem 2: is $g_\theta(x)$ domain-invariant? The invariance condition is $g_\theta(x) - g_\theta(f_{d,d'}(x)) = 0$ for every $x$ and every pair of domains $d, d'$.

Theorem 2. Given an invertible and differentiable function $f_{d,d'}$ that transforms the data density from domain $d$ to domain $d'$ (with the inverse $f_{d',d}$ that transforms the data density from $d'$ to $d$, as described above), and assuming that the representation $z$ satisfies $p(z|x) = p(z|f_{d,d'}(x))$ for all $x$ (Eq. 7 of the paper), then $z$ aligns both the marginal and the conditional distributions of the data for domains $d$ and $d'$.

Figure 3 (caption): Domain density transformation. If we know the function $f_{1,2}$ that transforms the data density from domain 1 to domain 2, we can learn a domain-invariant representation network $g_\theta(x)$ by enforcing it to be invariant under $f_{1,2}$, i.e., $g_\theta(x_1) = g_\theta(x_2)$ for any $x_2 = f_{1,2}(x_1)$.

Proof. Let $J_{f_{d',d}}(x')$ denote the Jacobian matrix of $f_{d',d}$ evaluated at $x'$. The density transformations give $p(x|d) = p(x'|d')\,|\det J_{f_{d',d}}(x')|^{-1}$ and $p(x|y,d) = p(x'|y,d')\,|\det J_{f_{d',d}}(x')|^{-1}$ for $x = f_{d',d}(x')$ (Eqs. 6 and 4 of the paper), and the invariance condition gives $p(z|f_{d',d}(x')) = p(z|x')$ (Eq. 7).

i) Marginal alignment: for all $z$,
$$p(z|d) = \int p(x|d)\,p(z|x)\,dx = \int p(f_{d',d}(x')|d)\,p(z|f_{d',d}(x'))\,\big|\det J_{f_{d',d}}(x')\big|\,dx'$$
(by the variable substitution $x = f_{d',d}(x')$ in the multiple integral)
$$= \int p(x'|d')\,\big|\det J_{f_{d',d}}(x')\big|^{-1}\,p(z|x')\,\big|\det J_{f_{d',d}}(x')\big|\,dx' = \int p(x'|d')\,p(z|x')\,dx' = p(z|d'). \quad (8)$$

ii) Conditional alignment: for all $z, y$,
$$p(z|y,d) = \int p(x|y,d)\,p(z|x)\,dx = \int p(f_{d',d}(x')|y,d)\,p(z|f_{d',d}(x'))\,\big|\det J_{f_{d',d}}(x')\big|\,dx' = \int p(x'|y,d')\,p(z|x')\,dx' = p(z|y,d'). \quad (9)$$

Note that
$$p(y|z,d) = \frac{p(y,z|d)}{p(z|d)} = \frac{p(y|d)\,p(z|y,d)}{p(z|d)}. \quad (10)$$
Since $p(y|d) = p(y) = p(y|d')$ (the label distribution is assumed invariant across domains), $p(z|y,d) = p(z|y,d')$ and $p(z|d) = p(z|d')$, we have
$$p(y|z,d) = \frac{p(y|d')\,p(z|y,d')}{p(z|d')} = p(y|z,d'). \quad (11)$$

This theorem indicates that, if we can find the functions $f$ that transform the data densities among the domains, we can learn a domain-invariant representation $z$ by encouraging the representation to be invariant under all of these transformations (the idea illustrated in Figure 3). A toy numerical check of the marginal-alignment claim follows.
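To make the marginal-alignment argument concrete, here is a minimal numerical sketch (not from the paper): two 1-D domains whose densities are mirror images of each other, the transformation $f_{1,2}(x) = -x$, and a representation $g(x) = |x|$ that is invariant under $f_{1,2}$. The domains, transformation, and representation are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Domain 1: x ~ N(+2, 1); Domain 2: x ~ N(-2, 1).
# f_{1,2}(x) = -x transforms the domain-1 density into the domain-2 density
# (|det J| = 1), and g(x) = |x| satisfies g(x) = g(f_{1,2}(x)) for every x.
x1 = rng.normal(loc=+2.0, scale=1.0, size=200_000)
x2 = rng.normal(loc=-2.0, scale=1.0, size=200_000)

f_12 = lambda x: -x          # domain density transformation
g = np.abs                   # representation invariant under f_12

# Invariance of g under the transformation (holds exactly by construction).
assert np.allclose(g(x1), g(f_12(x1)))

# Marginal alignment: the distribution of z = g(x) matches across domains.
z1, z2 = g(x1), g(x2)
quantiles = np.linspace(0.05, 0.95, 19)
print(np.max(np.abs(np.quantile(z1, quantiles) - np.quantile(z2, quantiles))))
# -> a small number (sampling noise only), i.e. p(z|d=1) is close to p(z|d=2).
```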
We therefore use the following learning objective to learn a domain-invariant representation $z = g_\theta(x)$:
$$\mathbb{E}_{d}\,\mathbb{E}_{p(x,y|d)}\Big[\, l(y, g_\theta(x)) + \mathbb{E}_{d'}\,\|g_\theta(x) - g_\theta(f_{d,d'}(x))\|_2^2 \,\Big], \quad (12)$$
where $l(y, g_\theta(x))$ is the prediction loss of a network that predicts $y$ given $z = g_\theta(x)$, and the second term enforces the invariance condition of Eq. 7. Assuming a set of $K$ source domains $D_s = \{d_1, d_2, ..., d_K\}$, the objective in Eq. 12 becomes
$$\mathbb{E}_{d,d' \in D_s,\, p(x,y|d)}\Big[\, l(y, g_\theta(x)) + \|g_\theta(x) - g_\theta(f_{d,d'}(x))\|_2^2 \,\Big] \quad (13)$$
(see the PyTorch sketch after this paragraph).

In practice, the functions $f$ that transform the data distributions between domains must themselves be learned, and one can use several generative modeling frameworks, e.g., normalizing flows (Grover et al., 2020) or GANs (Zhu et al., 2017; Choi et al., 2018; 2020). One advantage of normalizing flows is that the transformation is invertible by design of the neural network and the determinant of its Jacobian can be computed efficiently; however, since the Jacobian is no longer needed once the generative model has been trained, the paper uses GANs to inherit their richer network capacity, and in particular the StarGAN (Choi et al., 2018) model, which is designed for image domain transformations.
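A minimal PyTorch sketch of the Eq. 13 objective, assuming a representation network `g_theta`, a linear classifier head, and an already trained (and frozen) domain transformation `transform(x, d, d_prime)` are available; the names and the `lam` weight are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def dir_loss(g_theta, classifier, transform, x, y, d, d_prime, lam=1.0):
    """One training step of the Eq. 13 objective (sketch).

    g_theta    : representation network, x -> z
    classifier : linear head, z -> class logits
    transform  : frozen domain translator f_{d,d'}, (x, d, d') -> x'
    x, y       : a minibatch drawn from source domain d
    d, d_prime : source domain and a randomly chosen destination domain
    lam        : weight of the invariance penalty (assumed hyper-parameter;
                 Eq. 13 itself uses coefficient 1)
    """
    z = g_theta(x)
    cls_loss = F.cross_entropy(classifier(z), y)           # l(y, g_theta(x))

    with torch.no_grad():                                   # f is fixed after GAN training
        x_translated = transform(x, d, d_prime)             # f_{d,d'}(x)
    z_translated = g_theta(x_translated)

    inv_loss = ((z - z_translated) ** 2).sum(dim=1).mean()  # ||g(x) - g(f_{d,d'}(x))||_2^2
    return cls_loss + lam * inv_loss
```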
• 6. Domain Invariant Representation Learning with Domain Density Transformations

5. Domain Generalization with Generative Adversarial Networks (StarGAN; PR-152)

StarGAN recap. Earlier image-to-image translation models handle one pair of domains at a time: pix2pix learns the task in a supervised manner with conditional GANs, combining an adversarial loss with an L1 loss and therefore requiring paired samples, while CycleGAN and DiscoGAN remove the need for paired data and preserve key attributes between input and output through a cycle-consistency loss. All of these frameworks learn the relation between only two domains at a time, so a separate model must be trained for every pair of domains. StarGAN instead learns the mappings among multiple domains with a single generator G and a single discriminator D (a star topology over the domains, in contrast to cross-domain models built for every pair). The generator translates an input image x into an output image conditioned on a randomly sampled target domain label c, G(x, c) → y, and an auxiliary classifier on top of D allows one discriminator to control multiple domains.

(Figure: StarGAN overview. (a) D learns to distinguish real from fake images and to classify real images into their corresponding domains. (b) G takes as input the image together with the target domain label, which is spatially replicated and concatenated with the input image, and generates a fake image. (c) G tries to reconstruct the original image from the fake image given the original domain label. (d) G tries to generate images indistinguishable from real images and classifiable as the target domain by D.)

Adversarial loss. To make the generated images indistinguishable from real images,
$$\mathcal{L}_{adv} = \mathbb{E}_x[\log D_{src}(x)] + \mathbb{E}_{x,c}[\log(1 - D_{src}(G(x, c)))], \quad (1)$$
where $D_{src}(x)$ is the probability distribution over sources given by D; G tries to minimize this objective while D tries to maximize it.

Domain classification loss. For an input image x and a target domain label c, the goal is to translate x into an output image that is properly classified as the target domain c. An auxiliary classifier is added on top of D and the domain classification loss is imposed when optimizing both D and G, decomposed into a term on real images used to optimize D,
$$\mathcal{L}^r_{cls} = \mathbb{E}_{x,c'}[-\log D_{cls}(c'|x)], \quad (2)$$
where $c'$ is the original domain of the real image x (given by the training data), and a term on fake images used to optimize G,
$$\mathcal{L}^f_{cls} = \mathbb{E}_{x,c}[-\log D_{cls}(c|G(x, c))]. \quad (3)$$

Reconstruction loss. The adversarial and classification losses alone do not guarantee that the translated image preserves the content of its input while changing only the domain-related part, so a cycle-consistency loss is applied to the generator:
$$\mathcal{L}_{rec} = \mathbb{E}_{x,c,c'}[\|x - G(G(x, c), c')\|_1], \quad (4)$$
where the single generator is used twice, first to translate the original image into the target domain and then to reconstruct the original image from the translated one, with the L1 norm as the reconstruction loss.

Full objective. The objectives for D and G are
$$\mathcal{L}_D = -\mathcal{L}_{adv} + \lambda_{cls}\,\mathcal{L}^r_{cls}, \quad (5)$$
$$\mathcal{L}_G = \mathcal{L}_{adv} + \lambda_{cls}\,\mathcal{L}^f_{cls} + \lambda_{rec}\,\mathcal{L}_{rec}, \quad (6)$$
where $\lambda_{cls}$ and $\lambda_{rec}$ control the relative importance of the domain classification and reconstruction losses compared to the adversarial loss ($\lambda_{cls} = 1$ and $\lambda_{rec} = 10$ in the StarGAN experiments). A compact sketch of how these losses fit together is given after this subsection.

Training with multiple datasets. StarGAN can also jointly use multiple datasets with different label sets, in which case the label information is only partially known to each dataset: CelebA has attribute labels such as hair color and gender but no expression labels, and vice versa for RaFD. This is problematic because the complete label vector c is required when reconstructing the input image from the translated image (Eq. 4). A mask vector m, an n-dimensional one-hot vector with n the number of datasets, lets StarGAN ignore unspecified labels and focus on the labels explicitly provided by a particular dataset, through the unified label
$$\tilde{c} = [c_1, ..., c_n, m], \quad (7)$$
where [·] denotes concatenation and $c_i$ is the label vector of the i-th dataset (a binary vector for binary attributes or a one-hot vector for categorical attributes); the remaining unknown labels are set to zero vectors. Using $\tilde{c}$ as the generator input, G learns to ignore the unspecified (zero) labels, and the discriminator's auxiliary classifier is extended to produce probability distributions over the labels of all datasets while minimizing only the classification error of the known labels. By alternating between CelebA and RaFD, D learns the discriminative features of both datasets and G learns to control all the labels in both.
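A compact sketch of how the StarGAN objectives in Eqs. (1)-(6) fit together, assuming generator and discriminator modules with interfaces `G(x, c)` and `D(x) -> (src_logit, cls_logits)`; the non-saturating BCE form used below follows the vanilla adversarial loss written in Eq. (1), while actual StarGAN implementations often substitute a Wasserstein loss with gradient penalty.

```python
import torch
import torch.nn.functional as F

def stargan_losses(G, D, x_real, c_org, c_trg, lambda_cls=1.0, lambda_rec=10.0):
    """Sketch of the StarGAN objectives, Eqs. (1)-(6).

    G(x, c) -> fake image; D(x) -> (src_logit, cls_logits).
    c_org : original domain labels of x_real (class indices)
    c_trg : randomly sampled target domain labels (class indices)
    """
    # ----- discriminator side: Eq. (5) = -L_adv + lambda_cls * L_cls^r -----
    src_real, cls_real = D(x_real)
    x_fake = G(x_real, c_trg).detach()                       # no gradient into G here
    src_fake, _ = D(x_fake)
    loss_adv_d = F.binary_cross_entropy_with_logits(src_real, torch.ones_like(src_real)) \
               + F.binary_cross_entropy_with_logits(src_fake, torch.zeros_like(src_fake))
    loss_cls_real = F.cross_entropy(cls_real, c_org)         # Eq. (2)
    loss_d = loss_adv_d + lambda_cls * loss_cls_real         # Eq. (5)

    # ----- generator side: Eq. (6) = L_adv + lambda_cls * L_cls^f + lambda_rec * L_rec -----
    x_fake = G(x_real, c_trg)
    src_fake, cls_fake = D(x_fake)
    loss_adv_g = F.binary_cross_entropy_with_logits(src_fake, torch.ones_like(src_fake))
    loss_cls_fake = F.cross_entropy(cls_fake, c_trg)         # Eq. (3)
    x_rec = G(x_fake, c_org)                                  # translate back to the original domain
    loss_rec = (x_real - x_rec).abs().mean()                  # Eq. (4), L1 reconstruction
    loss_g = loss_adv_g + lambda_cls * loss_cls_fake + lambda_rec * loss_rec  # Eq. (6)

    return loss_d, loss_g
```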
Using StarGAN to learn the f's. The goal here is to learn a unified network G that transforms the data density among multiple domains. In particular, the network G(x, d, d') (i.e., G conditioned on the image x and the two domains d, d') transforms an image x from domain d to domain d'. Different from the original StarGAN model, which takes only the image x and the desired destination domain d' as its input, in this implementation both the original domain d and the desired destination domain d' are fed to the generator together with the original image x. The generator's goal is to fool a discriminator D into thinking that the transformed image belongs to the destination domain d'; in other words, the equilibrium state of StarGAN, in which G completely fools D, is reached when G successfully transforms the data density of the original domain into that of the destination domain. After training, G(·, d, d') is used as the function f_{d,d'}(·) described in the previous section, and the representation is learned via the objective in Eq. 13.

Three important loss functions of this StarGAN architecture are:
• The domain classification loss $\mathcal{L}_{cls}$, which encourages the generator G to generate images that correctly belong to the desired destination domain d'.
• The adversarial loss $\mathcal{L}_{adv}$, the classification loss of a discriminator D that tries to distinguish real images from the fake images generated by G. The equilibrium state of StarGAN is when G completely fools D, which means the distribution of the generated images (via G(x, d, d'), x ~ p(x|d)) becomes the distribution of the real images of the destination domain p(x'|d'). This is exactly the objective here, i.e., to learn a function that transforms the domains' densities.
• The reconstruction loss $\mathcal{L}_{rec} = \mathbb{E}_{x,d,d'}[\|x - G(x', d', d)\|_1]$, where $x' = G(x, d, d')$, which ensures that the transformations preserve the image's content. This also aligns with our interest, since we want G(·, d', d) to be the inverse of G(·, d, d'), which drives $\mathcal{L}_{rec}$ to zero.

One can further enforce the generator G to transform the data distribution within each class y (e.g., p(x|y, d) to p(x'|y, d') for all y) by sampling each minibatch with data from the same class y, so that the discriminator distinguishes the transformed images from real images of class y and domain d'. In practice this constraint can be relaxed, and the generator almost always transforms the image within its original class y. After training the StarGAN model, the generator G(·, d, d') is used as the f_{d,d'}(·) function and a domain-invariant representation is learned via the objective in Eq. 13. This implementation of the method is named DIR-GAN (domain-invariant representation learning with generative adversarial networks). A sketch of the (d, d')-conditioned generator input follows below.
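The paper does not spell out the conditioning mechanism beyond feeding both domains to G; the sketch below simply follows the StarGAN convention quoted above (the domain label is spatially replicated and concatenated with the input image), extended to two one-hot labels. The helper name and interface are assumptions for illustration.

```python
import torch

def make_generator_input(x, d, d_prime, num_domains):
    """Condition G on both the source and the destination domain (sketch).

    x       : batch of images, shape (B, C, H, W)
    d       : source domain indices, shape (B,)
    d_prime : destination domain indices, shape (B,)
    Each one-hot domain label is spatially replicated to H x W and
    concatenated with the image along the channel axis, as in StarGAN.
    """
    B, _, H, W = x.shape
    d_onehot = torch.nn.functional.one_hot(d, num_domains).float()
    d_prime_onehot = torch.nn.functional.one_hot(d_prime, num_domains).float()
    labels = torch.cat([d_onehot, d_prime_onehot], dim=1)            # (B, 2 * num_domains)
    label_maps = labels[:, :, None, None].expand(B, 2 * num_domains, H, W)
    return torch.cat([x, label_maps], dim=1)                          # (B, C + 2K, H, W)

# Usage (illustrative): x_translated = G(make_generator_input(x, d, d_prime, K))
```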
• 7. Domain Invariant Representation Learning with Domain Density Transformations

6. Experiments / Results

1) Datasets. The method is evaluated on three datasets commonly used in the domain generalization literature.

Rotated MNIST. In this dataset by Ghifary et al. (2015), 1,000 MNIST images (100 per class) (LeCun & Cortes, 2010) are chosen to form the first domain (denoted M0); rotations of 15°, 30°, 45°, 60° and 75° are then applied to create five additional domains, denoted M15, M30, M45, M60 and M75. The task is classification with ten classes (digits 0 to 9). A small sketch of this domain construction is given after the dataset descriptions.

PACS (Li et al., 2017) contains 9,991 images from four different domains: art painting, cartoon, photo, sketch. The task is classification with seven classes.

OfficeHome (Venkateswara et al., 2017) has 15,500 images of daily objects from four domains: art, clipart, product and real. There are 65 classes in this classification dataset.
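A minimal sketch of how the six Rotated MNIST domains described above can be constructed with torchvision; the subsampling rule and interpolation settings are assumptions, not the authors' released code.

```python
import torch
from torchvision import datasets
from torchvision.transforms import functional as TF

def build_rotated_mnist(root="./data", per_class=100, angles=(0, 15, 30, 45, 60, 75)):
    """Return {angle: (images, labels)} for the six Rotated MNIST domains (sketch)."""
    mnist = datasets.MNIST(root, train=True, download=True)
    images = mnist.data.float().unsqueeze(1) / 255.0       # (N, 1, 28, 28)
    labels = mnist.targets

    # Pick the first `per_class` images of each digit to form the base set M0.
    idx = torch.cat([torch.where(labels == c)[0][:per_class] for c in range(10)])
    base_x, base_y = images[idx], labels[idx]

    domains = {}
    for angle in angles:                                    # M0, M15, ..., M75
        rotated = torch.stack([TF.rotate(img, float(angle)) for img in base_x])
        domains[angle] = (rotated, base_y)
    return domains
```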
2) Results. Reported numbers are mean accuracy and standard deviation among 5 runs in the leave-one-domain-out setting.

Table 1. Rotated MNIST leave-one-domain-out experiment.
Model                               M0          M15         M30         M45         M60         M75         Average
HIR (Wang et al., 2020)             90.34       99.75       99.40       96.17       99.25       91.26       96.03
DIVA (Ilse et al., 2020)            93.5        99.3        99.1        99.2        99.3        93.0        97.2
DGER (Zhao et al., 2020)            90.09       99.24       99.27       99.31       99.45       90.81       96.36
DA (Ganin et al., 2016)             86.7        98.0        97.8        97.4        96.9        89.1        94.3
LG (Shankar et al., 2018)           89.7        97.8        98.0        97.1        96.6        92.1        95.3
HEX (Wang et al., 2019)             90.1        98.9        98.9        98.8        98.3        90.0        95.8
ADV (Wang et al., 2019)             89.9        98.6        98.8        98.7        98.6        90.4        95.2
DIR-GAN (ours)                      97.2(±0.3)  99.4(±0.1)  99.3(±0.1)  99.3(±0.1)  99.2(±0.1)  97.1(±0.3)  98.6

Table 2. PACS leave-one-domain-out experiment.
Model                               Backbone   Art Painting  Cartoon      Photo        Sketch       Average
DGER (Zhao et al., 2020)            Resnet18   80.70         76.40        96.65        71.77        81.38
JiGen (Carlucci et al., 2019)       Resnet18   79.42         75.25        96.03        71.35        79.14
MLDG (Li et al., 2018a)             Resnet18   79.50         77.30        94.30        71.50        80.70
MetaReg (Balaji et al., 2018)       Resnet18   83.70         77.20        95.50        70.40        81.70
CSD (Piratla et al., 2020)          Resnet18   78.90         75.80        94.10        76.70        81.40
DMG (Chattopadhyay et al., 2020)    Resnet18   76.90         80.38        93.35        75.21        81.46
DIR-GAN (ours)                      Resnet18   82.56(±0.4)   76.37(±0.3)  95.65(±0.5)  79.89(±0.2)  83.62

Table 3. OfficeHome leave-one-domain-out experiment.
Model                               Backbone   Art          ClipArt      Product      Real         Average
D-SAM (D'Innocente & Caputo, 2018)  Resnet18   58.03        44.37        69.22        71.45        60.77
JiGen (Carlucci et al., 2019)       Resnet18   53.04        47.51        71.47        72.79        61.20
DIR-GAN (ours)                      Resnet18   56.69(±0.4)  50.49(±0.2)  71.32(±0.4)  74.23(±0.5)  63.18
Experimental setting. For all datasets, a "leave-one-domain-out" experiment is performed: one domain is chosen as the target domain, the model is trained on all remaining domains and evaluated on the chosen domain. Following standard practice, 90% of the available data is used for training and 10% for validation, except for the Rotated MNIST experiment, which uses no validation set and simply reports the performance on the test domain after the last epoch.

For the Rotated MNIST dataset, the representation network $g_\theta$ consists of two 3x3 convolutional layers and a fully connected layer producing a 64-dimensional representation z, with a single linear layer mapping z to the ten output classes (the deterministic version of the network used by Ilse et al. (2020)). The network is trained for 500 epochs with the Adam optimizer (Kingma & Ba, 2014) using learning rate 0.001 and minibatch size 64. A sketch of this architecture follows below.
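A sketch of the Rotated MNIST representation network just described (two 3x3 convolutions, a fully connected layer to a 64-dimensional z, and a linear classification head); pooling, channel widths and activations are not specified in the text and are assumptions here.

```python
import torch.nn as nn

class MnistRepresentation(nn.Module):
    """Two 3x3 conv layers + FC -> 64-dim z, plus a linear head to 10 classes (sketch)."""
    def __init__(self, channels=32, z_dim=64, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(channels, channels, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
        )
        self.to_z = nn.LazyLinear(z_dim)           # g_theta(x): 64-dimensional representation z
        self.head = nn.Linear(z_dim, num_classes)  # single linear layer to the 10 digit classes

    def forward(self, x):
        z = self.to_z(self.features(x))
        return self.head(z), z
```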
The StarGAN (Choi et al., 2018) model implementation is taken from the authors' original source code with no significant modifications. For each set of source domains, we train the StarGAN model for 100,000 iterations with a minibatch of 16 images per iteration. The code for all of our experiments will be released for reproducibility; please also refer to the source code for any other architecture and implementation details.

From the StarGAN paper (Choi et al., 2018), the losses used by the transformation network are:

Reconstruction loss. Minimizing the adversarial and domain classification losses trains G to generate images that are realistic and classified to the correct target domain, but this alone does not guarantee that translated images preserve the content of the input while changing only the domain-related part. To alleviate this, a cycle-consistency loss is applied to the generator:

$\mathcal{L}_{rec} = \mathbb{E}_{x, c, c'}\left[\lVert x - G(G(x, c), c') \rVert_1\right]$,   (4)

where G takes in the translated image G(x, c) and the original domain label c' and tries to reconstruct the original image x, using the L1 norm as the reconstruction loss. A single generator is used twice: first to translate the original image into the target domain, and then to reconstruct the original image from the translated one.

Full objective. The objectives to optimize D and G are written, respectively, as

$\mathcal{L}_D = -\mathcal{L}_{adv} + \lambda_{cls}\,\mathcal{L}^{r}_{cls}$,   (5)
$\mathcal{L}_G = \mathcal{L}_{adv} + \lambda_{cls}\,\mathcal{L}^{f}_{cls} + \lambda_{rec}\,\mathcal{L}_{rec}$,   (6)

where λ_cls and λ_rec are hyper-parameters that control the relative importance of the domain classification and reconstruction losses compared to the adversarial loss; λ_cls = 1 and λ_rec = 10 are used in all experiments.
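As a rough illustration of Eqs. (4)-(6), the sketch below assumes a PyTorch StarGAN-style interface in which G(x, c) translates x to domain label c and D(x) returns a pair (real/fake logits, domain logits); the cross-entropy adversarial term is a simplification of the adversarial objective used in the actual StarGAN code, so this is a hedged sketch rather than a reference implementation.

```python
import torch
import torch.nn.functional as F

def stargan_objectives(G, D, x, c_org, c_trg, lambda_cls=1.0, lambda_rec=10.0):
    x_fake = G(x, c_trg)          # translate x into the target domain c_trg
    x_rec = G(x_fake, c_org)      # translate back using the original domain label c_org

    # Discriminator side: real vs. fake, plus domain classification on real images.
    real_src, real_cls = D(x)
    fake_src_d, _ = D(x_fake.detach())     # detach so D's loss does not update G
    d_adv = F.binary_cross_entropy_with_logits(real_src, torch.ones_like(real_src)) \
          + F.binary_cross_entropy_with_logits(fake_src_d, torch.zeros_like(fake_src_d))
    d_cls = F.binary_cross_entropy_with_logits(real_cls, c_org)       # L^r_cls
    loss_D = d_adv + lambda_cls * d_cls                               # cf. Eq. (5)

    # Generator side: fool D, hit the target domain, and reconstruct the input.
    fake_src_g, fake_cls_g = D(x_fake)
    g_adv = F.binary_cross_entropy_with_logits(fake_src_g, torch.ones_like(fake_src_g))
    g_cls = F.binary_cross_entropy_with_logits(fake_cls_g, c_trg)     # L^f_cls
    g_rec = torch.mean(torch.abs(x - x_rec))                          # Eq. (4), L1 reconstruction
    loss_G = g_adv + lambda_cls * g_cls + lambda_rec * g_rec          # cf. Eq. (6)
    return loss_D, loss_G
```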
Mask vector. To allow joint training on multiple datasets whose label sets only partially overlap, StarGAN uses a unified label

$\tilde{c} = [c_1, \ldots, c_n, m]$,   (7)

where [·] refers to concatenation, c_i represents a vector of the labels of the i-th dataset, and m is a mask vector indicating which dataset the known label comes from. The known label c_i can be represented either as a binary vector (for binary attributes) or as a one-hot vector (for categorical attributes); the remaining n − 1 unknown labels are simply set to zero. In the StarGAN experiments the CelebA and RaFD datasets are used, so n is two.

Training strategy. When training StarGAN with multiple datasets, the domain label c̃ defined in Eq. (7) is given as input to the generator. The generator thereby learns to ignore the unspecified (zero-vector) labels and to focus on the explicitly given label; its structure is exactly the same as when training with a single dataset, except for the dimension of the input label c̃. The auxiliary classifier of the discriminator is extended to produce probability distributions over the labels of all datasets, and the model is trained in a multi-task learning setting in which the discriminator minimizes only the classification error associated with the known label. For example, when training with CelebA images, the discriminator minimizes only the classification errors for CelebA attributes and not those for RaFD facial expressions; by alternating between the datasets, the discriminator learns discriminative features for both, and the generator learns to control all the labels in both datasets.
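A small sketch of how the unified label of Eq. (7) could be assembled, assuming two datasets; the per-dataset label widths (5 and 8) are placeholder values and not quantities taken from this document.

```python
import torch

def unified_label(known_label, dataset_idx, dims=(5, 8)):
    """Concatenate per-dataset label vectors (zeros for unknown datasets) and a mask."""
    parts = []
    for i, dim in enumerate(dims):
        if i == dataset_idx:
            parts.append(known_label)        # the explicitly given label c_i
        else:
            parts.append(torch.zeros(dim))   # unknown labels are set to zero
    mask = torch.zeros(len(dims))
    mask[dataset_idx] = 1.0                  # mask vector m: one-hot over datasets
    return torch.cat(parts + [mask])

# Example: a 5-dim binary attribute vector from the first dataset.
c_tilde = unified_label(torch.tensor([1., 0., 1., 0., 0.]), dataset_idx=0)
```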
• 8. Domain Invariant Representation Learning with Domain Density Transformations
3) Visualization of Representation

StarGAN (Choi et al., 2018) is a scalable approach for learning mappings among multiple domains with a single generator. Instead of learning a fixed translation (e.g., black-to-blond hair), the generator takes both an image and domain information as input and learns to flexibly translate the image into the corresponding domain, where the domain information is represented as a label (e.g., a binary or one-hot vector). During training, a target domain label is randomly sampled and the model is trained to translate the input image into that domain, so that at test time the image can be translated into any desired domain.

In particular, the network G(x, d, d') (i.e., G is conditioned on the image x and the two different domains d, d') transforms an image x from domain d to domain d'. Different from the original StarGAN model, which only takes the image x and the desired destination domain d' as its input, in our implementation we feed both the original domain d and the desired destination domain d' together with the image x to the generator G. The generator's goal is to fool a discriminator into thinking that the transformed image belongs to the destination domain d'. In other words, the equilibrium state, in which G completely fools D, is when G transforms the data density of the original domain into that of the destination domain. After training, we use G as the function f_{d,d'}(·) described in the previous section and perform the representation learning via the loss function in Eq. 13 (see the sketch below). Three important loss functions of the StarGAN are the adversarial loss, the domain classification loss L_cls, which encourages the generator G to generate images that correspond to the desired destination domain d', and the reconstruction loss shown above.

Figure 4. Visualization of the representation space. Each point indicates a representation z of an image x in the two-dimensional space, and its color indicates the label y. The two left figures are for our method DIR-GAN and the two right figures are for the naive model DeepAll.

Compared with the naive model, the representations learned by DIR-GAN are visibly better aligned across domains, both in the marginal distribution (the general distribution of the points) and in the conditional distribution (for example, the distributions of the blue points and the green points).
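The following is a minimal sketch of the representation-learning step described above, assuming a PyTorch setup: g_theta is the representation network, classifier the linear head, and G a pre-trained generator whose call signature G(x, d_src, d_dst) matches the conditioning described in the text; the weight lam of the invariance term is an assumed hyper-parameter, and the loss layout is only an approximation of the paper's Eq. 13.

```python
import torch
import torch.nn.functional as F

def dir_gan_step(g_theta, classifier, G, x, y, d_src, d_dst, lam=1.0):
    """One hedged sketch of the invariance objective: task loss plus a penalty that
    keeps g_theta(x) close to g_theta(f_{d,d'}(x))."""
    z = g_theta(x)
    logits = classifier(z)
    task_loss = F.cross_entropy(logits, y)   # l(y, g_theta(x)) with a linear head

    with torch.no_grad():
        x_translated = G(x, d_src, d_dst)    # f_{d,d'}(x): x moved to another source domain

    z_translated = g_theta(x_translated)

    # Encourage invariance under the domain transformation:
    # || g_theta(x) - g_theta(f_{d,d'}(x)) ||_2^2 (mean-squared version)
    invariance_loss = F.mse_loss(z_translated, z)

    return task_loss + lam * invariance_loss
```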
PACS and OfficeHome. To the best of our knowledge, domain invariant representation learning methods have not been applied widely and successfully for real-world computer vision datasets (e.g., PACS and OfficeHome) with very deep neural networks such as Resnet, so the only rel-