Learning From Noisy Labels With Deep Neural Networks: A Survey
Hwanjun Song, Minseok Kim, Dongmin Park, Yooju Shin, Jae-Gil Lee, IEEE TNNLS 2022
橋口凌大 (Tamaki Lab, Nagoya Institute of Technology)
2022/10/28
Overview
■ A survey of methods for training on data with noisy labels
• Models that are robust to label noise
■ Robust training with a focus on the classification task
• Robust architectures
• Robust regularization
• Robust loss function design
• Sample selection
Supervised Learning under Label Noise
■ Problem
• DNNs memorize noisy labels, and generalization performance degrades
■ Approach
• Establish training methods that mitigate the influence of noise
■ Types of noise covered by the survey
• Instance-dependent noise
• Instance-independent noise
(Figure: predictions on clean labels vs. wrong labels when training with cross entropy vs. early-learning regularization [Liu+, NeurIPS2020].)
Deep Learning and Noisy Labels
■ Deep learning is easily affected by label noise
■ Many training methods have been developed to achieve robustness
(Figures 7 and 8 of [Zhang+, ICLR2017]: accuracy (solid: train, dotted: validation) and critical sample ratios for MNIST and CIFAR10, with noise added on classification inputs or on classification labels; Algorithm 1: Langevin Adversarial Sample Search (LASS).)
Noise-Robust Training with Deep Learning
■ Categorization of robust training methods
Robust Training
• Robust Architecture (§III-A): Noise Adaptation Layer, Dedicated Architecture
• Robust Regularization (§III-B): Explicit Regularization, Implicit Regularization
• Robust Loss Function (§III-C): Robust Loss Design
• Loss Adjustment (§III-D): Loss Correction, Loss Reweighting, Label Refurbishment, Meta Learning
• Sample Selection (§III-E): Multi-network Learning, Multi-round Learning, Hybrid Approach
(Fig. 3 of the survey: a high-level research overview of robust deep learning for noisy labels; the research directions actively contributed to by the machine learning community are categorized into these five groups.)
Robust Architecture
■ Noise adaptation layer
• Learns the label transition pattern
The probability of observing a noisy label decomposes over the latent true label,
$$p(\tilde{y}=j \mid x) = \sum_{i=1}^{c} p(\tilde{y}=j, y=i \mid x) = \sum_{i=1}^{c} T_{ij}\, p(y=i \mid x), \quad \text{where } T_{ij} = p(\tilde{y}=j \mid y=i, x). \quad (3)$$
In light of this, the noise adaptation layer is intended to mimic the label transition behavior in learning a DNN. Let $p(y \mid x; \Theta)$ be the output of the base DNN with a softmax output layer. Then, following Eq. (3), the probability of an example $x$ being predicted as its noisy label $\tilde{y}$ is parameterized by
$$p(\tilde{y}=j \mid x; \Theta, \mathcal{W}) = \sum_{i=1}^{c} p(\tilde{y}=j, y=i \mid x; \Theta, \mathcal{W}) = \sum_{i=1}^{c} \underbrace{p(\tilde{y}=j \mid y=i; \mathcal{W})}_{\text{Noise Adaptation Layer}}\; \underbrace{p(y=i \mid x; \Theta)}_{\text{Base Model}}. \quad (4)$$
Here, the noisy label $\tilde{y}$ is generally assumed to be conditionally independent of the input $x$.
(Fig. 4. Noise modeling process using the noise adaptation layer: the input $x \in \mathcal{X}$ is fed to the base model $\Theta$ with a softmax layer, producing $p(y \mid x; \Theta)$; the noise adaptation layer $p(\tilde{y} \mid y; \mathcal{W})$ turns this into $p(\tilde{y} \mid x; \Theta, \mathcal{W})$, which is trained with the loss $\mathcal{L}(f(x; \Theta, \mathcal{W}), \tilde{y})$ against the noisy label $\tilde{y} \in \tilde{\mathcal{Y}}$ obtained by corrupting the true label $y \in \mathcal{Y}$.)
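A minimal NumPy sketch of the forward pass in Eq. (4), assuming the noise adaptation layer is parameterized as a row-stochastic softmax over a learnable score matrix W (names, shapes, and this parameterization are illustrative, not the survey's reference code):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def noisy_posterior(base_logits, W):
    """Eq. (4): p(y_tilde=j | x) = sum_i p(y_tilde=j | y=i; W) * p(y=i | x; Theta).

    base_logits: (batch, c) logits of the base model p(y | x; Theta)
    W:           (c, c) learnable scores; row-wise softmax gives T[i, j] = p(y_tilde=j | y=i)
    """
    p_clean = softmax(base_logits)   # (batch, c) base-model posterior over clean labels
    T = softmax(W, axis=1)           # (c, c) row-stochastic noise adaptation layer
    return p_clean @ T               # (batch, c) posterior over noisy labels

# toy usage: 2 examples, 3 classes; each output row sums to 1
rng = np.random.default_rng(0)
p_noisy = noisy_posterior(rng.normal(size=(2, 3)), rng.normal(size=(3, 3)))
print(p_noisy.sum(axis=1))
```

The loss in Fig. 4 would then be the cross entropy between this noisy-label posterior and the observed noisy label $\tilde{y}$.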
Robust Architecture
■ Dedicated architecture
• Improves the reliability of estimating the label transition probability
• Trains two networks
• Prediction of the label transition probability
• Prediction of the noise type
(Figure 5 [Xiao+, CVPR2015]: system diagram in which two CNNs predict the class label $p(y \mid x)$ and the noise type $p(z \mid x)$ (noise free / random / confusing); combined with a label noise model layer and data with clean labels, these yield the posteriors $p(y \mid \tilde{y}, x)$ and $p(z \mid \tilde{y}, x)$, e.g. correcting a noisy "Windbreaker" label toward "Down Coat".)
[Xiao+, CVPR2015]
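A framework-free sketch of the two-prediction design in Figure 5; the backbone and head sizes are placeholders (the original uses two CNNs trained jointly with a label noise model layer), so this only illustrates producing $p(y \mid x)$ and $p(z \mid x)$ from shared features:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class TwoHeadModel:
    """Toy stand-in for the dedicated architecture of [Xiao+, CVPR2015]:
    one head predicts the class label p(y|x), the other the noise type p(z|x)
    (noise free / random / confusing). Backbone and heads are plain linear maps."""

    def __init__(self, dim, n_classes, n_noise_types=3, seed=0):
        rng = np.random.default_rng(seed)
        self.W_feat = rng.normal(size=(dim, 64)) * 0.1    # shared backbone
        self.W_cls = rng.normal(size=(64, n_classes)) * 0.1
        self.W_noise = rng.normal(size=(64, n_noise_types)) * 0.1

    def forward(self, x):
        h = np.tanh(x @ self.W_feat)
        return softmax(h @ self.W_cls), softmax(h @ self.W_noise)  # p(y|x), p(z|x)

model = TwoHeadModel(dim=32, n_classes=14)
p_y, p_z = model.forward(np.random.default_rng(1).normal(size=(4, 32)))
print(p_y.shape, p_z.shape)   # (4, 14) (4, 3)
```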
Robust Regularization
■ Explicit regularization
• Modifies the training loss
(Fig. 1 [Jenni&Favaro, ECCV2018]: the bilevel training procedure samples training and validation mini-batches from the data set and runs stochastic gradient descent with mini-batch adaptive weights $\omega_i$; the weight $\omega_i \propto \sum_{j \in V_t} \nabla \ell_j(\theta_t)^{\top} \nabla \ell_i(\theta_t) \,/\, (\|\nabla \ell_i(\theta_t)\|^2 + \hat{\mu})$ is positive and large when the gradient of a training example agrees with the validation gradients.)
[Jenni&Favaro, ECCV2018]
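A sketch of the weighting rule suggested by Fig. 1, assuming per-example gradients are available as flat vectors; the clipping at zero is an assumption added here so that examples whose gradients disagree with the validation mini-batch (likely mislabeled) are simply ignored:

```python
import numpy as np

def minibatch_weights(train_grads, val_grads, mu=1e-3):
    """Gradient-agreement weights in the spirit of [Jenni&Favaro, ECCV2018]:
    w_i = sum_j <g_val_j, g_train_i> / (||g_train_i||^2 + mu),
    clipped at zero so that disagreeing examples get zero weight.

    train_grads: (n_train, d) per-example gradients on the training mini-batch
    val_grads:   (n_val, d)   per-example gradients on the validation mini-batch
    """
    val_sum = val_grads.sum(axis=0)           # (d,)
    agreement = train_grads @ val_sum         # (n_train,) dot products with validation direction
    norms = (train_grads ** 2).sum(axis=1) + mu
    return np.clip(agreement / norms, 0.0, None)

rng = np.random.default_rng(0)
w = minibatch_weights(rng.normal(size=(8, 10)), rng.normal(size=(4, 10)))
print(w.round(3))
```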
Robust Regularization
■ Implicit regularization
• Regularization that arises as a stochastic effect
• Data augmentation
• Mini-batch stochastic gradient descent
Label smoothing blends the noisy one-hot label with a uniform mixture over all possible labels,
$$\bar{y} = \langle \bar{y}(1), \bar{y}(2), \ldots, \bar{y}(c) \rangle, \quad \text{where } \bar{y}(i) = (1-\alpha)\cdot[\tilde{y}=i] + \alpha/c \ \text{ and } \ \alpha \in [0,1]. \quad (5)$$
Here, $[\cdot]$ is the Iverson bracket and $\alpha$ is the smoothing degree. In contrast, mixup [95] regularizes the DNN to favor simple linear behavior in between training examples. The mini-batch is constructed from virtual training examples, each formed by the linear interpolation of two noisy training examples $(x_i, \tilde{y}_i)$ and $(x_j, \tilde{y}_j)$ drawn at random from the noisy training data $\tilde{\mathcal{D}}$,
$$x_{\mathrm{mix}} = \lambda x_i + (1-\lambda)x_j \quad \text{and} \quad y_{\mathrm{mix}} = \lambda \tilde{y}_i + (1-\lambda)\tilde{y}_j, \quad (6)$$
where $\lambda \in [0,1]$ is the balance parameter between the two examples. Thus, mixup extends the training distribution with these interpolated examples.
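Small, self-contained implementations of Eqs. (5) and (6); in the original mixup paper the mixing coefficient λ is drawn from a Beta distribution, so the uniform draw used here is only a simplification:

```python
import numpy as np

def label_smoothing(noisy_label, n_classes, alpha=0.1):
    """Eq. (5): (1 - alpha) * one-hot(noisy_label) + alpha / c."""
    y = np.full(n_classes, alpha / n_classes)
    y[noisy_label] += 1.0 - alpha
    return y

def mixup(x_i, y_i, x_j, y_j, lam=None, rng=None):
    """Eq. (6): convex combination of two noisy training examples.
    lam would normally be sampled from a Beta distribution; uniform here for brevity."""
    rng = rng or np.random.default_rng()
    lam = rng.uniform() if lam is None else lam
    return lam * x_i + (1 - lam) * x_j, lam * y_i + (1 - lam) * y_j

y1 = label_smoothing(2, n_classes=5, alpha=0.1)
y2 = label_smoothing(4, n_classes=5, alpha=0.1)
x_mix, y_mix = mixup(np.ones(3), y1, np.zeros(3), y2, lam=0.7)
print(y1, y_mix)
```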
Robust Loss Function
■ Design losses that remain robust even on unseen clean data
• Prove loss functions for which Bayes-risk minimization becomes noise tolerant
[Manwani&Sastry, IEEE Transactions on Cybernetics 2013]
■ A loss is defined to be noise tolerant when the following condition holds
Technical Detail: Manwani and Sastry [48] initially proved, theoretically, a sufficient condition on the loss function such that risk minimization with that function becomes noise-tolerant for binary classification. Subsequently, the sufficient condition was extended to multi-class classification using deep learning [68]. Specifically, a loss function is defined to be noise-tolerant for $c$-class classification under symmetric noise if the noise rate satisfies $\tau < \frac{c-1}{c}$ and
$$\sum_{j=1}^{c} \ell\big(f(x; \Theta),\, y=j\big) = C, \quad \forall x \in \mathcal{X},\ \forall f, \quad (8)$$
where $C$ is a constant.
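A small numerical check of the condition in Eq. (8), not taken from the survey itself: the MAE loss between a one-hot target and a softmax output sums to the constant 2(c-1) over all classes regardless of the prediction, whereas categorical cross entropy does not, which is the sense in which [68] shows MAE to be noise-tolerant:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mae_loss(p, j):
    """MAE between the one-hot target e_j and the softmax output p."""
    e_j = np.zeros_like(p); e_j[j] = 1.0
    return np.abs(e_j - p).sum()

def cce_loss(p, j):
    """Categorical cross entropy for target class j."""
    return -np.log(p[j])

rng = np.random.default_rng(0)
for _ in range(3):
    p = softmax(rng.normal(size=4))        # c = 4 classes
    mae_sum = sum(mae_loss(p, j) for j in range(4))
    cce_sum = sum(cce_loss(p, j) for j in range(4))
    # MAE satisfies Eq. (8): the sum over classes is the constant 2(c-1) = 6
    print(f"sum MAE = {mae_sum:.4f}, sum CCE = {cce_sum:.4f}")
```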
Loss Adjustment
■ Adjust the loss of every training example before the parameter update
1. Loss correction, which estimates the noise transition matrix (a sketch follows below)
2. Loss reweighting, which assigns a different importance to each example
3. Adjusting the loss with a label generated from the noisy label and the predicted label
4. Meta learning, which automatically infers the optimal rule for loss adjustment
■ Advantage
• Loss adjustment can be applied to all of the training data
■ Drawback
• When the number of classes or of noisy labels is large, errors from false correction accumulate [Han+, NeurIPS2018]
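A minimal sketch of forward loss correction (item 1 above), assuming the noise transition matrix T has already been estimated; here a hand-built symmetric-noise T is used for illustration, and the loss becomes the cross entropy of the noisy label against the T-mixed prediction:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward_corrected_ce(logits, noisy_label, T):
    """Forward correction: -log( sum_i T[i, noisy_label] * p(y=i|x) ),
    where T[i, j] approximates p(y_tilde=j | y=i)."""
    p_clean = softmax(logits)
    p_noisy = p_clean @ T
    return -np.log(p_noisy[noisy_label] + 1e-12)

# toy example: 3 classes with 20% symmetric label noise in T
c = 3
T = np.full((c, c), 0.2 / (c - 1)) + np.eye(c) * (0.8 - 0.2 / (c - 1))
print(T.sum(axis=1))                          # rows sum to 1
loss = forward_corrected_ce(np.array([2.0, 0.5, -1.0]), noisy_label=1, T=T)
print(round(loss, 4))
```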
Loss Adjustment
■ Loss reweighting
■ Label refurbishment
• α is the label confidence
Loss reweighting assigns smaller weights to examples with possibly false labels and greater weights to those with true labels. Accordingly, the reweighted loss on the mini-batch $\mathcal{B}_t$ is used to update the DNN,
$$\Theta_{t+1} = \Theta_t - \eta \nabla \Big( \frac{1}{|\mathcal{B}_t|} \sum_{(x,\tilde{y}) \in \mathcal{B}_t} \underbrace{w(x,\tilde{y})\, \ell\big(f(x; \Theta_t), \tilde{y}\big)}_{\text{Reweighted Loss}} \Big), \quad (11)$$
where $w(x,\tilde{y})$ is the weight of an example $x$ with its noisy label $\tilde{y}$. Hence, the examples with smaller weights do not significantly affect the DNN learning.
Technical Detail: In importance reweighting [108], the ratio of two joint data distributions $w(x,\tilde{y}) = P_{\mathcal{D}}(x,\tilde{y}) / P_{\tilde{\mathcal{D}}}(x,\tilde{y})$ determines the contribution of the loss of each noisy example; an approximate solution was developed to estimate the ratio because the two distributions are difficult to determine from noisy data. Meanwhile, active bias [109] emphasizes uncertain examples with inconsistent label predictions by assigning their prediction variances as the weights for training. DualGraph [118] employs graph neural networks and reweights the examples.
A practical drawback of loss reweighting is the need to specify the weighting function and its additional hyperparameters, which is fairly hard to do in practice because the appropriate weighting scheme varies significantly with the noise type and the training data.
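A toy rendering of the reweighted mini-batch loss inside Eq. (11); the weighting function used here is only a placeholder standing in for the schemes above (importance ratio, prediction variance, graph-based weights):

```python
import numpy as np

def reweighted_batch_loss(per_example_losses, weights):
    """Eq. (11): mean of w(x, y_tilde) * loss(f(x), y_tilde) over the mini-batch.
    The gradient of this scalar is what updates Theta."""
    return float(np.mean(weights * per_example_losses))

losses = np.array([0.3, 2.5, 0.4, 3.1])     # large losses often indicate noisy labels
weights = 1.0 / (1.0 + losses)              # placeholder weighting: down-weight large-loss examples
print(reweighted_batch_loss(losses, weights))
```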
3) Label Refurbishment: Refurbishing a noisy label $\tilde{y}$ effectively prevents overfitting to false labels. Let $\hat{y}$ be the current prediction of the DNN $f(x; \Theta)$. The refurbished label $y^{\mathrm{refurb}}$ can then be obtained by a convex combination of the noisy label $\tilde{y}$ and the DNN prediction $\hat{y}$,
$$y^{\mathrm{refurb}} = \alpha \tilde{y} + (1-\alpha)\hat{y}, \quad (12)$$
where $\alpha \in [0,1]$ is the label confidence of $\tilde{y}$. To mitigate the damage of incorrect labeling, this approach backpropagates the loss for the refurbished label instead of the noisy one, thereby yielding substantial robustness to noisy labels.
Technical Detail: Bootstrapping [69] is the first method that proposes the concept of label refurbishment to update the target label of training examples. It develops a more coherent network that improves its ability to evaluate the consistency of noisy labels, with the label confidence $\alpha$ obtained via cross-validation. Dynamic bootstrapping [110] dynamically adjusts the confidence $\alpha$ of individual training examples; the confidence is obtained by fitting a two-component, one-dimensional beta mixture model to the loss distribution of all training examples. Self-adaptive training [119] applies an exponential moving average to alleviate the instability of using the instantaneous prediction of the current DNN,
$$y^{\mathrm{refurb}}_{t+1} = \alpha y^{\mathrm{refurb}}_t + (1-\alpha)\hat{y}, \quad \text{where } y^{\mathrm{refurb}}_0 = \tilde{y}. \quad (13)$$
D2L [111] trains a DNN with a dimensionality-driven learning strategy to avoid overfitting to false labels: a simple measure called local intrinsic dimensionality [120] is adopted to evaluate the confidence $\alpha$.
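A small sketch of Eqs. (12) and (13); α is the label confidence, and the fixed prediction vector is only for illustration (in practice $\hat{y}$ changes at every epoch):

```python
import numpy as np

def refurbish(noisy_onehot, prediction, alpha):
    """Eq. (12): convex combination of the noisy label and the DNN prediction."""
    return alpha * noisy_onehot + (1 - alpha) * prediction

def ema_refurbish(prev_refurb, prediction, alpha):
    """Eq. (13): exponential moving average used by self-adaptive training [119]."""
    return alpha * prev_refurb + (1 - alpha) * prediction

noisy = np.array([0.0, 1.0, 0.0])            # noisy one-hot label
pred = np.array([0.7, 0.2, 0.1])             # DNN prediction, held fixed for illustration
print(refurbish(noisy, pred, alpha=0.8))     # Eq. (12)

y_refurb = noisy.copy()                      # Eq. (13): y_refurb_0 = y_tilde
for step in range(3):
    y_refurb = ema_refurbish(y_refurb, pred, alpha=0.9)
    print(step, y_refurb.round(3))
```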
Sample Selection
■ Select only the examples whose labels are true
• Versatility improves because a pseudo-clean dataset can be constructed [Song+, ICML2019]
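Sample selection is often instantiated with the small-loss trick (keeping the examples whose current loss is small, as in co-teaching-style methods); the sketch below illustrates that idea and is not a specific method from this deck:

```python
import numpy as np

def select_small_loss(per_example_losses, keep_ratio):
    """Keep the indices of the `keep_ratio` fraction of examples with the smallest loss,
    treating them as a pseudo-clean subset for the next update."""
    n_keep = max(1, int(len(per_example_losses) * keep_ratio))
    return np.argsort(per_example_losses)[:n_keep]

losses = np.array([0.2, 3.1, 0.4, 2.8, 0.1, 0.3])
clean_idx = select_small_loss(losses, keep_ratio=0.5)
print(clean_idx)        # -> [4 0 5], treated as clean examples
```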
Comparison
TABLE III: Comparison of robust deep learning categories for overcoming noisy labels, across six properties: P1 Flexibility, P2 No Pre-train, P3 Full Exploration, P4 No Supervision, P5 Heavy Noise, P6 Complex Noise.
Rows: Robust Architecture (§III-A): Noise Adaptation Layer, Dedicated Architecture; Robust Regularization (§III-B): Implicit Regularization, Explicit Regularization; Robust Loss Function (§III-C); Loss Adjustment (§III-D): Loss Correction, Loss Reweighting, Label Refurbishment, Meta Learning; Sample Selection (§III-E): Multi-Network Learning, Multi-Round Learning, Hybrid Approach.
In Table III, methods are marked according to whether they deal only with instance-independent noise or with both instance-independent and instance-dependent noise. The noise transition matrix can be estimated from anchor points: letting $\mathcal{A}_i$ be the set of anchor points with label $i$, each element $T_{ij}$ is estimated by
$$\hat{T}_{ij} = \frac{1}{|\mathcal{A}_i|} \sum_{x \in \mathcal{A}_i} \sum_{k=1}^{c} p(\tilde{y}=j \mid y=k)\, p(y=k \mid x).$$
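A sketch of this anchor-point estimate (my own NumPy illustration): in practice the noisy posterior $p(\tilde{y} \mid x)$ comes from a model trained on the noisy data, and for an anchor point of class $i$ we have $p(y=i \mid x) \approx 1$, so averaging the predicted noisy posterior over $\mathcal{A}_i$ approximates row $i$ of $T$:

```python
import numpy as np

def estimate_transition_matrix(noisy_posteriors, anchor_indices, n_classes):
    """Anchor-point estimate of T: T_hat[i, j] is the mean, over anchor points of class i,
    of the model's predicted noisy-label posterior p(y_tilde = j | x).

    noisy_posteriors: (n, c) predicted p(y_tilde | x) for all examples
    anchor_indices:   dict {class i: array of example indices that are anchors for i}
    """
    T_hat = np.zeros((n_classes, n_classes))
    for i in range(n_classes):
        T_hat[i] = noisy_posteriors[anchor_indices[i]].mean(axis=0)
    return T_hat

# toy usage with 3 classes and 2 anchor points per class
rng = np.random.default_rng(0)
posteriors = rng.dirichlet(np.ones(3), size=6)
anchors = {0: np.array([0, 1]), 1: np.array([2, 3]), 2: np.array([4, 5])}
print(estimate_transition_matrix(posteriors, anchors, 3).round(3))
```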
Summary
■ A survey of methods for learning from noisy labels
• Robust architectures
• Robust regularization
• Robust loss function design
• Sample selection
■ There is no general-purpose framework
• Methods must be designed appropriately for the problem at hand