Learning With Neighbor Consistency for Noisy Labels
Ahmet Iscen, Jack Valmadre, Anurag Arnab, Cordelia Schmid, CVPR 2022
橋口凌大 (Nagoya Institute of Technology)
2023/6/30
Overview
■ Proposes a method for learning from noisy labels
■ Proposes Neighbor Consistency Regularization (NCR)
  • Encourages examples with similar feature representations to make similar predictions
■ Achieves higher accuracy than the baselines
■ Approaches to learning from noisy data [Song+, IEEE TNNLS 2022]:
  • Robust architectures
  • Robust regularization
  • Robust loss function design
  • Sample selection
Prior Work
■ Correction using model predictions [Han+, NeurIPS2018] [Li+, ICLR2020]
■ Label propagation algorithms [Iscen+, CVPR2019]
[Figure from Co-teaching, Han+ NeurIPS 2018 (panels: M-Net, Decoupling, Co-teaching; mini-batches 1–3 flowing through networks A and B).]
Figure 1: Comparison of error flow among MentorNet (M-Net) [17], Decoupling [26] and Co-teaching. Assume that the error flow comes from the biased selection of training instances, and error flow from network A or B is denoted by red arrows or blue arrows, respectively. Left panel: M-Net maintains only one network (A). Middle panel: Decoupling maintains two networks (A & B). The parameters of the two networks are updated when their predictions disagree (!=). Right panel: Co-teaching maintains two networks (A & B) simultaneously. In each mini-batch of data, each network samples its small-loss instances as the useful knowledge, and teaches such useful instances to its peer network for further training. Thus, the error flow in Co-teaching displays a zigzag shape.
[Figure from DivideMix, Li+ ICLR 2020 (two networks A and B; GMM-based co-divide at each epoch; MixMatch at each mini-batch).]
Figure 1: DivideMix trains two networks (A and B) simultaneously. At each epoch, a network models its per-sample loss distribution with a GMM to divide the dataset into a labeled set (mostly clean) and an unlabeled set (mostly noisy), which is then used as training data for the other network (i.e. co-divide). At each mini-batch, a network performs semi-supervised training using an improved MixMatch method. We perform label co-refinement on the labeled samples and label co-guessing on the unlabeled samples.
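To make the co-divide step concrete, here is a minimal sketch of fitting a two-component GMM to per-sample losses and splitting the dataset, assuming scikit-learn is available; the function name, the loss normalization and the 0.5 posterior threshold are illustrative choices, not details taken from the slides.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def co_divide(per_sample_loss, threshold=0.5):
    """Split a dataset into a (mostly clean) and a (mostly noisy) subset by
    fitting a two-component GMM to per-sample training losses, in the spirit
    of DivideMix's co-divide step. Returns boolean masks."""
    losses = np.asarray(per_sample_loss, dtype=np.float64).reshape(-1, 1)
    # Normalize losses to [0, 1] so the two components are comparable across runs.
    losses = (losses - losses.min()) / (losses.max() - losses.min() + 1e-8)
    gmm = GaussianMixture(n_components=2, max_iter=100, reg_covar=5e-4)
    gmm.fit(losses)
    clean_component = np.argmin(gmm.means_)           # low-loss component = clean
    p_clean = gmm.predict_proba(losses)[:, clean_component]
    labeled_mask = p_clean > threshold                # mostly clean examples
    return labeled_mask, ~labeled_mask                # (labeled, unlabeled)
```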
Excerpt (DivideMix, Sec. 2.2 Semi-Supervised Learning): SSL methods aim to improve the model's performance by leveraging unlabeled data. Current state-of-the-art SSL methods mostly involve adding an additional loss term on unlabeled data to regularize training. The regularization falls into two classes: consistency regularization (Laine & Aila, 2017; Tarvainen & Valpola, 2017; Miyato et al., 2019) enforces the model to produce consistent predictions on augmented input data; entropy minimization (Grandvalet & Bengio, 2004; Lee, 2013) encourages the model to give high-confidence predictions on unlabeled data. Recently, Berthelot et al. (2019) propose MixMatch, which unifies consistency regularization, entropy minimization, and the MixUp (Zhang et al., 2018) regularization into one framework.
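As a minimal illustration of the consistency-regularization idea in this excerpt, the sketch below penalizes divergence between predictions on an image and on a label-preserving augmentation of it; `net` and `augment` are placeholder callables, and KL against a stop-gradient target is one common instantiation among several.

```python
import torch
import torch.nn.functional as F

def consistency_loss(net, x, augment):
    """Consistency regularization: the model should predict similarly for an
    input and its augmented view. `net` maps images to logits; `augment` is
    any label-preserving transform."""
    with torch.no_grad():
        p_clean = F.softmax(net(x), dim=1)            # target: clean-view prediction
    log_p_aug = F.log_softmax(net(augment(x)), dim=1)
    return F.kl_div(log_p_aug, p_clean, reduction="batchmean")
```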
[Figure from Iscen+, CVPR 2019: the transductive label-propagation pipeline. Network f_θ = feature extractor θ + FC + softmax. Phase 1: train for T epochs with L_s(X_L, Y_L; θ) on labeled examples only. Phase 2, iterated T' times: extract descriptors V, compute the affinity A (9), set W ← A + A^T, normalize W ← D^{-1/2} W D^{-1/2}, solve (10) by label propagation to obtain pseudo-labels (marker size proportional to certainty ω_i), then train for 1 epoch with L_w(X, Y_L, Ŷ_U; θ) on all examples.]
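A rough sketch of the propagation step the figure summarizes, assuming a precomputed affinity matrix A and a label matrix Y with zero rows for unlabeled examples; the closed-form solve of (I − αS)Z = Y follows the classical Zhou et al.-style formulation this pipeline builds on, and α = 0.99 is only a typical default.

```python
import numpy as np

def propagate_labels(A, Y, alpha=0.99):
    """Transductive label propagation: symmetrize and normalize the affinity
    matrix, then diffuse labels by solving (I - alpha * S) Z = Y.
    A: (n, n) nonnegative affinities; Y: (n, c) one-hot rows for labeled
    examples, zero rows for unlabeled ones."""
    W = A + A.T                                      # W <- A + A^T (symmetrize)
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d + 1e-12))
    S = D_inv_sqrt @ W @ D_inv_sqrt                  # W <- D^{-1/2} W D^{-1/2}
    n = W.shape[0]
    Z = np.linalg.solve(np.eye(n) - alpha * S, Y)    # pseudo-label scores
    return Z
```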
Model Overview
■ Feed an image into the feature extractor g
■ Classify with the classifier h
■ Force each example's prediction toward those of its k nearest neighbors in the feature space of g(x) (NCR); see the sketch below the figure
[Figure 1 from the paper: a minibatch (e.g. a hotdog image) passes through the backbone and classifier; the loss function combines Cross Entropy with Neighbor Consistency Regularization computed over the neighborhood in feature space.]
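The backbone/classifier split in the figure might look as follows in PyTorch; this is a sketch assuming a torchvision ResNet-18 (one of the backbones used in the experiments) and an illustrative class count, not the authors' code.

```python
import torch.nn as nn
import torchvision.models as models

class NCRNet(nn.Module):
    """Backbone/classifier split assumed by the slides: g maps images to
    features v, h maps features to logits z."""
    def __init__(self, num_classes=100):
        super().__init__()
        resnet = models.resnet18()
        # Everything up to (and including) global average pooling is the feature extractor g.
        self.g = nn.Sequential(*list(resnet.children())[:-1], nn.Flatten())
        self.h = nn.Linear(resnet.fc.in_features, num_classes)  # classifier h

    def forward(self, x):
        v = self.g(x)   # features used for k-NN similarity
        z = self.h(v)   # class logits
        return v, z
```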
NCR
■ NCR is defined to prevent memorization of label noise [Liu+, NeurIPS2020]
  • Weight the prediction distributions of the k nearest neighbors by similarity
  • Train each example's own distribution to match this weighted combination
■ Computing the similarity
■ Training the network
Excerpt (Sec. 4 of the paper): h_W(v_i) and h_W(v_j) should behave similarly if s_{i,j} is high, regardless of their labels y_i and y_j. This prevents the network from over-fitting to an incorrect mapping between an example x_i and a label y_i, if either (or both) y_i and y_j are noisy.

To enforce NCR, we design an objective function which minimizes the distance between logits z_i and z_j if the corresponding feature representations v_i and v_j are similar:

\mathcal{L}_{\mathrm{NCR}}(X, Y; \theta, W) := \frac{1}{m} \sum_{i=1}^{m} D_{\mathrm{KL}}\!\left( \sigma(z_i/T) \,\Big\|\, \sum_{j \in \mathrm{NN}_k(v_i)} \frac{s_{i,j}}{\sum_k s_{i,k}} \, \sigma(z_j/T) \right), \quad (3)

where D_KL is the KL-divergence loss to measure the difference between two distributions, T is the temperature, and NN_k(v_i) denotes the set of k nearest neighbors of i in the feature space. We set T = 2 throughout our experiments. We normalize the similarity values so that the second term of the KL-divergence loss remains a probability distribution. We set the self-similarity s_{i,i} = 0 so that it does not dominate the normalized similarity. Gradients are back-propagated to all inputs.
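A minimal PyTorch sketch of (3), with neighbors found inside the mini-batch; the function name is mine, and details such as the epsilon and the clamp for non-ReLU backbones are assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def ncr_loss(v, z, k=10, T=2.0, eps=1e-8):
    """NCR objective (3): pull sigma(z_i/T) toward the similarity-weighted
    mixture of its k nearest neighbors' sigma(z_j/T).
    v: (m, d) features, z: (m, c) logits; assumes k < m. k = 10 and T = 2
    follow the slides."""
    m = v.shape[0]
    v_norm = F.normalize(v, dim=1)
    sim = v_norm @ v_norm.t()                               # cosine similarities s_ij
    sim = sim.clamp(min=0)                                  # with ReLU features, s_ij is already in [0, 1]
    eye = torch.eye(m, dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(eye, 0.0)                         # s_ii = 0 so self-similarity cannot dominate
    topk_sim, topk_idx = sim.topk(k, dim=1)                 # NN_k(v_i) within the batch
    w = topk_sim / (topk_sim.sum(dim=1, keepdim=True) + eps)  # normalize to a probability distribution
    p = F.softmax(z / T, dim=1)                             # sigma(z_j / T)
    neighbor_p = (w.unsqueeze(2) * p[topk_idx]).sum(dim=1)  # weighted neighbor mixture
    log_q = F.log_softmax(z / T, dim=1)                     # log sigma(z_i / T)
    # KL( sigma(z_i/T) || mixture ); nothing is detached, so gradients reach all inputs.
    kl = (log_q.exp() * (log_q - (neighbor_p + eps).log())).sum(dim=1)
    return kl.mean()
```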
Figure 1. To address the problem of noisy labels in the training set, we propose Neighbor Consistency Regularization (NCR), which encourages examples with similar feature representations to have similar outputs, thus mitigating the impact of training with incorrect labels.
Excerpt (related work): MixMatch [3] proposed variants of mixup for semi-supervised learning in which predictions replace the labels for unsupervised examples. Xie et al. [44] introduced Unsupervised Data Augmentation for semi-supervised image classification, where a model is encouraged to be robust to label-preserving transformations even when labels are not available by minimizing the divergence between predictions for transformed and non-transformed images. Most relevant to our work, [10] used prediction consistency with respect to image transformations for the express purpose of learning with noisy labels. While these forms of consistency are effective regularizers, neighbor consistency offers the ability to transfer supervision directly to mislabelled examples.
3. Preliminaries
Excerpt: g_θ and h_W correspond to the feature extractor and classifier. The feature extractor maps an image x_i to a d-dimensional feature vector v_i := g_θ(x_i) ∈ R^d. The classifier maps this vector to class scores z_i := h_W(v_i). Typically, the network parameters are learned by minimizing a loss function for supervised classification:

\mathcal{L}_S(X, Y; \theta, W) := \frac{1}{m} \sum_{i=1}^{m} \ell\big(\sigma(z_i), y_i\big), \quad (1)

where X and Y correspond to the sets of images and labels in the mini-batch, m = |X| = |Y| denotes the size of the mini-batch, σ is the softmax function, and ℓ(q, p) is the cross-entropy loss. In short: g_θ(x_i) = v_i and h_W(v_i) = z_i.
To overcome this issue, we propose Neighbor Consistency Regularization (NCR). Our main assumption is that the over-fitting occurs less dramatically before the classifier h_W. This is supported by MOIT [30], which shows that feature representations are robust enough to discriminate between noisy and clean examples when training a network. With that assumption, we can design a smoothness constraint similar to label propagation (2) when training the network. The overview of our method is shown in Figure 1.

Let us define the similarity between two examples by the cosine similarity of their feature representations, i.e. s_{i,j} = cos(v_i, v_j) = v_i^T v_j / (‖v_i‖ ‖v_j‖). Note that the feature representations contain non-negative values when obtained after a ReLU non-linearity, and therefore the cosine similarity is bounded in the interval [0, 1]. Our goal is to enforce neighbor consistency regularization by leveraging the structure of the feature space produced by g_θ to enhance the classifier h_W.
The objective (3) ensures that the output of x_i will be consistent with the outputs of its neighbors regardless of its potentially noisy label y_i. We combine it with the supervised classification loss function (1) to obtain the final objective minimized during training:

\mathcal{L}(X, Y; \theta, W) := (1 - \alpha) \cdot \mathcal{L}_S(X, Y; \theta, W) + \alpha \cdot \mathcal{L}_{\mathrm{NCR}}(X, Y; \theta, W), \quad (4)

where the hyper-parameter α ∈ [0, 1] controls the impact of each loss term. Similar to label propagation, the final loss objective (4) has two terms: the first is the classification loss L_S, analogous to the fitting constraint, while the second encourages the outputs of nearby points in the feature graph to be similar.
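Continuing the sketch above, the combined objective (4) is a convex mix of cross-entropy and the NCR term; α = 0.5 here is only a placeholder (Table 5 in the supplementary lists the per-dataset values).

```python
import torch.nn.functional as F

def total_loss(v, z, y, alpha=0.5, k=10, T=2.0):
    """Final objective (4): (1 - alpha) * L_S + alpha * L_NCR,
    reusing ncr_loss from the sketch above."""
    l_s = F.cross_entropy(z, y)          # supervised term L_S, eq. (1)
    l_ncr = ncr_loss(v, z, k=k, T=T)     # neighbor-consistency term, eq. (3)
    return (1 - alpha) * l_s + alpha * l_ncr
```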
Experimental Setup
■ Datasets
  • CIFAR-10, -100 [Krizhevsky, 2009]
  • mini-ImageNet-{Blue, Red} [Jiang+, PMLR2020]
  • mini-WebVision [Li+, ICLR2020]
  • WebVision [Li+, arXiv2017]
  • Clothing1M [Xiao+, CVPR2015]
■ Backbone
  • ResNet-18, ResNet-50
Table 4. List of network hyperparameters used in our experiments.

              CIFAR-{10,100}   mini-{Red,Blue}   mini-WebVision   Clothing1M
Opt.          SGD              SGD               SGD              SGD
Momentum      0.9              0.9               0.9              0.9
Batch         256              128               256              128
LR            0.1              0.1               0.1              0.002
LR Sch.       cosine decay with linear warmup (all)
Warmup        5                5                 5                5
Epochs        250              130               130              80
Weight Dec.   5e-4             5e-4              1e-3             1e-3
Arch.         ResNet-18        ResNet-18         ResNet-50        ResNet-50
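The Table 4 recipe for CIFAR could be wired up roughly as below; stepping the schedule per epoch via LambdaLR is my assumption, since the table does not say whether warmup is applied per epoch or per iteration.

```python
import math
import torch

def make_optimizer(model, lr=0.1, weight_decay=5e-4, warmup_epochs=5, total_epochs=250):
    """SGD with momentum 0.9 and cosine decay with linear warmup
    (Table 4, CIFAR column). Returns the optimizer and a per-epoch scheduler."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9,
                          weight_decay=weight_decay)

    def lr_lambda(epoch):
        if epoch < warmup_epochs:                           # linear warmup
            return (epoch + 1) / warmup_epochs
        t = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * t))          # cosine decay

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched
```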
Effect of Hyperparameters
■ Effect of the NCR weight α
  • Larger α is effective
■ Effect of the number of neighbors k
  • k = 10 gives high accuracy
■ Effect of the initial epoch e (epochs trained before enabling NCR)
  • e = 0 shows better performance
[Figure 2. Ablation study. Impact of hyperparameters α, k and e, evaluated on the CIFAR-10 validation set with ResNet-18. Three panels plot accuracy (%) against α (0.2–0.8), k (1–100) and e (0–150), with curves for 0%, 20% and 40% noise.]
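How the initial-epoch hyperparameter e interacts with training, sketched under the assumptions of the earlier snippets (a model returning (v, z), plus total_loss and make_optimizer from above): the NCR term is simply disabled for the first e epochs.

```python
def train(model, loader, total_epochs=250, e=0, alpha=0.7, k=10):
    """Illustrative loop for the warm-up epoch e: supervised loss only until
    epoch e, then the full objective (4). e = 0 (NCR from the start) is what
    the ablation found best on CIFAR-10."""
    opt, sched = make_optimizer(model, total_epochs=total_epochs)
    for epoch in range(total_epochs):
        a = alpha if epoch >= e else 0.0   # NCR weight is zero during warm-up
        for x, y in loader:
            v, z = model(x)
            loss = total_loss(v, z, y, alpha=a, k=k)
            opt.zero_grad()
            loss.backward()
            opt.step()
        sched.step()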
Excerpt (Sec. 4): When training with noisy labels, the network is prone to memorize the mapping from x_i to a noisy label in the training data [26]. This behavior typically results in poor classification performance in a clean evaluation; the network does not generalize well.

Excerpt (Sec. 3, on the limitations of label propagation): One of the main limitations of label propagation is its transductive property. In transductive learning, the goal is to classify seen unlabeled examples. This is different to inductive learning, which learns a generic classifier to classify any unseen data. To apply label propagation on new test examples, a new graph W needs to be constructed each time a test example is seen. This makes it inefficient in practice. Another requirement for label propagation is that the feature space needs to be fixed to compute the affinity matrix W. This requires the feature extractor to be learned beforehand, potentially from the noisy data. Existing work [15] has tried to overcome this issue by alternating between optimizing the feature space and performing label propagation. However, this does not directly enforce smoothness, as the optimization of the two components is done separately.

Our goal is to overcome the limitations of label propagation by 1) adapting it to an inductive setting and 2) applying the smoothness constraint directly during optimization. In Section 4, we propose a simple and efficient approach which generalizes label propagation by enforcing smoothness in the form of a regularizer. As a result, we avoid constructing an explicit graph to propagate the information, and inference can be performed on any unseen test example.
(Ablation setup) Backbone: ResNet-18; Dataset: CIFAR-10
Comparison
■ Mean over five trials is reported
■ Up to 17.6% improvement in accuracy
■ Improvement even with 0% noise
  • NCR acts as a regularizer
Table 1. Baseline and oracle comparison. Classification accuracy is reported on the mini-ImageNet-{Blue, Red} datasets with a ResNet-18 architecture for each individual noise ratio (0%, 20%, 40%, 80%). We present the mean accuracy and standard deviation over five trials. The oracle model is trained on only the known, clean examples in the training set using a cross-entropy loss.

                          mini-ImageNet-Blue                      mini-ImageNet-Red
Method                    0%        20%       40%       80%       0%        20%       40%       80%
BASELINES
Standard 65.8±0.4 49.5±0.4 36.6±0.5 13.1±1.0 63.5±0.5 55.3±0.9 49.5±0.7 36.4±0.4
Mixup 67.4±0.4 60.1±0.2 51.6±0.8 21.0±0.5 65.5±0.5 61.6±0.5 57.2±0.6 43.7±0.3
Bootstrap 66.4±0.4 54.4±0.5 44.8±0.5 2.9±0.3 64.3±0.3 56.2±0.2 51.3±0.6 38.2±0.3
Bootstrap + Mixup 67.5±0.3 61.9±0.4 51.0±0.7 1.3±0.1 65.9±0.4 62.7±0.2 58.3±0.5 43.5±0.6
Label smoothing 67.5±0.8 60.2±0.5 50.2±0.4 20.9±0.8 65.7±0.5 59.7±0.4 54.0±0.6 39.7±0.5
Label smoothing + Mixup 68.6±0.3 63.3±1.0 57.1±0.2 14.4±0.3 66.9±0.2 63.4±0.4 59.2±0.4 45.5±0.7
OURS
Ours: NCR 66.5±0.2 61.7±0.3 54.1±0.4 20.7±0.5 64.0±0.4 60.9±0.3 56.1±0.7 40.9±0.2
Ours: NCR + Mixup 67.9±0.6 64.3±0.1 59.2±0.6 14.2±0.4 66.3±0.5 64.6±0.6 60.4±0.3 45.4±0.4
OTHER WORKS
D-Mix [22] – – – – 55.8 50.3 50.9 35.4
ELR [26] – – – – 57.4 58.1 50.6 41.7
MOIT [30] – – – – 64.7 63.1 60.8 45.9
ORACLE: CLEAN SUBSET
Standard 65.8±0.4 63.9±0.5 60.6±0.4 45.4±0.8 63.5±0.5 61.7±0.1 58.4±0.3 41.5±0.5
Mixup 67.4±0.4 64.2±0.5 61.5±0.3 46.9±0.8 65.5±0.5 63.1±0.6 59.7±0.7 43.6±0.4
[Figure panels: Blue-40%, Blue-80%, Red-40%, Red-80% (confidence distributions, discussed on the next slide).]
Comparison
■ Confidence scores and their distribution over the data
  • With the baseline (top row), noisy labels are over-fitted
    - both clean and noisy examples pile up at p = 1
  • With NCR (bottom row), over-fitting is avoided
    - most noisy examples lie near p = 0
[Figure: confidence-score distributions for Blue-40%, Blue-80%, Red-40% and Red-80%; top row: Standard, bottom row: NCR.]
Similarity Distributions in Feature Space
■ Distributions of cosine similarity for correctly and incorrectly labelled pairs
■ On Blue-40%, the two are separated correctly
■ On Red, clean pairs are separated, but incorrectly labelled pairs are not
  • In Red, images are mislabelled across visually similar classes, so they behave alike in feature space; the similarity becomes high and separation fails
[Figure: similarity distributions for Blue-40%, Blue-80%, Red-40% and Red-80%; top row: Standard, bottom row: NCR.]
Comparison

Table 3. State-of-the-art comparison with synthetic noise on CIFAR. A-40% refers to 40% asymmetric noise. All of the other columns refer to symmetric noise.

                CIFAR-10                                   CIFAR-100
Method          20%    40%    50%    80%    90%    A-40%   20%    40%    50%    80%    90%    A-40%
Standard        83.9   68.3   58.5   25.9   17.3   77.3    61.5   46.2   37.4   10.4   4.1    43.9
MOIT+ [30]      94.1   92.0   –      75.8   –      93.2    75.9   67.4   –      51.4   –      74.0
D-Mix [22]      95.1   94.2   93.6   91.4   74.5   91.8    76.7   74.6   73.1   57.1   29.7   72.1
ELR+ [26]       94.9   94.4   93.9   90.9   74.5   88.9    76.3   74.0   72.0   57.2   30.9   75.8
Ours+ [26]      95.2   94.5   94.3   91.6   75.1   90.7    76.6   74.2   72.5   58.0   30.8   76.3

We use the same hyperparameters across all noise ratios, unlike Divide-Mix [22], which is a more realistic scenario in practice.
Figure 4. Similarity distributions. We compare the distribution of cosine similarities for training examples in mini-ImageNet that are correctly and incorrectly labelled as the same class or different classes. For mini-ImageNet-Blue, the features learned using NCR achieve significantly better class separation with 40% noise (or less, not pictured). For the more realistic mini-ImageNet-Red, NCR still achieves better separation of the clean examples but fails to separate examples that are incorrectly labelled as the same class.
Table 2. State-of-the-art comparison with realistic noise. We compare NCR and our baselines to other methods on mini-WebVision, WebVision and Clothing1M. All results use ResNet-50, except for those marked by † which use Inception-ResNetV2.

Method                 mini-WebVision   WebVision   Clothing1M
Standard               75.8             74.9        71.7
Mixup                  77.2             75.5        72.2
Ours: NCR              77.1             75.7        74.4
Ours: NCR+Mixup        79.4             75.4        74.5
Ours: NCR+Mixup+DA     80.5             76.8        74.6
MLNT; 3 iter. [23]     –                –           73.5
CleanNet [21]          –                –           74.7
LDMI [40]              –                –           72.5
LongReMix [6]          –                –           73.0
ELR [26]               76.3†            –           –
ELR+ [26]              77.8†            –           74.8
DMix [22]              76.3             –           74.8
GJS [10]               79.3             –           –
MoPro [24]             –                73.9        –
MILe [32]              –                75.2        –
Heteroscedastic [5]    –                76.6†       –
CurrNet [12]           –                79.3†       –
6. Conclusion
This work introduced Neighbor Consistency Regularization and demonstrated that it is an effective strategy for deep learning with label noise. While our approach draws inspiration from multi-stage training procedures for semi-supervised learning that employ transductive label propagation, it consists of a comparatively simple training procedure, requiring only that an extra loss be added to the objective which is optimized in stochastic gradient descent. The efficacy of NCR is emphasized by the fact that it achieved state-of-the-art results under both synthetic (CIFAR-10 and -100) and realistic (mini-WebVision) noise scenarios.

Limitations and future work. A limitation of NCR is that our proposed loss assumes that it has access to an adequate feature representation of the training data. We overcome this limitation in practice by first training the network for e epochs before applying the NCR loss, but future work is to remove this additional training hyperparameter. Promising directions for future research including cou…
■ Top accuracy on the realistic datasets
■ 1.2% more accurate than GJS [Englesson & Azizpour, NeurIPS2021], a similar method that adds a regularization term
■ Existing methods can be extended by simply adding the regularization term, yielding high accuracy
  • "Ours+": a model that adds NCR to ELR, compared against existing methods
Summary
■ Proposed NCR for learning from data with label noise
■ Consists of a simple training procedure
■ Effective under both synthetic and realistic noise
Supplementary Material
■ Difference between mini-ImageNet Blue and Red
[Figure from the supplementary materials of "Beyond Synthetic Noise: Deep Learning on Controlled Noisy Labels" (Lu Jiang, Di Huang, Mason Liu, Weilong Yang) [Jiang+, PMLR2020]: for Mini-ImageNet classes such as Ladybug and Orange, blue noise replaces true-positive labels with synthetic symmetric labels, while red noise consists of incorrect examples retrieved by text and image search.]
Supplementary Material
■ NCR hyperparameters used
■ Effect of batch size on accuracy on WebVision, which contains 1000 classes
  • A large batch size is needed so that visually similar classes occur within the same batch
Table 5. List of NCR hyperparameters used in our experiments.

      mini-ImageNet             CIFAR-10   CIFAR-100   WebVision   Clothing1M
      0%    20%   40%   80%
α     0.7   0.7   0.7   0.5     0.1        0.1         0.5         0.9
k     50    1     1     1       10         10          10          1
e     100   50    50    0       50         200         0           40

Table 6. Effect of the batch size on our proposed NCR method, on the WebVision dataset containing 1000 classes.

Batch Size   256    512    1024   2048
Accuracy     73.9   75.0   75.7   75.6
