Unsupervised Data Augmentation for Consistency Training
Sungchul Kim
Contents
1. Semi-supervised learning
2. Unsupervised Data Augmentation (UDA)
3. Experiments
4. Appendix
Semi-supervised learning
• Goal
• Improve performance by leveraging unlabeled data when only a small amount of labeled data is available!
https://steemit.com/kr/@jiwoopapa/realistic-evaluation-of-semi-supervised-learning-algorithms
Semi-supervised learning
[Timeline figure, 2017–2020: families of semi-supervised learning methods — Entropy Minimization (2005), Consistency Regularization, Generic Regularization, and Self-Training — with individual methods including Π-Model, Mean Teacher, VAT, MixUp, MixMatch, ReMixMatch, FixMatch, UDA, Pseudo Labeling (2013), and Noisy Student]
Semi-supervised learning
• Entropy Minimization
• Used to increase the confidence of the (softmax) predictions
• Usually implemented with a softmax temperature
• Setting the temperature below 1 lowers the entropy of the predictions (a small numerical sketch follows the source links below)
https://papers.nips.cc/paper/2740-semi-supervised-learning-by-entropy-minimization.pdf
http://dsba.korea.ac.kr/seminar/?mod=document&uid=68
http://dsba.korea.ac.kr/seminar/?mod=document&uid=248
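To make the temperature effect concrete, here is a minimal sketch (not from the slides): dividing the logits by a temperature below 1 sharpens the softmax output and lowers its entropy. The logit values and temperatures are made up for illustration.

```python
import torch
import torch.nn.functional as F

def entropy(p):
    # Shannon entropy of a categorical distribution (batch x classes)
    return -(p * torch.log(p + 1e-12)).sum(dim=-1)

logits = torch.tensor([[2.0, 1.0, 0.5]])  # hypothetical model outputs
for temperature in (1.0, 0.5, 0.1):
    probs = F.softmax(logits / temperature, dim=-1)  # T < 1 -> sharper distribution
    print(f"T={temperature}: probs={probs.tolist()}, entropy={entropy(probs).item():.3f}")
```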
Semi-supervised learning
• Consistency Regularization
1. Use the model to predict the distribution of the unlabeled data
2. Add noise to the unlabeled data (data augmentation)
3. Train the model using the predicted distribution as the target label for the augmented data
(a minimal sketch follows the source link below)
http://dsba.korea.ac.kr/seminar/?mod=document&uid=248
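A minimal sketch of the three steps above, assuming a generic classifier `model`, an `augment` transform, and a batch of unlabeled inputs; the function name and signature are placeholders, not the code of any particular paper.

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, x_unlabeled, augment):
    # 1. Predict the distribution on the clean unlabeled data (used as the target).
    with torch.no_grad():
        target = F.softmax(model(x_unlabeled), dim=-1)
    # 2. Add noise via data augmentation.
    x_aug = augment(x_unlabeled)
    # 3. Train the model so its prediction on the augmented data matches the target.
    log_pred = F.log_softmax(model(x_aug), dim=-1)
    return F.kl_div(log_pred, target, reduction="batchmean")
```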
Semi-supervised learning
• Consistency Regularization
• Π-Model (ICLR 2017)
https://arxiv.org/pdf/1610.02242.pdf
Semi-supervised learning
• Consistency Regularization
• Mean Teacher (NIPS 2017)
https://arxiv.org/pdf/1703.01780.pdf
Semi-supervised learning
• Consistency Regularization
• Virtual Adversarial Training (TPAMI 2018)
https://arxiv.org/pdf/1704.03976.pdf
https://medium.com/@kabbi159/semi-supervised-learning-%EC%A0%95%EB%A6%AC-a7ed58a8f023
Unsupervised Data Augmentation (UDA)
https://twitter.com/quocleix/status/1123103668318769152
Unsupervised Data Augmentation (UDA)
• UDA
• Given an input x, compute the output distribution p_θ(y|x) and a noised version p_θ(y|x, ε) by injecting a small noise ε. The noise can be applied to x or to hidden states.
• Minimize a divergence metric between the two distributions: D(p_θ(y|x) ‖ p_θ(y|x, ε)).
• This makes the model less sensitive to ε, i.e., smoother with respect to changes in the input (or hidden) space.
• Minimizing the consistency loss gradually propagates label information from the labeled examples to the unlabeled ones.

min_θ J(θ) = E_{(x, y*) ∈ L}[−log p_θ(y*|x)] + λ · E_{x ∈ U} E_{x̂ ~ q(x̂|x)}[D_KL(p_θ̃(y|x) ‖ p_θ(y|x̂))]
(first term: supervised cross entropy; second term: unsupervised consistency training loss)

• λ (= 1): a weighting factor to balance the supervised cross entropy and the unsupervised consistency training loss
• q(x̂|x): a data augmentation transformation
• θ̃: a fixed copy of the current parameters θ, indicating that the gradient is not propagated through θ̃
(a PyTorch sketch of this objective follows below)
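A sketch of the objective above, assuming a hypothetical `model`, an `augment` transform playing the role of q(x̂|x), a labeled batch (x_l, y_l), and a (typically larger) unlabeled batch x_u; λ = 1 as on the slide.

```python
import torch
import torch.nn.functional as F

def uda_loss(model, x_l, y_l, x_u, augment, lam=1.0):
    # Supervised cross entropy on the labeled batch.
    sup_loss = F.cross_entropy(model(x_l), y_l)

    # Target distribution p_θ̃(y|x): gradients are blocked, i.e. a fixed copy
    # of the current parameters.
    with torch.no_grad():
        target = F.softmax(model(x_u), dim=-1)

    # Prediction on the augmented copy x̂ ~ q(x̂|x).
    log_pred = F.log_softmax(model(augment(x_u)), dim=-1)
    consistency = F.kl_div(log_pred, target, reduction="batchmean")

    return sup_loss + lam * consistency
```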
Unsupervised Data Augmentation (UDA)
• UDA

min_θ J(θ) = E_{(x, y*) ∈ L}[−log p_θ(y*|x)] + λ · E_{x ∈ U} E_{x̂ ~ q(x̂|x)}[D_KL(p_θ̃(y|x) ‖ p_θ(y|x̂))]
(supervised cross entropy + unsupervised consistency training loss)

• λ (= 1): a weighting factor to balance the supervised cross entropy and the unsupervised consistency training loss
• q(x̂|x): a data augmentation transformation
• θ̃: a fixed copy of the current parameters θ, indicating that the gradient is not propagated through θ̃
• Different batch sizes are used for the supervised data and the unsupervised data
• To reduce the discrepancy between supervised training (p_θ(y|x)) and prediction on unlabeled examples (p_θ̃(y|x)), the same augmentation is also applied to the unlabeled examples
Unsupervised Data Augmentation (UDA)
• Augmentation Strategies for Different Tasks
• RandAugment for Image Classification
• Unlike AutoAugment, no search is required (a torchvision-based sketch of the image pipeline follows below)
• Back-translation for Text Classification
• Word replacing with TF-IDF for Text Classification
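For the image branch, a hedged example of what the augmentation pipeline could look like with torchvision's RandAugment (available in recent torchvision versions); the parameter values and the weak-augmentation choices are illustrative, not the paper's exact settings.

```python
from torchvision import transforms

# Strong, search-free augmentation for the unlabeled branch: RandAugment only
# needs the number of ops and a global magnitude (values here are illustrative).
strong_augment = transforms.Compose([
    transforms.RandAugment(num_ops=2, magnitude=9),
    transforms.ToTensor(),
])

# Basic augmentation often used on the labeled branch (CIFAR-style crop & flip).
weak_augment = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```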
Unsupervised Data Augmentation (UDA)
• Training Signal Annealing for Low-data Regime
• In semi-supervised learning, the amount of unlabeled data is vastly larger than the amount of labeled data
• The larger the model, the more easily it overfits the small labeled set while underfitting the unlabeled data
• Training Signal Annealing (TSA)
• Early in training, labeled examples whose confidence on the ground-truth label is already above a given threshold are not used
Unsupervised Data Augmentation (UDA)
• Training Signal Annealing for Low-data Regime
• At training step t, if the model's predicted probability for the correct category, p_θ(y*|x), exceeds the threshold η_t, the example is removed from the loss function
• K: the number of categories; η_t = α_t · (1 − 1/K) + 1/K
• Log-schedule (α_t = 1 − exp(−(t/T) · 5)): when the model is unlikely to overfit (many labeled examples, or the model uses effective regularization)
• Linear-schedule (α_t = t/T)
• Exp-schedule (α_t = exp((t/T − 1) · 5)): when the problem is easy or the number of labeled examples is limited
(a minimal sketch of the schedules and the masking follows below)
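A sketch of TSA as described above: the three schedules for α_t and the masking of labeled examples whose correct-class probability already exceeds η_t. The helper names and the masked-mean reduction are one possible implementation, not the official code.

```python
import math
import torch
import torch.nn.functional as F

def tsa_threshold(t, T, K, schedule="linear"):
    """eta_t = alpha_t * (1 - 1/K) + 1/K at training step t of T, for K classes."""
    if schedule == "log":
        alpha = 1 - math.exp(-(t / T) * 5)
    elif schedule == "linear":
        alpha = t / T
    elif schedule == "exp":
        alpha = math.exp((t / T - 1) * 5)
    else:
        raise ValueError(f"unknown schedule: {schedule}")
    return alpha * (1 - 1 / K) + 1 / K

def tsa_cross_entropy(logits, labels, t, T, schedule="linear"):
    """Supervised loss with TSA: drop examples the model already predicts
    with probability above the annealed threshold eta_t."""
    eta = tsa_threshold(t, T, logits.size(-1), schedule)
    correct_prob = F.softmax(logits, dim=-1).gather(1, labels.unsqueeze(1)).squeeze(1)
    mask = (correct_prob < eta).float()  # keep only not-yet-confident examples
    loss = F.cross_entropy(logits, labels, reduction="none")
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)
```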
Experiments
• Dataset
• Language
• IMDb, Yelp-2, Yelp-5, Amazon-2, Amazon-5
• Experiments on the language datasets are omitted from this presentation
• Vision
• CIFAR-10, SVHN
https://www.cs.toronto.edu/~kriz/cifar.html
http://ufldl.stanford.edu/housenumbers/
CIFAR-10 SVHN (Street View House Number)
Experiments
• Correlation between Supervised and Semi-supervised Performances
• Supervised vs. semi-supervised performance on CIFAR-10 for each augmentation
• Augmentation : Crop & Flip / Cutout / RandAugment
• Stronger data augmentation works better
• Augmentations that perform well in the supervised setting also perform well in the semi-supervised setting (Table 1)
• Performance generally improves as the number of transformations increases (Figure 6)
Experiments
• Algorithm Comparison on Vision Semi-supervised Learning Benchmarks
• Comparison with current semi-supervised learning algorithms (CIFAR-10, SVHN)
• Vary the size of labeled data
• Wide-ResNet-28-2 + varied supervised data sizes
• Compared against Virtual Adversarial Training (VAT) and MixMatch
• UDA outperforms the other algorithms in every setting
• VAT's noise-based augmentation contains high-frequency artifacts not found in real images, which degrades performance
Experiments
• Algorithm Comparison on Vision Semi-supervised Learning Benchmarks
• Comparison with current semi-supervised learning algorithms (CIFAR-10, SVHN)
• Vary the size of labeled data
Experiments
• Algorithm Comparison on Vision Semi-supervised Learning Benchmarks
• Comparison with current semi-supervised learning algorithms (CIFAR-10, SVHN)
• Comparison with published results
Experiments
• Scalability Test on the ImageNet Dataset
• ImageNet + ResNet-50
1. 10% of the supervised data of ImageNet while using all other data as unlabeled data
2. All images in ImageNet as supervised data + filtering to 1.3M images from JFT → unlabeled data
• Out-of-domain unlabeled data is easy to collect, but if its distribution differs too much from the in-domain data, it degrades performance
→ First train on the labeled data, then pick the out-of-domain examples with high prediction confidence and use them as unlabeled data (no ground-truth labels required)
• Outperforms the supervised baseline in every setting
• UDA not only scales with the amount of labeled data, but can also make use of out-of-domain unlabeled data (results similar to S4L and CPC)
Experiments
• Ablation Studies for TSA
• CIFAR-10 : 4k labeled examples and 50k unlabeled examples
• The linear-schedule performs best on CIFAR-10
Appendix
• Additional Training Techniques
• Sharpening Predictions
• Entropy minimization
• Add an entropy objective term to the overall objective so that predictions on augmented data have low entropy (i.e., sharper predictions)
• Confidence-based masking
• Unlabeled examples whose predictions have low confidence are not used for training
• Softmax temperature controlling
• Apply a softmax temperature below 1 when computing predictions on the unlabeled data so that the targets for the augmented data are sharper
• Confidence-based masking and softmax temperature controlling are useful when labeled data is very scarce, while entropy minimization is effective when labeled data is plentiful
(a sketch combining two of these tricks follows the source link below)
https://medium.com/platfarm/unsupervised-data-augmentation-for-consistency-training-5bcd52d3f01b
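A sketch combining two of the tricks above, confidence-based masking and a softmax temperature below 1, inside the consistency loss; the threshold and temperature values are illustrative placeholders, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def sharpened_consistency_loss(model, x_u, augment, temperature=0.4, conf_threshold=0.8):
    with torch.no_grad():
        logits = model(x_u)
        # Softmax temperature controlling: T < 1 sharpens the unlabeled targets.
        target = F.softmax(logits / temperature, dim=-1)
        # Confidence-based masking: drop low-confidence unlabeled examples.
        confidence = F.softmax(logits, dim=-1).max(dim=-1).values
        mask = (confidence > conf_threshold).float()

    log_pred = F.log_softmax(model(augment(x_u)), dim=-1)
    per_example = F.kl_div(log_pred, target, reduction="none").sum(dim=-1)
    return (per_example * mask).sum() / mask.sum().clamp(min=1.0)
```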
Thank you