Unsupervised Data Augmentation for Consistency Training
Sungchul Kim
Contents
1. Semi-supervised learning
2. Unsupervised Data Augmentation (UDA)
3. Experiments
4. Appendix
Semi-supervised learning
• Goal
• Improve performance by leveraging unlabeled data when only a small amount of labeled data is available!
https://steemit.com/kr/@jiwoopapa/realistic-evaluation-of-semi-supervised-learning-algorithms
Semi-supervised learning
[Timeline figure, 2017–2020: families of semi-supervised learning methods — Entropy Minimization (2005), Consistency Regularization, Generic Regularization, and Self-Training — with individual methods including Π-Model, Mean Teacher, VAT, MixUp, MixMatch, ReMixMatch, FixMatch, UDA, Pseudo Labeling (2013), and Noisy Student]
Semi-supervised learning
• Entropy Minimization
• Used to increase the confidence of the (softmax) predictions
• Usually implemented with a softmax temperature
• Setting the temperature below 1 lowers the entropy of the predictions (a small numerical sketch follows the source links below)
https://papers.nips.cc/paper/2740-semi-supervised-learning-by-entropy-minimization.pdf
http://dsba.korea.ac.kr/seminar/?mod=document&uid=68
http://dsba.korea.ac.kr/seminar/?mod=document&uid=248
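To make the temperature effect concrete, here is a minimal sketch (not from the slides): dividing the logits by a temperature below 1 sharpens the softmax output and lowers its entropy. The logit values and temperatures are made up for illustration.

```python
import torch
import torch.nn.functional as F

def entropy(p):
    # Shannon entropy of a categorical distribution (batch x classes)
    return -(p * torch.log(p + 1e-12)).sum(dim=-1)

logits = torch.tensor([[2.0, 1.0, 0.5]])  # hypothetical model outputs
for temperature in (1.0, 0.5, 0.1):
    probs = F.softmax(logits / temperature, dim=-1)  # T < 1 -> sharper distribution
    print(f"T={temperature}: probs={probs.tolist()}, entropy={entropy(probs).item():.3f}")
```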
Semi-supervised learning
• Consistency Regularization
1. Use the model to predict the distribution of the unlabeled data
2. Add noise to the unlabeled data (data augmentation)
3. Train the model using the predicted distribution as the target label for the augmented data
(a minimal sketch follows the source link below)
http://dsba.korea.ac.kr/seminar/?mod=document&uid=248
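A minimal sketch of the three steps above, assuming a generic classifier `model`, an `augment` transform, and a batch of unlabeled inputs; the function name and signature are placeholders, not the code of any particular paper.

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, x_unlabeled, augment):
    # 1. Predict the distribution on the clean unlabeled data (used as the target).
    with torch.no_grad():
        target = F.softmax(model(x_unlabeled), dim=-1)
    # 2. Add noise via data augmentation.
    x_aug = augment(x_unlabeled)
    # 3. Train the model so its prediction on the augmented data matches the target.
    log_pred = F.log_softmax(model(x_aug), dim=-1)
    return F.kl_div(log_pred, target, reduction="batchmean")
```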
Semi-supervised learning
• Consistency Regularization
• Π-Model (ICLR 2017)
https://arxiv.org/pdf/1610.02242.pdf
Semi-supervised learning
• Consistency Regularization
• Mean Teacher (NIPS 2017)
https://arxiv.org/pdf/1703.01780.pdf
Semi-supervised learning
• Consistency Regularization
• Virtual Adversarial Training (TPAMI 2018)
https://arxiv.org/pdf/1704.03976.pdf
https://medium.com/@kabbi159/semi-supervised-learning-%EC%A0%95%EB%A6%AC-a7ed58a8f023
Unsupervised Data Augmentation (UDA)
https://twitter.com/quocleix/status/1123103668318769152
Unsupervised Data Augmentation (UDA)
• UDA
• Given an input x, compute the output distribution p_θ(y|x) and a noised version p_θ(y|x, ε) by injecting a small noise ε. The noise can be applied to x or to hidden states.
• Minimize a divergence metric between the two distributions: D(p_θ(y|x) ‖ p_θ(y|x, ε)).
• This makes the model less sensitive to ε, i.e., smoother with respect to changes in the input (or hidden) space.
• Minimizing the consistency loss gradually propagates label information from the labeled examples to the unlabeled ones.

min_θ J(θ) = E_{(x, y*) ∈ L}[−log p_θ(y*|x)] + λ · E_{x ∈ U} E_{x̂ ~ q(x̂|x)}[D_KL(p_θ̃(y|x) ‖ p_θ(y|x̂))]
(first term: supervised cross entropy; second term: unsupervised consistency training loss)

• λ (= 1): a weighting factor to balance the supervised cross entropy and the unsupervised consistency training loss
• q(x̂|x): a data augmentation transformation
• θ̃: a fixed copy of the current parameters θ, indicating that the gradient is not propagated through θ̃
(a PyTorch sketch of this objective follows below)
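A sketch of the objective above, assuming a hypothetical `model`, an `augment` transform playing the role of q(x̂|x), a labeled batch (x_l, y_l), and a (typically larger) unlabeled batch x_u; λ = 1 as on the slide.

```python
import torch
import torch.nn.functional as F

def uda_loss(model, x_l, y_l, x_u, augment, lam=1.0):
    # Supervised cross entropy on the labeled batch.
    sup_loss = F.cross_entropy(model(x_l), y_l)

    # Target distribution p_θ̃(y|x): gradients are blocked, i.e. a fixed copy
    # of the current parameters.
    with torch.no_grad():
        target = F.softmax(model(x_u), dim=-1)

    # Prediction on the augmented copy x̂ ~ q(x̂|x).
    log_pred = F.log_softmax(model(augment(x_u)), dim=-1)
    consistency = F.kl_div(log_pred, target, reduction="batchmean")

    return sup_loss + lam * consistency
```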
Unsupervised Data Augmentation (UDA)
• UDA

min_θ J(θ) = E_{(x, y*) ∈ L}[−log p_θ(y*|x)] + λ · E_{x ∈ U} E_{x̂ ~ q(x̂|x)}[D_KL(p_θ̃(y|x) ‖ p_θ(y|x̂))]
(supervised cross entropy + unsupervised consistency training loss)

• λ (= 1): a weighting factor to balance the supervised cross entropy and the unsupervised consistency training loss
• q(x̂|x): a data augmentation transformation
• θ̃: a fixed copy of the current parameters θ, indicating that the gradient is not propagated through θ̃
• Different batch sizes are used for the supervised data and the unsupervised data
• To reduce the discrepancy between supervised training (p_θ(y|x)) and prediction on unlabeled examples (p_θ̃(y|x)), the same augmentation is also applied to the unlabeled examples
Unsupervised Data Augmentation (UDA)
• Augmentation Strategies for Different Tasks
• RandAugment for Image Classification
• Unlike AutoAugment, no search is required (a torchvision-based sketch of the image pipeline follows below)
• Back-translation for Text Classification
• Word replacing with TF-IDF for Text Classification
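For the image branch, a hedged example of what the augmentation pipeline could look like with torchvision's RandAugment (available in recent torchvision versions); the parameter values and the weak-augmentation choices are illustrative, not the paper's exact settings.

```python
from torchvision import transforms

# Strong, search-free augmentation for the unlabeled branch: RandAugment only
# needs the number of ops and a global magnitude (values here are illustrative).
strong_augment = transforms.Compose([
    transforms.RandAugment(num_ops=2, magnitude=9),
    transforms.ToTensor(),
])

# Basic augmentation often used on the labeled branch (CIFAR-style crop & flip).
weak_augment = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```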
Unsupervised Data Augmentation (UDA)
• Training Signal Annealing for Low-data Regime
• In semi-supervised learning, the amount of unlabeled data is vastly larger than the amount of labeled data
• The larger the model, the more easily it overfits the small labeled set while underfitting the unlabeled data
• Training Signal Annealing (TSA)
• Early in training, labeled examples whose confidence on the ground-truth label is already above a given threshold are not used
Unsupervised Data Augmentation (UDA)
• Training Signal Annealing for Low-data Regime
• At training step t, if the model's predicted probability for the correct category, p_θ(y*|x), exceeds the threshold η_t, the example is removed from the loss function
• K: the number of categories; η_t = α_t · (1 − 1/K) + 1/K
• Log-schedule (α_t = 1 − exp(−(t/T) · 5)): when the model is unlikely to overfit (many labeled examples, or the model uses effective regularization)
• Linear-schedule (α_t = t/T)
• Exp-schedule (α_t = exp((t/T − 1) · 5)): when the problem is easy or the number of labeled examples is limited
(a minimal sketch of the schedules and the masking follows below)
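A sketch of TSA as described above: the three schedules for α_t and the masking of labeled examples whose correct-class probability already exceeds η_t. The helper names and the masked-mean reduction are one possible implementation, not the official code.

```python
import math
import torch
import torch.nn.functional as F

def tsa_threshold(t, T, K, schedule="linear"):
    """eta_t = alpha_t * (1 - 1/K) + 1/K at training step t of T, for K classes."""
    if schedule == "log":
        alpha = 1 - math.exp(-(t / T) * 5)
    elif schedule == "linear":
        alpha = t / T
    elif schedule == "exp":
        alpha = math.exp((t / T - 1) * 5)
    else:
        raise ValueError(f"unknown schedule: {schedule}")
    return alpha * (1 - 1 / K) + 1 / K

def tsa_cross_entropy(logits, labels, t, T, schedule="linear"):
    """Supervised loss with TSA: drop examples the model already predicts
    with probability above the annealed threshold eta_t."""
    eta = tsa_threshold(t, T, logits.size(-1), schedule)
    correct_prob = F.softmax(logits, dim=-1).gather(1, labels.unsqueeze(1)).squeeze(1)
    mask = (correct_prob < eta).float()  # keep only not-yet-confident examples
    loss = F.cross_entropy(logits, labels, reduction="none")
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)
```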
Experiments
• Dataset
• Language
• IMDb, Yelp-2, Yelp-5, Amazon-2, Amazon-5
• Experiments on the language datasets are omitted from this presentation
• Vision
• CIFAR-10, SVHN
https://www.cs.toronto.edu/~kriz/cifar.html
http://ufldl.stanford.edu/housenumbers/
CIFAR-10 SVHN (Street View House Number)
Experiments
• Correlation between Supervised and Semi-supervised Performances
• Supervised vs. semi-supervised performance on CIFAR-10 for each augmentation
• Augmentation : Crop & Flip / Cutout / RandAugment
• Stronger data augmentation works better
• Augmentations that perform well in the supervised setting also perform well in the semi-supervised setting (Table 1)
• Performance generally improves as the number of transformations increases (Figure 6)
Experiments
• Algorithm Comparison on Vision Semi-supervised Learning Benchmarks
• Comparison with current semi-supervised learning algorithms (CIFAR-10, SVHN)
• Vary the size of labeled data
• Wide-ResNet-28-2 + varied supervised data sizes
• Compared against Virtual Adversarial Training (VAT) and MixMatch
• UDA outperforms the other algorithms in every setting
• VAT's noise-based augmentation contains high-frequency artifacts not found in real images, which degrades performance
Experiments
• Algorithm Comparison on Vision Semi-supervised Learning Benchmarks
• Comparison with current semi-supervised learning algorithms (CIFAR-10, SVHN)
• Vary the size of labeled data
Experiments
• Algorithm Comparison on Vision Semi-supervised Learning Benchmarks
• Comparison with current semi-supervised learning algorithms (CIFAR-10, SVHN)
• Comparison with published results
Experiments
• Scalability Test on the ImageNet Dataset
• ImageNet + ResNet-50
1. 10% of the supervised data of ImageNet while using all other data as unlabeled data
2. All images in ImageNet as supervised data + filtering to 1.3M images from JFT → unlabeled data
• Out-of-domain unlabeled data is easy to collect, but if its distribution differs too much from the in-domain data, it degrades performance
→ First train on the labeled data, then pick the out-of-domain examples with high prediction confidence and use them as unlabeled data (no ground-truth labels required)
• Outperforms the supervised baseline in every setting
• UDA not only scales with the amount of labeled data, but can also make use of out-of-domain unlabeled data (results similar to S4L and CPC)
Experiments
• Ablation Studies for TSA
• CIFAR-10 : 4k labeled examples and 50k unlabeled examples
• The linear-schedule performs best on CIFAR-10
Appendix
• Additional Training Techniques
• Sharpening Predictions
• Entropy minimization
• Add an entropy objective term to the overall objective so that predictions on augmented data have low entropy (i.e., sharper predictions)
• Confidence-based masking
• Unlabeled examples whose predictions have low confidence are not used for training
• Softmax temperature controlling
• Apply a softmax temperature below 1 when computing predictions on the unlabeled data so that the targets for the augmented data are sharper
• Confidence-based masking and softmax temperature controlling are useful when labeled data is very scarce, while entropy minimization is effective when labeled data is plentiful
(a sketch combining two of these tricks follows the source link below)
https://medium.com/platfarm/unsupervised-data-augmentation-for-consistency-training-5bcd52d3f01b
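A sketch combining two of the tricks above, confidence-based masking and a softmax temperature below 1, inside the consistency loss; the threshold and temperature values are illustrative placeholders, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def sharpened_consistency_loss(model, x_u, augment, temperature=0.4, conf_threshold=0.8):
    with torch.no_grad():
        logits = model(x_u)
        # Softmax temperature controlling: T < 1 sharpens the unlabeled targets.
        target = F.softmax(logits / temperature, dim=-1)
        # Confidence-based masking: drop low-confidence unlabeled examples.
        confidence = F.softmax(logits, dim=-1).max(dim=-1).values
        mask = (confidence > conf_threshold).float()

    log_pred = F.log_softmax(model(augment(x_u)), dim=-1)
    per_example = F.kl_div(log_pred, target, reduction="none").sum(dim=-1)
    return (per_example * mask).sum() / mask.sum().clamp(min=1.0)
```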
Thank you