visionNoob @ PR12
PR 171: Large-Margin Softmax Loss for Convolutional Neural Networks
Liu, Weiyang, et al. ICML 2016
https://arxiv.org/abs/1612.02295
[Slide 2]
Wang, Mei, and Weihong Deng. "Deep face recognition: A survey." arXiv preprint arXiv:1804.06655 (2018).
Figure: the development of loss functions in deep face recognition (from the survey above); today's paper and SphereFace are highlighted.
See also PR-127: FaceNet (https://youtu.be/0k3X-9y_9S8)
[Slide 3]
Large-Margin Softmax Loss
Goal: learn features that carry more discriminative information (compared to the original softmax loss or others).
Intuitively, the learned features are good if intra-class compactness and inter-class separability are simultaneously maximized.
From the perspective of the loss function, the question is how to design a loss that encourages both.
(Figure: Ahmed, Saeed, et al. "Covert cyber assault detection in smart grid networks utilizing feature selection and Euclidean distance-based machine learning." Applied Sciences 8.5 (2018): 772.)
[Slide 4]
Large-Margin Softmax Loss
Example: binary classification (feature $\boldsymbol{x}$ from the previous layer, last fc layer $W$; figure omitted)

$\boldsymbol{y}_1 = W_1^T \boldsymbol{x}, \qquad \boldsymbol{y}_2 = W_2^T \boldsymbol{x}$

softmax loss (if $y = 1$):

$L_{softmax} = -\log \frac{\exp(W_y^T \boldsymbol{x})}{\sum_{j=1}^{C} \exp(W_j^T \boldsymbol{x})} = -\log \frac{\exp(W_1^T \boldsymbol{x})}{\exp(W_1^T \boldsymbol{x}) + \exp(W_2^T \boldsymbol{x})}$

* biases and batches are omitted for simplicity.

The original softmax only forces $W_1^T \boldsymbol{x} > W_2^T \boldsymbol{x}$ in order to classify $\boldsymbol{x}$ correctly; it does not explicitly encourage intra-class compactness or inter-class separability.
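To make the setup concrete, here is a minimal NumPy sketch of this softmax loss for a single sample (biases omitted, as on the slide); the shapes and variable names are illustrative, not from the paper.

```python
import numpy as np

def softmax_loss(W, x, y):
    """Softmax cross-entropy for one sample: -log softmax(W^T x)[y].
    W: (D, C) with one column W_j per class; x: (D,); y: target index."""
    logits = W.T @ x                 # scores y_j = W_j^T x
    logits = logits - logits.max()   # stabilize exp()
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[y])

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 2))          # 4-dim feature, 2 classes (binary example)
x = rng.normal(size=4)
print(softmax_loss(W, x, y=0))       # y = 1 on the slide corresponds to index 0 here
```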
[Slide 5]
Large-Margin Softmax Loss
Intuition

$\boldsymbol{y}_1 = W_1^T \boldsymbol{x} = \|W_1\| \|\boldsymbol{x}\| \cos(\theta_1), \qquad \boldsymbol{y}_2 = W_2^T \boldsymbol{x} = \|W_2\| \|\boldsymbol{x}\| \cos(\theta_2)$

1. The original softmax forces $W_1^T \boldsymbol{x} > W_2^T \boldsymbol{x}$, i.e. $\|W_1\| \|\boldsymbol{x}\| \cos(\theta_1) > \|W_2\| \|\boldsymbol{x}\| \cos(\theta_2)$

2. L-softmax (proposed) instead requires $\|W_1\| \|\boldsymbol{x}\| \cos(m\theta_1) > \|W_2\| \|\boldsymbol{x}\| \cos(\theta_2)$, where $m$ is a positive integer and $0 \le \theta_1 \le \frac{\pi}{m}$.

When the L-softmax requirement holds, the following inequality holds:

$\|W_1\| \|\boldsymbol{x}\| \cos(\theta_1) \ge \|W_1\| \|\boldsymbol{x}\| \cos(m\theta_1) > \|W_2\| \|\boldsymbol{x}\| \cos(\theta_2)$

so the original softmax criterion is satisfied with a margin to spare.
[Slide 6]
Large-Margin Softmax Loss
Intuition: geometric interpretation

Original softmax: $\|W_1\| \|\boldsymbol{x}\| \cos(\theta_1) > \|W_2\| \|\boldsymbol{x}\| \cos(\theta_2)$
L-Softmax: $\|W_1\| \|\boldsymbol{x}\| \cos(m\theta_1) > \|W_2\| \|\boldsymbol{x}\| \cos(\theta_2)$

Assuming $\|W_1\| = \|W_2\|$, the original softmax loss requires $\theta_1 < \theta_2$ to classify the sample $\boldsymbol{x}$ as class 1, while the L-Softmax loss requires $m\theta_1 < \theta_2$ to make the same decision: a more rigorous classification criterion.
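The margin argument can be checked numerically. A small sketch (with made-up values for the norms) verifying that whenever the L-softmax criterion holds with $0 \le \theta_1 \le \pi/m$, the original softmax criterion holds as well:

```python
import numpy as np

m = 4
w1, w2, x = 1.3, 0.8, 2.0   # ||W_1||, ||W_2||, ||x|| (made-up values)
for theta1 in np.linspace(0.0, np.pi / m, 50):
    for theta2 in np.linspace(0.0, np.pi, 100):
        if w1 * x * np.cos(m * theta1) > w2 * x * np.cos(theta2):      # L-softmax criterion
            assert w1 * x * np.cos(theta1) > w2 * x * np.cos(theta2)   # softmax criterion
print("the L-softmax criterion implies the softmax one, with margin to spare")
```

This works because cos is decreasing on $[0, \pi]$ and $\theta_1 \le m\theta_1$, so $\cos(\theta_1) \ge \cos(m\theta_1)$.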
[Slide 7]
Large-Margin Softmax Loss
1. softmax loss

$L_{softmax} = -\log \frac{\exp(W_y^T \boldsymbol{x})}{\sum_{j=1}^{C} \exp(W_j^T \boldsymbol{x})} = -\log \frac{\exp(\|W_y\| \|\boldsymbol{x}\| \cos(\theta_y))}{\sum_{j=1}^{C} \exp(\|W_j\| \|\boldsymbol{x}\| \cos(\theta_j))}$

* biases and batches are omitted for simplicity.

2. L-softmax loss (Large-Margin Softmax Loss)

$L_{L\text{-}softmax} = -\log \frac{\exp(\|W_y\| \|\boldsymbol{x}\| \,\psi(\theta_y))}{\exp(\|W_y\| \|\boldsymbol{x}\| \,\psi(\theta_y)) + \sum_{j \ne y} \exp(\|W_j\| \|\boldsymbol{x}\| \cos(\theta_j))}$

where $\psi$ is a monotonically decreasing function with $\psi(\theta) \le \cos(\theta)$; the paper uses the piecewise form $\psi(\theta) = (-1)^k \cos(m\theta) - 2k$ for $\theta \in \left[\frac{k\pi}{m}, \frac{(k+1)\pi}{m}\right]$, $k \in \{0, \dots, m-1\}$.
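Putting the pieces together, a minimal NumPy sketch of the L-softmax forward pass for one sample, using the paper's piecewise $\psi$; the helper names are illustrative:

```python
import numpy as np

def psi(theta, m):
    """The paper's piecewise margin function:
    psi(theta) = (-1)^k * cos(m*theta) - 2k for theta in [k*pi/m, (k+1)*pi/m]."""
    k = min(int(theta * m / np.pi), m - 1)   # segment index k in {0, ..., m-1}
    return (-1.0) ** k * np.cos(m * theta) - 2.0 * k

def l_softmax_loss(W, x, y, m=2):
    """L-softmax for a single sample (bias omitted).
    W: (D, C) weight matrix; x: (D,) feature; y: target class index."""
    w_norms = np.linalg.norm(W, axis=0)
    x_norm = np.linalg.norm(x)
    cos = (W.T @ x) / (w_norms * x_norm + 1e-12)
    theta = np.arccos(np.clip(cos, -1.0, 1.0))
    logits = w_norms * x_norm * cos                      # ||W_j|| ||x|| cos(theta_j)
    logits[y] = w_norms[y] * x_norm * psi(theta[y], m)   # margin only on the target class
    logits = logits - logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[y])
```

With m = 1, psi reduces to cos and the loss falls back to the original softmax loss.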
[Slide 8]
Experiment
5.1 Experimental Settings
• Datasets
  • Visual classification: MNIST, CIFAR10, CIFAR100
  • Face verification: LFW
• VGG-like model with Caffe
• PReLU nonlinearities
• Batch size : 256
• Weight decay : 0.0005
• Momentum : 0.9
• He initialization
• Batch normalization
• No dropout
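For illustration, these settings could be wired up roughly as follows. This is a hedged sketch in PyTorch (the paper used Caffe); the learning rate is an assumption, not taken from the slide, and `model` stands for any VGG-like network with PReLU and batch norm.

```python
import torch
import torch.nn as nn

def configure(model: nn.Module) -> torch.optim.SGD:
    for mod in model.modules():
        if isinstance(mod, nn.Conv2d):
            # "He initialization", suited to (P)ReLU nonlinearities
            nn.init.kaiming_normal_(mod.weight, nonlinearity='relu')
    return torch.optim.SGD(model.parameters(),
                           lr=0.1,             # assumed starting value
                           momentum=0.9,       # from the slide
                           weight_decay=5e-4)  # from the slide

# Batch size 256, batch normalization, and the absence of dropout are
# dataloader / architecture choices rather than optimizer settings.
```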
[Slide 9]
Experiment
model overview
[Slide 10]
Experiment
5.2 Visual Classification
1. MNIST
[Slide 11]
Experiment
5.2 Visual Classification
2. CIFAR10 / CIFAR100
[Slide 12]
Experiment
5.2 Visual Classification
Softmax suffers from overfitting, while L-Softmax is more robust.
[Slide 13]
Experiment
5.2 Visual Classification
Softmax suffers from overfitting; with L-Softmax, deeper models keep improving.
[Slide 14]
Experiment
5.3 Face Verification
1. LFW
   - Training set: CASIA-WebFace, 490k labeled face images of over 10,000 identities
2. Face detection and alignment: IntraFace
3. PCA for a compact feature vector
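A hedged sketch of this verification pipeline: deep-feature extraction is assumed to happen elsewhere, the PCA dimension and threshold are illustrative, and scoring pairs by cosine similarity of the compressed features is an assumption about the protocol.

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_pca(feats, dim=128):             # feats: (N, D) deep features; dim is assumed
    return PCA(n_components=dim).fit(feats)

def verify(pca, f1, f2, threshold=0.5):  # threshold is illustrative
    """Score one face pair by cosine similarity of PCA-compressed features."""
    a = pca.transform(f1[None, :])[0]
    b = pca.transform(f2[None, :])[0]
    cos_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return cos_sim > threshold           # declare "same identity" if similar enough
```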
[Slide 15]
Conclusion
1. The Large-Margin Softmax loss is proposed
2. Adjustable margin via the parameter $m$
3. Clear intuition and geometric interpretation
4. State-of-the-art CNN results on benchmarks
[Slide 16]
Discussion
SphereFace: Deep Hypersphere Embedding for Face Recognition

$L_{L\text{-}softmax} = -\log \frac{\exp(\|W_y\| \|\boldsymbol{x}\| \,\psi(\theta_y))}{\exp(\|W_y\| \|\boldsymbol{x}\| \,\psi(\theta_y)) + \sum_{j \ne y} \exp(\|W_j\| \|\boldsymbol{x}\| \cos(\theta_j))}$

$W_1^T \boldsymbol{x} = \|W_1\| \|\boldsymbol{x}\| \cos(\theta_1), \qquad W_2^T \boldsymbol{x} = \|W_2\| \|\boldsymbol{x}\| \cos(\theta_2)$

(In SphereFace, $\|W_j\|$ is normalized to 1, so the decision depends on the angles alone!)
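A one-line illustration of the effect of that normalization: with unit-norm columns, the logit reduces to $\|\boldsymbol{x}\| \cos(\theta_j)$, so only the angle carries the decision.

```python
import numpy as np

def normalized_logits(W, x):
    """Logits after SphereFace-style weight normalization (||W_j|| = 1),
    so that W_j^T x = ||x|| * cos(theta_j)."""
    W_unit = W / np.linalg.norm(W, axis=0, keepdims=True)
    return W_unit.T @ x
```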
[Slide 17]
Deng, Jiankang, et al. "Arcface: Additive angular margin loss for deep face recognition." CVPR 2019
Angular Margin Losses (SphereFace, CVPR 2017; CosFace, CVPR 2018; ArcFace, CVPR 2019)
= L-softmax + weight normalization, differing mainly in where the margin is placed
[Slide 18]
Metric Learning

- Contrastive loss (NIPS 2014): Sun, Yi, et al. "Deep learning face representation by joint identification-verification." NIPS 2014.
- Triplet loss (CVPR 2015): Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A unified embedding for face recognition and clustering." CVPR 2015.
- Center loss (ECCV 2016): Wen, Yandong, et al. "A discriminative feature learning approach for deep face recognition." ECCV 2016.
- Ring loss (CVPR 2018): Zheng, Yutong, Dipan K. Pal, and Marios Savvides. "Ring loss: Convex feature normalization for face recognition." CVPR 2018.
[Slide 19]
Q & A
Editor's Notes

• #2: Let me begin. The paper I am presenting is Large-Margin Softmax Loss for Convolutional Neural Networks, published at ICML 2016.
• #3: I chose this paper because Taeoh gave an overview of face recognition in PR-127. I originally planned to present ArcFace from this year's CVPR, but since today's Large-Margin Softmax paper is the basis for it, I am introducing it first. I have also included a brief slide on SphereFace, CosFace, and ArcFace for reference.
• #4: What this paper ultimately asks is: how should we design the loss function so that more discriminative information is learned? If we want a classifier to separate two classes well, we should maximize intra-class compactness and inter-class separability. From the perspective of loss design, the question is what loss, and what penalty, will make the network learn that way.
• #5: First, let's look at softmax; it should be familiar. Consider a simple binary classification example with a linear layer. In a deep model such as a CNN, an input feature vector x arrives from the previous layer, the matrix W is a linear transformation such as an fc layer, and each element of the output vector is the inner product of a column of W with x. Such a layer is typically trained with a softmax cross-entropy loss, and softmax is "happy" as long as W1^T x > W2^T x. The paper's point is that the softmax loss does not actually consider intra-class compactness or inter-class separability.
• #6: Now, the inner product between a column of W and x can be rewritten as the product of the two magnitudes and cos(theta). What the Large-Margin Softmax loss proposes is: what if we put a margin inside this cos(theta)?
• #7: Looking at the geometric interpretation again: with the original softmax, classifying sample x requires theta1 < theta2, whereas L-Softmax requires m*theta1 < theta2 for the same decision, so theta1 must be m times smaller than in the original. The paper's claim is that this property yields a stricter classification criterion.
• #8: The Large-Margin Softmax loss, called L-softmax, is written as shown. In the original softmax, the numerator is the target-class score over the sum over all classes; in L-Softmax, the target class uses the function psi instead of cos(theta). The original cos(theta) is the blue solid curve, and psi continuously repeats the cos(m*theta) segments according to the margin m; for m = 2, the red solid curve based on cos(2*theta) is used.
• #9: Now the experiments: MNIST, CIFAR10/100, and LFW, with a VGG-like model and fairly standard training settings.
• #10: Like this.
• #11: On the MNIST dataset, they show how training changes with the size of the margin; m = 1 is plain softmax.
• #12: On CIFAR10/100, their method performs better.
• #13: They also claim it is somewhat more robust to overfitting than softmax. The paper does not explain why in detail; the discussion is roughly that the margin makes the model harder to fit.
• #14: Being robust to overfitting, performance keeps improving as layers are added.
• #15: To evaluate whether the features were learned well.
• #17: As some of you may have noticed, when the magnitudes of W1 and W2 are not equal, interpreting the decision in terms of theta alone is difficult. The paper only says this case is complicated but that the margin still seems to work. The SphereFace paper resolves it by normalizing W to 1 outright.
• #18: In face recognition these days, methods like SphereFace, CosFace, and ArcFace keep appearing; you could fairly say the three papers differ mainly in where they place the margin. There are minor differences such as the scale factor s, but the concept is similar. It is interesting that where the margin is placed changes the geometric interpretation and the performance.
• #19: Beyond this, there are many existing metric learning methods. In my view, this line of research keeps moving toward simpler and stronger ways to maximize intra-class compactness and inter-class separability. Triplet loss, for example, draws three samples: an anchor, a positive with the same label, and a negative with a different label; but mining negatives becomes surprisingly complicated as the data grows. Center loss has the problem that convergence becomes very hard as the number of classes grows. The direction seems to be methods that stay simple and strong even as data and labels scale up.
• #22: Even so, perhaps that is because the dataset is not well curated. From the standpoint of running an actual service with triplet loss, the trend is toward increasingly simple yet strong methods.
• #25: I am looking at this Large-Margin Softmax paper as part of the face recognition algorithm stream: how can we build good features for the open-set face recognition problem?
• #29: Thinking about this once more: y_1 is the inner product of W_1 and x, and y_2 is the inner product of W_2 and x. Training this way should maximize intra-class compactness and inter-class separability; because of the margin, theta1 must be much smaller than theta2. Where the original softmax classifies sample x when theta1 < theta2, L-Softmax requires m*theta1 < theta2 for the same decision.