Face Recognition: From Scratch To Hatch / Eduard Tyantov (Mail.ru Group)

Face Recognition: From Scratch To Hatch
Tyantov Eduard, Mail.ru Group

Face Recognition in Cloud@Mail.ru
Users upload photos to Cloud
The backend identifies people in photos, tags them, and shows clusters

???
You
Your
Ex-Girlfriend
Social networks

Convolutional neural networks, briefly

edges -> object parts (combinations of edges) -> object models

Auxiliary task: facial landmarks
– Face alignment: rotation
– Goal: make it easier for Face Recognition
Rotate

Train Datasets
WIDER FACE
– 32k images
– 494k faces
CelebA
– 200k images, 10k persons
– Landmarks, 40 binary attributes

Test Dataset: FDDB
Face Detection Data Set and Benchmark
– 2845 images
– 5171 faces

Old school: Viola-Jones
Haar Feature-based Cascade Classifiers
Haar-like features
Example: the eye region is darker, the nose region is lighter
Examples

Viola-Jones algorithm: training
For each patch in the dataset (Face or Not):
– 160k candidate features
– AdaBoost keeps the 6k most valuable features
– The ensemble is a weighted sum of the selected features

Viola-Jones algorithm: inference
For each patch:
Stage 1 -> (yes) -> Stage 2 -> (yes) -> ... -> Stage N -> Face
Optimization
– Features are grouped into stages
– If a patch fails any stage => discard
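
The stage-wise rejection described above can be sketched as follows. This is a minimal illustration, not the OpenCV implementation: the `stages` structure, the toy features, and all thresholds are hypothetical and chosen only to show the early exit.

```python
def cascade_accepts(patch, stages):
    """Return True only if the patch passes every stage of the cascade."""
    for weak_classifiers, stage_threshold in stages:
        # AdaBoost-style weighted vote of the weak classifiers in this stage.
        score = sum(weight * feature(patch) for feature, weight in weak_classifiers)
        if score < stage_threshold:
            return False  # early exit: the patch is discarded at this stage
    return True


# Toy "Haar-like" features on a 2x2 patch given as a flat list
# [top-left, top-right, bottom-left, bottom-right] (hypothetical layout).
def top_darker_than_bottom(p):
    return 1.0 if p[0] + p[1] < p[2] + p[3] else 0.0


def left_darker_than_right(p):
    return 1.0 if p[0] + p[2] < p[1] + p[3] else 0.0


# Each stage is (list of (feature, weight), threshold); values are made up.
stages = [
    ([(top_darker_than_bottom, 1.0)], 0.5),
    ([(top_darker_than_bottom, 0.6), (left_darker_than_right, 0.4)], 0.9),
]
```

Because cheap early stages reject most non-face patches, the expensive later stages run on only a small fraction of the image.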

Viola-Jones results
OpenCV implementation
– Fast: ~100 ms on CPU
– Not accurate: 0.45 on FDDB

New school: Region-based Convolutional Networks
Faster R-CNN, algorithm
1. Pre-trained network (CNN): extracts feature maps
2. Region proposal network (RPN)
3. RoI pooling: extracts the corresponding tensor
4. Classifier: classes (Face?) and the bounding box

Comparison: Viola-Jones vs R-FCN
Results
– 92% accuracy (R-FCN)
– 40 ms on GPU (slow)
FDDB results
Viola-Jones (OpenCV)   0.45
HOG (dlib)             0.7
R-FCN                  0.92

Face detection: how fast
We need a faster solution with the same accuracy!
Target: < 10ms

Alternative: MTCNN
Cascade of 3 CNNs
1. Resize to different scales
2. Proposal CNN -> candidates + b-boxes
3. Refine CNN -> calibration
4. Output CNN -> b-boxes + landmarks

Comparison: MTCNN vs R-FCN
MTCNN
+ Faster
+ Landmarks
- Less accurate
- No batch processing
Model   GPU inference   FDDB precision (100 errors)
R-FCN   40 ms           92%
MTCNN   17 ms           90%

What is TensorRT
NVIDIA TensorRT is a high-performance deep learning inference optimizer
Features
– Improves performance for complex networks
– FP16 & INT8 support
– Effective at small batch-sizes

TensorRT: layer optimizations
1. Vertical layer fusion
2. Horizontal fusion
3. Concat elision

TensorRT: downsides
1. Only Caffe and TensorFlow are supported
2. Fixed input/batch size
3. Only basic layers are supported

Batch processing
Problem
Image size is fixed, but
MTCNN works at different scales
Solution
Pyramid on a single image
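
The slides do not show how the pyramid is laid out, so the following is only a sketch under stated assumptions: scaled copies are placed left to right on one canvas, and the scale schedule uses values common in public MTCNN code (factor 0.709, 12-pixel network input, 20-pixel minimum face). The function names are hypothetical.

```python
def pyramid_scales(width, height, min_face=20, net_input=12, factor=0.709):
    """Scales at which the proposal network would see the image."""
    scales = []
    scale = net_input / min_face
    min_side = min(width, height) * scale
    while min_side >= net_input:
        scales.append(scale)
        scale *= factor
        min_side *= factor
    return scales


def pack_on_canvas(width, height, scales):
    """Place every scaled copy left to right on a single canvas.

    Returns (canvas_width, canvas_height, placements), where each placement
    is (x, y, w, h) of one scaled copy. A real packer would be tighter.
    """
    placements = []
    x = 0
    for s in scales:
        w = max(1, round(width * s))
        h = max(1, round(height * s))
        placements.append((x, 0, w, h))
        x += w
    canvas_height = max((p[3] for p in placements), default=0)
    return x, canvas_height, placements
```

With a fixed canvas size, the network input shape no longer depends on the scale, so one forward pass covers all scales and several canvases can be stacked into a batch.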

Batch processing
Results
– Single run
– Enables batch processing
Model                   Inference, ms
MTCNN (Caffe, Python)   17
MTCNN (Caffe, C++)      12.7
+ batch                 10.7

TensorRT: layers
Problem
No PReLU layer => the default pre-trained model can't be used
Retrained with ReLU from scratch
Model          GPU inference, ms   FDDB precision (100 errors)
MTCNN, batch   10.7                90%
+ TensorRT     8.8 (-20%)          91.2%

Face detection: inference
Target: < 10 ms
Result: 8.8 ms
Ingredients
1. MTCNN
2. Batch processing
3. TensorRT

Face recognition task
– Goal: to compare faces
– How? Learn a metric
– To enable zero-shot learning (unseen persons)
CNN -> embedding in a latent space: the same person is close, different persons are distant

Training set: MSCeleb
– Top 100k celebrities
– 10 Million images, 100 per person
– Noisy: constructed by leveraging public search engines

Small test dataset: LFW
Labeled Faces in the Wild
– 13k images from the web
– 1680 persons have >= 2 photos

Large test dataset: Megaface
– Identification under up to 1 million “distractors”
– 530 people to find

Classification
CNN -> Embedding -> Classify
– Train CNN to predict classes
– Pray for a good latent space (same person close, different persons distant)

Softmax
– Learned features are only separable, not discriminative
– The resulting features are not effective enough to keep faces of the same person close

We need metric learning
– Tightness of the cluster
– Discriminative features

Triplet loss
Features
– Identity -> single point
– Enforces a margin between persons
Anchor, Positive, Negative
– Minimize the anchor-positive distance, maximize the anchor-negative distance
– Constraint: positive + α < negative
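
The margin constraint can be written as a hinge on a single triplet. A minimal sketch, assuming squared Euclidean distances and a margin of 0.2 (both are assumptions; the slides give neither):

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once positive + margin < negative, positive otherwise."""
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

A triplet that already satisfies the constraint contributes no gradient, which is why the choice of triplets matters so much.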

Choosing triplets
Crucial problem
How to choose triplets? Useful triplets = hardest errors
Pick all positives; negatives range from too easy to hard enough
Solution
Hard-mining within a large mini-batch (>1000)

Choosing triplets: trap
Selecting the hardest negative may lead to collapse early in training
Instead of minimizing the positive distance and maximizing the negative one, the model ends up with positive ~ negative

Choosing triplets: semi-hard
Pick all positives; negatives are too easy, semi-hard, or too hard
Semi-hard: positive < negative < positive + α
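
The three zones can be expressed directly from the inequality above. A minimal sketch with hypothetical function names and an assumed margin of 0.2; distances are taken from the anchor:

```python
def classify_negative(d_pos, d_neg, margin=0.2):
    """Label a negative by where its distance falls relative to the positive."""
    if d_neg >= d_pos + margin:
        return "easy"       # constraint already satisfied, zero loss
    if d_neg > d_pos:
        return "semi-hard"  # farther than the positive, but inside the margin
    return "hard"           # closer to the anchor than the positive


def pick_semi_hard(d_pos, negative_distances, margin=0.2):
    """Return the closest semi-hard negative distance, or None if there is none."""
    candidates = [d for d in negative_distances if d_pos < d < d_pos + margin]
    return min(candidates) if candidates else None
```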

Triplet loss: summary
Overview
– Requires large batches, margin tuning
– Slow convergence
Open-source code
– Openface (Torch)
• suboptimal implementation
– Facenet, not original (TensorFlow)
                   LFW, %   Megaface
Openface (Torch)   92       -
Ours (Torch)       99.35    65
Google's Facenet   99.63    70.5

Center loss
Idea: pull the points to class centroids

Center loss: structure
– Final loss = Softmax loss + λ · Center loss
– Without the classification loss, the embedding collapses
CNN -> Embedding -> Classify -> Softmax loss
Embedding -> Center loss (pull, weighted by λ)
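
The combined objective for one sample can be sketched as below. This assumes the usual form of center loss, half the squared distance to the class center; the per-step update of the centers is omitted, and the function names are hypothetical.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy of a softmax over logits, via a stable log-sum-exp."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def center_loss(embedding, center):
    """Half the squared distance between an embedding and its class center."""
    return 0.5 * sum((e - c) ** 2 for e, c in zip(embedding, center))


def total_loss(logits, label, embedding, centers, lam):
    """Softmax loss + lambda * Center loss for a single sample."""
    return softmax_cross_entropy(logits, label) + lam * center_loss(embedding, centers[label])
```

With the softmax term removed, moving every embedding and every center to the same point gives zero loss, which is the collapse mentioned above.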

Center Loss: different lambdas
λ = 10^-7

λ = 10^-6

λ = 10^-5

Center loss: summary
Overview
– Intra-class compactness and inter-class separability
– Good performance at several other tasks
Open-source code
– Caffe (original, Megaface: 65%)
                            LFW, %   Megaface
Triplet Loss                99.35    65
Center Loss (Torch, ours)   99.60    71.7

Tricks: augmentation
Test time augmentation
– Flip image
– Compute 2 embeddings
– Average embeddings
Embedding + Flipped embedding -> Average -> Final embedding
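
The flip-and-average step can be sketched as follows. The `embed` argument stands for the recognition network and `column_sums` is a toy stand-in for it; re-normalizing the averaged vector is an assumption that the slides do not state.

```python
import math


def flip_horizontal(image):
    """Mirror an image given as a list of pixel rows."""
    return [row[::-1] for row in image]


def tta_embedding(image, embed):
    """Average the embeddings of an image and its mirror, then L2-normalize."""
    original = embed(image)
    flipped = embed(flip_horizontal(image))
    avg = [(a + b) / 2 for a, b in zip(original, flipped)]
    norm = math.sqrt(sum(v * v for v in avg))
    return [v / norm for v in avg] if norm > 0 else avg


def column_sums(image):
    """Toy 'network': sums of the first and last pixel columns."""
    return [sum(row[0] for row in image), sum(row[-1] for row in image)]
```

The result is the same for an image and its mirror, so the final embedding no longer depends on which way the face is turned in the photo.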

Tricks: alignment
Rotation
Kabsch algorithm: finds the optimal rotation matrix that minimizes the RMSD between two point sets (here, facial landmarks)
                       LFW, %   Megaface
Center Loss            99.6     71.7
Center Loss + Tricks   99.68    73
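
For 2D landmarks the Kabsch solution reduces to a single angle. A minimal sketch, assuming five landmarks and a made-up template; the general algorithm uses an SVD, which the 2D case does not need.

```python
import math


def kabsch_angle_2d(src, dst):
    """Angle in radians of the rotation that best maps src points onto dst.

    Both point sets are centered first; the least-squares rotation is then
    atan2 of the summed cross products and the summed dot products.
    """
    n = len(src)
    src_cx = sum(x for x, _ in src) / n
    src_cy = sum(y for _, y in src) / n
    dst_cx = sum(x for x, _ in dst) / n
    dst_cy = sum(y for _, y in dst) / n
    dot = cross = 0.0
    for (sx, sy), (dx, dy) in zip(src, dst):
        sx, sy = sx - src_cx, sy - src_cy
        dx, dy = dx - dst_cx, dy - dst_cy
        dot += sx * dx + sy * dy
        cross += sx * dy - sy * dx
    return math.atan2(cross, dot)


def rotate(points, angle, center=(0.0, 0.0)):
    """Rotate points by angle around center with the standard rotation matrix."""
    c, s = math.cos(angle), math.sin(angle)
    cx, cy = center
    return [(cx + c * (x - cx) - s * (y - cy), cy + s * (x - cx) + c * (y - cy))
            for x, y in points]


# Hypothetical 5-point template: two eyes, nose tip, two mouth corners.
TEMPLATE = [(30.0, 40.0), (70.0, 40.0), (50.0, 60.0), (35.0, 80.0), (65.0, 80.0)]
```

Calling `kabsch_angle_2d(detected_landmarks, TEMPLATE)` gives the angle by which the face crop should be rotated before it is passed to the recognition network.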

Shades on
At one point we used shades augmentation
How to
– Construct several sunglass textures
– Place them using landmarks

Tricks: adding Eye loss
– We can force the CNN to learn specific discriminative features
– For celebrities, eye colors are available on the Internet
CNN -> Embedding -> Person -> Softmax loss + Center loss
Embedding -> Eye color -> Eye loss

Eye loss: summary
* Adding simple features, e.g. gender, doesn't help
                       LFW, %   Megaface
Center Loss + Tricks   99.68    73
Center Loss + Eye      99.68    73.5

Angular Softmax
– ||W|| = 1, b = 0
– ||X|| = 1: embeddings lie on a sphere
– The angle discriminates
– Angular Softmax: enforce a larger angle
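
The "larger angle" requirement is usually implemented by replacing cos(θ) for the target class with a monotonic function of mθ. A minimal sketch of that function as it appears in the published A-Softmax formulation; whether the talk's implementation matches it exactly is an assumption.

```python
import math


def psi(theta, m):
    """Angular-margin replacement for cos(theta) on the target class.

    psi(theta) = (-1)^k * cos(m * theta) - 2k for theta in
    [k*pi/m, (k+1)*pi/m], which decreases monotonically on [0, pi].
    """
    k = min(int(theta * m / math.pi), m - 1)
    return (-1) ** k * math.cos(m * theta) - 2 * k
```

For m = 1 this is plain cos(θ); for larger m the target-class score drops faster with the angle, so the network has to make the angle to the correct class weight roughly m times smaller to reach the same score.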

Angular Softmax: different «m»
m=1 m=3

Angular softmax: summary
Overview
– As described in the paper: doesn't work at all
– Works using a sum of losses (m = 1..N) over training
• only on small datasets!
– Slight modification of the loss yields 74.2%
Open-source code
– Caffe (original)
                    LFW, %   Megaface
Center Loss         99.6     73
Center Loss + Eye   99.68    73.5
A-Softmax (Torch)   99.68    74.2

Metric learning: summary
Softmax < Triplet < Center < A-Softmax
A-Softmax
– With bells and whistles, better than Center loss
Center loss
– Rule of thumb: use Center loss
Overall
– Metric learning may improve classification performance

Errors after MSCeleb: children
Problem
Children all look alike
Result
Embeddings collapse to almost a single point in the space

Errors after MSCeleb: Asian faces
Problem
Face recognition performs poorly on Asian faces
Reason
The dataset doesn't contain enough photos of these categories

How to fix these errors ?
It's all about data: we need a diverse dataset!
Natural choice: avatars from social networks

A way to construct dataset
Cleaning algorithm
1. Face detection
2. Face recognition -> embeddings
3. Hierarchical clustering algorithm
4. Pick the largest cluster as a person
Iterate after each model improvement
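
Steps 3 and 4 of the cleaning algorithm can be sketched as below. The slides only say "hierarchical clustering"; single linkage cut at a fixed distance threshold is an assumption made here because it reduces to a simple union-find, and the function name is hypothetical.

```python
import math


def largest_cluster(embeddings, threshold):
    """Indices of the largest single-linkage cluster at the given distance cut.

    Two faces end up in the same cluster if a chain of pairwise distances
    <= threshold connects them.
    """
    n = len(embeddings)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(embeddings[i], embeddings[j]) <= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return max(clusters.values(), key=len)
```

The faces outside the returned cluster are treated as noise for that identity; since the embeddings improve with every retrained model, the cleaning is repeated after each improvement.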

MSCeleb dataset’s errors
MSCeleb is constructed by leveraging search engines
Error examples
– Joe Eszterhas = Mel Gibson: their public confrontation leads to the error
– Female + Male
– Asia Mix
– Random
Dataset has been shrunk from 100k to 46k celebrities (search engine -> corrected)

Results on new datasets
Datasets
– Train:
• MSCeleb (46k)
• VK-train (150k)
– Test:
• MegaVK
• Sets for children and Asians
A-Softmax on dataset   Megaface   MegaVK
MSCeleb                74.2       58.4
MSCeleb cleaned        76.2       60
+ VK                   77.5       87.5

Ensemble
CNN-1 + CNN-2 -> Concat -> Final embedding
A-Softmax model   Megaface   MegaVK
Best single       77.5       87.5
Ensemble of 2     79.6       88.6

Workaround still …
Children are still a challenge for the model

Workaround
Algorithm
1. Construct dataset with children
2. Compute average embedding
3. Every point inside the sphere – a child
4. Tighten distance threshold there
Results
This allows softening the overall threshold
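
The workaround can be sketched as a region-dependent threshold. All names and numeric values here are hypothetical, and applying the tighter threshold when either face falls inside the sphere is an assumption; the slides do not specify it.

```python
import math


def average_embedding(embeddings):
    """Component-wise mean of a set of embeddings (step 2)."""
    n = len(embeddings)
    return [sum(e[i] for e in embeddings) / n for i in range(len(embeddings[0]))]


def match_threshold(embedding, child_center, child_radius,
                    base_threshold, child_threshold):
    """Use the tighter threshold for points inside the 'child' sphere."""
    inside = math.dist(embedding, child_center) <= child_radius
    return child_threshold if inside else base_threshold


def same_person(a, b, child_center, child_radius,
                base_threshold=1.0, child_threshold=0.6):
    """Compare two embeddings with a threshold that depends on the region."""
    threshold = min(
        match_threshold(a, child_center, child_radius, base_threshold, child_threshold),
        match_threshold(b, child_center, child_radius, base_threshold, child_threshold),
    )
    return math.dist(a, b) <= threshold
```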

How to handle big dataset
It seems we could keep adding data indefinitely, but we can't.
Problems
– Memory consumption (Softmax)
– Computational costs
– A lot of noise in gradients

Softmax Approximation
Algorithm
1. Perform K-Means clustering using the current FR model
   Dataset -> K-Means -> smaller sets (e.g. Children, Women, Men)
2. Two Softmax heads:
   1. Cluster Softmax: predicts the cluster label
   2. Person Softmax: predicts the class within the true cluster (e.g. within Men)
CNN -> Embedding -> Predict cluster / Predict person
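
The two heads can be combined into one loss as sketched below. This is an illustration under assumptions: the two terms are summed with equal weight, and the person head is shown as a full logit vector that gets restricted to the true cluster, whereas a memory-saving implementation would compute only that cluster's logits.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy via a numerically stable log-sum-exp."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def two_head_loss(cluster_logits, person_logits, person_to_cluster, person_id):
    """Cluster Softmax loss + Person Softmax loss within the true cluster."""
    cluster_id = person_to_cluster[person_id]
    # Persons that K-Means put into the same cluster as the target person.
    members = [p for p, c in enumerate(person_to_cluster) if c == cluster_id]
    restricted = [person_logits[p] for p in members]
    cluster_term = cross_entropy(cluster_logits, cluster_id)
    person_term = cross_entropy(restricted, members.index(person_id))
    return cluster_term + person_term
```

Only the persons of one cluster take part in the second softmax, so its cost scales with the cluster size instead of the total number of identities.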

Pros
1. Prevents fusing of the clusters (pushes them apart)
2. Does hard-negative mining (harder negatives come from the same cluster)
3. Clusters can be specified
• Children
• Asian

Errors: blur
Problem
• Detector yields blurry photos
• Recognition forms «blurry clusters»
Solution
Laplacian – 2nd order derivative of the image

Laplacian in action
Low variance: blurry
High variance: sharp
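
The blur check can be sketched with a 4-neighbour Laplacian on a grayscale image stored as a list of rows. The kernel choice and the threshold are assumptions; a production version would typically use an image library instead of pure Python loops.

```python
def laplacian_variance(image):
    """Variance of the 4-neighbour Laplacian over the interior pixels."""
    height, width = len(image), len(image[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            lap = (image[y - 1][x] + image[y + 1][x]
                   + image[y][x - 1] + image[y][x + 1]
                   - 4 * image[y][x])
            responses.append(lap)
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


def is_blurry(image, threshold):
    """Flag an image whose Laplacian variance is below a tuned threshold."""
    return laplacian_variance(image) < threshold
```

Sharp edges produce large positive and negative second derivatives, hence high variance; a blurred image has smooth transitions and a response close to zero everywhere.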

Errors: body parts
Detection mistakes form clusters

Fixing trash clusters
There is similarity between "no faces"!
CNN -> Embedding: no features «fire»

Fixing trash clusters
«Trash» has a small embedding norm compared to faces
Motivation
Softmax loss encourages large embedding norms
Results
– ROC AUC 97%
– Better than the Laplacian for blurry photos
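
The norm-based filter can be sketched as follows. The threshold value is hypothetical and would have to be tuned on labeled data; the check must run on the raw embedding, before any L2 normalization, since normalization removes the signal.

```python
import math


def embedding_norm(embedding):
    """L2 norm of the raw (un-normalized) embedding."""
    return math.sqrt(sum(v * v for v in embedding))


def split_faces_and_trash(embeddings, min_norm):
    """Partition raw embeddings into (faces, trash) by their norm."""
    faces = [e for e in embeddings if embedding_norm(e) >= min_norm]
    trash = [e for e in embeddings if embedding_norm(e) < min_norm]
    return faces, trash
```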

Fun: new governors
Recently appointed governors are almost twins, but FR distinguishes them
Dmitriy
Gleb

Over years
Face recognition algorithm captures
similarity across years
Although we didn’t focus on the problem

Summary
1. Use TensorRT to speed up inference
2. Metric learning: use Center loss by default
3. Clean your data thoroughly
4. Understanding CNN helps to fight errors

Best avatar
Problem
How to pick an avatar for a person ?
Solution
Train model to predict awesomeness of photo

Predicting awesomeness: how to approach
Social networks – not only photos, but likes too

Predicting awesomeness: dataset
Awesomeness (A) = likes/audience
A=18% A=27% A=75%

Predicting awesomeness: summary
Results
– Mean Average Precision @5: 25%
– Data and metric are noisy => human evaluation
High score
Low score

Predicting awesomeness: incorporating into FR
One more branch in the Face Recognition CNN
Small overhead
face -> CNN -> embedding + awesomeness

Histogram loss
Idea
– Compute similarities for positive and negative pairs
– Minimize the probability that a randomly sampled positive pair has a smaller similarity than a negative one
Loss = the integral of the product of the negative distribution and the cumulative distribution function of the positive distribution (shown with a dashed line)
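
The integral can be approximated with two histograms. In this sketch the loss estimates the probability that a random positive pair is less similar than a random negative pair, so lower is better. Hard binning is used for brevity; a trainable version needs soft (interpolated) bin assignment so that gradients can flow. The bin count is an assumption.

```python
def histogram(similarities, bins):
    """Normalized histogram of cosine similarities over [-1, 1]."""
    counts = [0] * bins
    for s in similarities:
        idx = int((s + 1.0) / 2.0 * bins)
        idx = max(0, min(idx, bins - 1))  # clamp s = 1.0 and rounding noise
        counts[idx] += 1
    total = len(similarities)
    return [c / total for c in counts]


def histogram_loss(pos_sims, neg_sims, bins=10):
    """Sum over bins of the negative density times the positive CDF."""
    h_pos = histogram(pos_sims, bins)
    h_neg = histogram(neg_sims, bins)
    loss = 0.0
    cdf_pos = 0.0
    for r in range(bins):
        cdf_pos += h_pos[r]
        loss += h_neg[r] * cdf_pos
    return loss
```

Well-separated distributions give a loss near 0, fully reversed ones give 1, and no margin parameter is involved.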

Histogram loss
Results
– Got nice distributions
– Doesn't improve on the triplet results
– 97.7% on LFW
