Metric-learn,
a Scikit-learn compatible package
October 6, 2018
William de Vazelhes
wdevazelhes
william.de-vazelhes@inria.fr
1 / 48
About me:
William de Vazelhes
Engineer @Inria Lille, Magnet team, since 2017
work on metric-learn, with @bellet and @nvauquie.
Joint work with Inria Parietal team (scikit-learn developers), esp. @ogrisel,
@GaelVaroquaux, @agramfort
few contributions to scikit-learn
2 / 48
Summary
Introduction to Machine Learning with scikit-learn
Introduction to Metric Learning
Presentation of the metric-learn package
3 / 48
Definition
Machine learning is a field of computer science that uses statistical
techniques to give computer systems the ability to "learn" (e.g.,
progressively improve performance on a specific task) with data,
without being explicitly programmed. -- Wikipedia
5 / 48
Applications
6 / 48
scikit-learn: Machine Learning in Python
used by > 500,000 data scientists daily around the world
30k stars on GitHub
1000+ contributors
A lot of estimators
A lot of machine learning routines
Very detailed documentation
v0.20.0 released just a few days ago
7 / 48
Running example: Face Recognition
We have a dataset of labeled images:
'Smith' 'Cooper'
'Stevens' 'Smith'
'Stevens'
...: ...
We want to classify a new image:
? → 'Cooper'
9 / 48
Load dataset from scikit-learn
Input data: 400 greyscale images of 64 x 64 → 400 samples of 4096 features each
import numpy as np
from sklearn.datasets import fetch_olivetti_faces
dataset = fetch_olivetti_faces()
names = np.array(['Hart', 'Mcmahon', 'Cain', 'Mahoney', 'Long', 'Green', 'Vega', 'H
X, y = dataset.data, names[dataset.target]
print(X.shape, y.shape)
print(X)
print(y)
Output:
(400, 4096) (400,)
[[0.30991736 0.3677686 0.41735536 ... 0.15289256 0.16115703 0.1570248 ]
[0.45454547 0.47107437 0.5123967 ... 0.15289256 0.15289256 0.15289256]
...
[0.21487603 0.21900827 0.21900827 ... 0.57438016 0.59090906 0.60330576]
[0.5165289 0.46280992 0.28099173 ... 0.35950413 0.3553719 0.38429752]]
['Hart' 'Hart' 'Hart' 'Hart' 'Hart' 'Hart' 'Hart' 'Hart' 'Hart' 'Hart' 'Mcmahon' 'Mcmahon' '
'Mcmahon' 'Mcmahon' 'Mcmahon' 'Mcmahon' 'Mcmahon' 'Mcmahon' ... 'Mccarty' 'Mccarty' 'Rivers'
'Rivers' 'Rivers' 'Rivers' 'Rivers' 'Rivers']
10 / 48
Split between train/test
Train set: to train the ML algorithm
Test set: to simulate some unseen data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
(300, 4096) (300,)
(100, 4096) (100,)
11 / 48
Train the classifier
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X_train, y_train)
12 / 48
Predict/score on new samples
clf.predict(X_test)
array(['Villa', 'Benitez', 'Benson', 'Petersen', 'Acosta', 'Pace',
'Christian', 'Perkins', 'Green', 'Keller', 'Mahoney', 'Benson',
...
'Benitez', 'Gilmore',
'Hurst', 'Mcmahon', 'Keller', 'Vega', 'Hart', 'Porter'],
dtype='<U11')
clf.score(X_test, y_test)
0.91
13 / 48
Select hyperparameters...
Create validation set for evaluating the models
clf_1 = LogisticRegression(C=0.1)
clf_2 = LogisticRegression(C=1)
X_train_bis, X_validation, y_train_bis, y_validation = train_test_split(X_train,
for clf in [clf_1, clf_2]:
    clf.fit(X_train_bis, y_train_bis)
    print(clf.score(X_validation, y_validation))
0.96
0.9733333333333334
14 / 48
... which is easy with GridSearchCV
from sklearn.model_selection import GridSearchCV
clf = LogisticRegression()
grid = {'C': [0.1, 1, 5], 'penalty': ['l1', 'l2']}
clf = GridSearchCV(clf, grid)
clf.fit(X_train, y_train)
print(clf.best_params_)
print(clf.best_score_)
{'C': 5, 'penalty': 'l2'}
0.9633333333333334
15 / 48
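The grid search above can be tried without downloading the faces dataset. The following is a minimal sketch on synthetic data, assuming a recent scikit-learn; the `make_classification` settings, `max_iter`, `cv` and `random_state` values are illustrative choices, not part of the talk. It searches over `C` only: in recent scikit-learn releases the default `LogisticRegression` solver does not support the 'l1' penalty, so searching over `penalty` as well would require choosing a compatible solver.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification problem (illustrative stand-in for the faces data).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Cross-validated search over the regularization strength C.
grid = {'C': [0.1, 1, 5]}
search = GridSearchCV(LogisticRegression(max_iter=1000), grid, cv=3)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.best_score_)

# The refit best estimator is evaluated once on the held-out test set.
test_accuracy = search.score(X_test, y_test)
print(test_accuracy)
```

The test set is touched only once, after the search, so the reported accuracy is not biased by the hyperparameter selection.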
Summary
Introduction to Machine Learning with scikit-learn
Introduction to Metric Learning
Presentation of the metric-learn package
16 / 48
Face matching for access authorization
Many people in an organisation, but only a few pictures each
Incoming picture: does it match some member ?
Also have a huge database of unlabeled images from a lot of people (from
a faces database)
Mechanical Turk workers labeled pairs of images as "same person"/"different persons"
(labeling each image directly is hard)
https://www.facefirst.com/wp-content/uploads/2018/04/Screen-Shot-2018-04-26-at-4.12.56-PM.png
17 / 48
Learn a good metric
Learn a metric $d$ that puts similar points closer and dissimilar points
further apart
18 / 48
Applications of Metric Learning
Image sources:
https://proxy.duckduckgo.com/iu/?u=https%3A%2F%2Fwww.computerhope.com%2Fjargon%2Ff%2Fface-id-truedepth-camera.jpg&f=1.jpg
https://rrc.ru/upload/splunk/splunk-workshop/Discovery%20Day%20Russia%20-%20Machine%20Learning.pdf
https://i2.wp.com/www.touahria.com/wp-
19 / 48
Loading pairs of images
Dataset: Pairs of similar points and dissimilar points
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_lfw_pairs
dataset = fetch_lfw_pairs()
pairs = dataset.pairs
# map the {0, 1} targets to {-1, +1}
y = 2 * dataset.target - 1
# show the two images of the first pair side by side
for i in range(2):
    plt.subplot(1, 2, i+1)
    plt.imshow(pairs[0, i, :, :], cmap='Greys_r')
print(y[0])
1
20 / 48
Loading pairs of images
pairs = pairs.reshape(pairs.shape[0], 2, -1)
print(pairs)
print(y)
[[[ 73.666664 70.666664 81.666664 ... 152. 159.66667 155. ]
[ 66. 74.333336 84.333336 ... 225.66667 229.66667 233.33333 ]]
[[ 86.333336 113.333336 133.33333 ... 157.66667 87.333336 49.666668]
[109. 92.666664 114.333336 ... 106. 114.333336 122.333336]]
[[ 37.333332 35.333332 34. ... 192.33333 197. 198. ]
[ 24. 28.333334 32. ... 51.333332 52.333332 52. ]]
...
[[ 73. 94.333336 121.333336 ... 226.66667 229. 227.66667 ]
[ 23. 20.333334 21.333334 ... 64. 71. 82.333336]]
[[119. 110.333336 112.666664 ... 244.33333 239.66667 230.33333 ]
[106.333336 94.333336 88.333336 ... 145.33333 130. 102.333336]]
[[ 23.333334 20. 23.333334 ... 190.33333 187.66667 174.66667 ]
[ 34.666668 44.666668 70. ... 146.33333 151. 159. ]]]
[ 1 1 1 ... -1 -1 -1]
21 / 48
Split between train and test
pairs_train, pairs_test, y_train, y_test = train_test_split(pairs, y)
Illustration (toy example): each row is one pair of samples with its label; whole pairs are assigned to train or test.
[3.2, 6.8, 9.1] [2.5, 1.8, 2.5]    1
[3.1, 6.7, 1.8] [3.2, 6.8, 9.1]   -1
[3.5, 4.9, 1.0] [8.5, 7.2, 9.0]    1
[4.5, 9.0, 4.2] [3.8, 6.4, 2.6]    1
22 / 48
How do you learn on this data?
Example: Mahalanobis Metric for Clustering (MMC)
Parameters to learn: a transformation matrix $L$
That transforms $x_i$ into a new representation $L x_i$
Associated metric: $\|L x_i - L x_j\|$, the euclidean distance in the new space
Problem to solve:
$$\min_L \sum_{(x_i, x_j) \in S} \|L x_i - L x_j\|^2 \quad \text{s.t.} \quad \sum_{(x_i, x_j) \in D} \|L x_i - L x_j\| \geq 1$$
23 / 48
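To make the MMC problem concrete, here is a small NumPy sketch that evaluates the objective (sum of squared distances over the similar pairs S) and the constraint (sum of distances over the dissimilar pairs D) for a given matrix L. The function names, the three toy points and the hand-picked candidate matrix are assumptions made for illustration; this only evaluates the two quantities and is not the solver used by metric-learn.

```python
import numpy as np

def mmc_objective(L, X, similar_pairs):
    """Sum of squared distances ||L x_i - L x_j||^2 over the similar pairs S."""
    diffs = X[similar_pairs[:, 0]] - X[similar_pairs[:, 1]]
    return float(np.sum((diffs @ L.T) ** 2))

def mmc_constraint(L, X, dissimilar_pairs):
    """Sum of distances ||L x_i - L x_j|| over the dissimilar pairs D (must be >= 1)."""
    diffs = X[dissimilar_pairs[:, 0]] - X[dissimilar_pairs[:, 1]]
    return float(np.sum(np.linalg.norm(diffs @ L.T, axis=1)))

X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0]])
S = np.array([[0, 2]])   # x_0 and x_2 are similar
D = np.array([[0, 1]])   # x_0 and x_1 are dissimilar

identity = np.eye(2)             # plain euclidean distance
candidate = np.diag([0.5, 2.0])  # shrinks the first axis, stretches the second

obj_identity = mmc_objective(identity, X, S)
obj_candidate = mmc_objective(candidate, X, S)
con_identity = mmc_constraint(identity, X, D)
con_candidate = mmc_constraint(candidate, X, D)
print(obj_identity, obj_candidate, con_identity, con_candidate)
```

Both matrices satisfy the constraint, but the candidate has the lower objective, so it would be preferred over the identity.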
What can you do with this learned metric ?
KNN classification: find the nearest neighbors of some $x_i$ w.r.t. the
learned metric
Clustering: use the learned metric to cluster together similar samples
...
24 / 48
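As an illustration of nearest-neighbor search under a learned metric, the sketch below ranks candidates by the distance between their images under L. The helper name `nearest_neighbors` and the matrix `np.diag([0.5, 2.0])`, which stands in for a learned L, are assumptions for the example.

```python
import numpy as np

def nearest_neighbors(L, X, query, k=1):
    """Indices of the k rows of X closest to `query` under d(x, x') = ||L x - L x'||."""
    distances = np.linalg.norm((X - query) @ L.T, axis=1)
    return np.argsort(distances)[:k]

X = np.array([[0.0, 1.0], [2.0, 0.0]])
query = np.array([0.0, 0.0])

# With the euclidean metric the first candidate is closest;
# with the stand-in learned metric the ranking is reversed.
euclidean_nn = nearest_neighbors(np.eye(2), X, query)
learned_nn = nearest_neighbors(np.diag([0.5, 2.0]), X, query)
print(euclidean_nn, learned_nn)
```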
Summary
Introduction to Machine Learning with scikit-learn
Introduction to Metric Learning
Presentation of the metric-learn package
25 / 48
Introduction
created by CJ Carey (@perimosocordiae) and Yuan Tang (@terrytangyuan)
472 stars on GitHub
9 algorithms
documentation
13 contributors:
perimosocordiae 4,601 ++ 3,211 --
terrytangyuan 1,268 ++ 218 --
bhargavvader 897 ++ 26 --
wdevazelhes 706 ++ 213 --
Callidior 635 ++ 38 --
svecon 458 ++ 143 --
dsquareindia 141 ++ 1 --
ab-anssi 102 ++ 38 --
anirudt 6 ++ 0 --
arikpoz 4 ++ 2 --
toto 3 ++ 3 --
shalan 1 ++ 1 --
michaelstewart 1 ++ 1 --
+ other contributions
26 / 48
Introduction
Metric-learn v0.4.0 was released one month ago
But it is not yet compatible with scikit-learn
Rest of the talk: about v0.5.0 (release in a few weeks)
27 / 48
Challenge: make it scikit-learn compatible
28 / 48
Sklearn compatibility
After loading and splitting we had pairs of samples with labels 1, -1, 1, 1, partitioned into train and test.
Concretely represented by an array of pairs and a vector of labels:
[3.2, 6.8, 9.1] [2.5, 1.8, 2.5]    1
[3.1, 6.7, 1.8] [3.2, 6.8, 9.1]   -1
[3.5, 4.9, 1.0] [8.5, 7.2, 9.0]    1
[4.5, 9.0, 4.2] [3.8, 6.4, 2.6]    1
29 / 48
Sklearn compatibility
Scikit-learn routines work with this format !
from metric_learn import MMC
from sklearn.model_selection import GridSearchCV
grid = {'alpha': [0.1, 1, 10]}
mmc = MMC()
metric_learner = GridSearchCV(mmc, grid)
metric_learner.fit(pairs_train, y_train)
But this 3D array is very redundant: a sample that appears in several pairs
is duplicated in each of them
31 / 48
Sklearn compatibility
Other solution: 2D arrays of indices
First argument of the metric learner is now indices (2D array of indices)
Give also the X array when initializing the metric learner
Pair indices (2D array) with their labels, partitioned into train and test:
[0, 3]    1
[4, 0]   -1
[1, 5]    1
[6, 7]    1
X array (row i is sample i):
[3.2, 6.8, 9.1]
[3.5, 4.9, 1.0]
[1.5, 2.9, 4.0]
[2.5, 1.8, 2.5]
[3.1, 6.7, 1.8]
[8.5, 7.2, 9.0]
[4.5, 9.0, 4.2]
[3.8, 6.4, 2.6]
32 / 48
Sklearn compatibility
Other solution: 2D arrays of indices
from metric_learn import MMC
from sklearn.model_selection import GridSearchCV
grid = {'alpha': [0.1, 1, 10]}
mmc = MMC(preprocessor=data)
metric_learner = GridSearchCV(mmc, grid)
metric_learner.fit(pairs_train_indices, y_train)
33 / 48
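The relation between the two formats can be checked with plain NumPy indexing, using the toy values from the slides. This is only a sketch of the idea: that an array preprocessor amounts to this kind of lookup is an assumption, and the variable names are illustrative.

```python
import numpy as np

# One row per sample; each sample is stored once.
X = np.array([[3.2, 6.8, 9.1], [3.5, 4.9, 1.0], [1.5, 2.9, 4.0], [2.5, 1.8, 2.5],
              [3.1, 6.7, 1.8], [8.5, 7.2, 9.0], [4.5, 9.0, 4.2], [3.8, 6.4, 2.6]])
pair_indices = np.array([[0, 3], [4, 0], [1, 5], [6, 7]])
y = np.array([1, -1, 1, 1])

# Indexing with an (n_pairs, 2) integer array yields an (n_pairs, 2, n_features) array.
pairs = X[pair_indices]
print(pairs.shape)
print(pairs[1])
```

Sample 0 appears in two pairs, so it is copied twice in `pairs` but stored only once in `X`.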
Sklearn compatibility
Other solution: 2D arrays of indices
Other example of accepted data:
path_pairs_train = [['img_1.png', 'img_2.png'], ['img_2.png', 'img_4.png'], ['img_2
root = '~/images'
itml = ITML(preprocessor=ImgLoader(root))
itml.fit(path_pairs_train, y_train)
34 / 48
Sklearn compatibility
Note
Pairs will be formed batch-wise from indices inside the algorithm:
def fit(self, indices, y):
    # accumulate the update over batches so the full array of pairs is never built
    weights_update = np.zeros((d, d))
    for indices_batch in yield_batches(indices):
        weights_update += some_computation(preprocessor(indices_batch))
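A runnable version of this idea might look like the following sketch. The names `yield_batches` and `accumulate_scatter`, the batch size, and the choice of a sum of outer products as the per-batch computation are all assumptions made for illustration; they are not taken from the metric-learn code base.

```python
import numpy as np

def yield_batches(indices, batch_size):
    """Yield consecutive slices of the (n_pairs, 2) index array."""
    for start in range(0, len(indices), batch_size):
        yield indices[start:start + batch_size]

def accumulate_scatter(X, indices, batch_size=2):
    """Sum of (x_i - x_j)(x_i - x_j)^T over all pairs, built one batch at a time."""
    d = X.shape[1]
    total = np.zeros((d, d))
    for indices_batch in yield_batches(indices, batch_size):
        pairs_batch = X[indices_batch]            # shape (batch, 2, d)
        diffs = pairs_batch[:, 0] - pairs_batch[:, 1]
        total += diffs.T @ diffs
    return total

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
indices = np.array([[0, 3], [4, 0], [1, 5], [6, 7], [2, 6]])

batched = accumulate_scatter(X, indices, batch_size=2)

# Reference: the same quantity computed from all pairs at once.
all_diffs = X[indices[:, 0]] - X[indices[:, 1]]
full = all_diffs.T @ all_diffs
print(np.allclose(batched, full))
```

Only one batch of pairs is materialized at a time, which is the memory saving the preprocessor is meant to give.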
35 / 48
Package Overview
36 / 48
Algorithms
Fully Supervised:
classification: NCA, LMNN, LFDA, Covariance
regression: MLKR
Weakly Supervised:
pairs: MMC, ITML, SDML
quadruplets: LSML
Every pairs/quadruplets based algorithm comes with a *_Supervised version
that creates pairs/quadruplets on the fly
37 / 48
Quadruplets based algorithms
"A is more similar to B than C is to D"
less supervision: relative similarity judgments (you do not explicitly "force"
some similarities to be small or large)
notion of ordering between pairwise similarities
38 / 48
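The quadruplet judgment "A is more similar to B than C is to D" can be sketched as a comparison of two distances under the metric. The function name `predict_quadruplets`, the (A, B, C, D) ordering along the second axis and the stand-in matrix are assumptions for the example.

```python
import numpy as np

def predict_quadruplets(L, quadruplets):
    """+1 where d(A, B) < d(C, D) under the metric ||L x - L x'||, else -1.

    `quadruplets` has shape (n_quadruplets, 4, n_features), ordered (A, B, C, D).
    """
    embedded = quadruplets @ L.T
    d_ab = np.linalg.norm(embedded[:, 0] - embedded[:, 1], axis=1)
    d_cd = np.linalg.norm(embedded[:, 2] - embedded[:, 3], axis=1)
    return np.where(d_ab < d_cd, 1, -1)

# A = (0, 0), B = (0, 1), C = (0, 0), D = (2, 0)
quadruplets = np.array([[[0.0, 0.0], [0.0, 1.0], [0.0, 0.0], [2.0, 0.0]]])

pred_euclidean = predict_quadruplets(np.eye(2), quadruplets)
pred_stretched = predict_quadruplets(np.diag([0.5, 2.0]), quadruplets)
print(pred_euclidean, pred_stretched)
```

Only the ordering of the two distances matters, so no threshold is needed, unlike for pairs.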
Weakly Supervised Learners
Scoring pairs/quadruplets based algorithms
for all metric learners (even supervised ones):
score_pairs: returns a similarity score
for pairs learners:
predict: +1 or -1 according to similar or not (uses threshold)
benefit from accuracy, roc_auc, from scikit-learn
for quadruplets learners:
predict +1 if A is more similar to B than C is to D, -1 otherwise
benefit from accuracy, roc_auc, from scikit-learn
40 / 48
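For pairs learners, thresholded prediction and accuracy scoring can be sketched as follows. The helper names, the use of a distance (rather than a similarity) as the score, and the threshold value are assumptions for illustration, not the package's API.

```python
import numpy as np

def pair_distances(L, pairs):
    """Distance ||L x - L x'|| for each pair in an (n_pairs, 2, n_features) array."""
    embedded = pairs @ L.T
    return np.linalg.norm(embedded[:, 0] - embedded[:, 1], axis=1)

def predict_pairs(L, pairs, threshold):
    """+1 (similar) when the distance is at most `threshold`, -1 (dissimilar) otherwise."""
    return np.where(pair_distances(L, pairs) <= threshold, 1, -1)

pairs = np.array([[[0.0, 0.0], [0.0, 1.0]],    # euclidean distance 1
                  [[0.0, 0.0], [2.0, 0.0]],    # euclidean distance 2
                  [[1.0, 1.0], [1.0, 1.5]]])   # euclidean distance 0.5
y_true = np.array([1, -1, 1])

y_pred = predict_pairs(np.eye(2), pairs, threshold=1.5)
accuracy = float(np.mean(y_pred == y_true))
print(y_pred, accuracy)
```

Because the predictions are plain +1/-1 labels, standard classification scorers such as accuracy apply directly.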
Mahalanobis metric learning (c.f. MMC before)
For now: all algorithms define a euclidean distance in an embedding space
that is obtained through a linear transformation $L$:
metric: $\|L x_i - L x_j\|$
All have the transform method
They can do dimensionality reduction
mmc.fit(pairs_train, y_train)
mmc.transform(X_test)
# result is an array of shape (X_test.shape[0], dim_output)
42 / 48
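What such a linear transform computes can be sketched in NumPy. Here `L` is a random matrix standing in for a learned one, and the shapes are arbitrary illustrative choices. The sketch also checks that the euclidean distance in the embedding equals the Mahalanobis form with M = L^T L.

```python
import numpy as np

rng = np.random.default_rng(0)
X_test = rng.normal(size=(5, 4))      # 5 samples, 4 features
L = rng.normal(size=(2, 4))           # dim_output = 2, stand-in for a learned matrix

# Linear embedding: each row x becomes L x, reducing 4 features to 2.
X_embedded = X_test @ L.T

# Distance in the embedding vs. Mahalanobis distance with M = L^T L.
M = L.T @ L
diff = X_test[0] - X_test[1]
dist_embedded = np.linalg.norm(X_embedded[0] - X_embedded[1])
dist_mahalanobis = np.sqrt(diff @ M @ diff)
print(X_embedded.shape, np.isclose(dist_embedded, dist_mahalanobis))
```

Choosing fewer rows than columns for L is what gives the dimensionality reduction.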
Testing and Continuous Integration
def test_fit_mmc():
    ???
We do not know in advance what we want to test
But fortunately:
We know some properties of the objects we work with
testing the gradient: can compare with a finite-difference approximation (scipy.optimize.check_grad)
test that a transformation is indeed linear: f(ax+by) = a f(x) + b f(y)
...
We can use toy examples
43 / 48
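The two property checks can be sketched with NumPy alone. The objective below (sum of squared distances over pairs) and its analytic gradient are chosen for illustration, and the finite-difference helper is hand-written; `scipy.optimize.check_grad`, mentioned on the slide, performs a comparable comparison.

```python
import numpy as np

def objective(L_flat, diffs, shape):
    """f(L) = sum_k ||L d_k||^2, with L passed as a flat vector."""
    L = L_flat.reshape(shape)
    return float(np.sum((diffs @ L.T) ** 2))

def gradient(L_flat, diffs, shape):
    """Analytic gradient of f: 2 L (sum_k d_k d_k^T), flattened."""
    L = L_flat.reshape(shape)
    return (2.0 * L @ (diffs.T @ diffs)).ravel()

def finite_difference_gradient(f, x, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(x)
    for k in range(x.size):
        step = np.zeros_like(x)
        step[k] = eps
        grad[k] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
shape = (2, 3)
diffs = rng.normal(size=(6, 3))       # differences x_i - x_j of six pairs
L0 = rng.normal(size=shape).ravel()

analytic = gradient(L0, diffs, shape)
numeric = finite_difference_gradient(lambda v: objective(v, diffs, shape), L0)
gradient_error = float(np.max(np.abs(analytic - numeric)))

# Linearity of x -> L x: f(a x + b y) == a f(x) + b f(y)
L = L0.reshape(shape)
x, y_vec = rng.normal(size=3), rng.normal(size=3)
a, b = 2.0, -0.5
is_linear = bool(np.allclose(L @ (a * x + b * y_vec), a * (L @ x) + b * (L @ y_vec)))
print(gradient_error < 1e-5, is_linear)
```

Neither check needs to know the "right" learned matrix, which is what makes them usable as unit tests.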
Designing toy examples
Simple example that exhibits a property that you can test:
Ex: 3 points in 2D (not collinear), with $x_0$ and $x_1$ close but they shouldn't be, and $x_0$ and $x_2$ far but they shouldn't be
import numpy as np
from metric_learn import MMC

def test_mmc_toy_example():
    data = np.array([[0, 0], [0, 1], [2, 0]])
    pairs = np.array([[0, 1], [0, 2]])  # indices into data
    y = np.array([-1, 1])  # (x_0, x_1) dissimilar, (x_0, x_2) similar
    mmc = MMC(preprocessor=data)
    mmc.fit(pairs, y)
    data_transformed = mmc.transform(data)
    # after learning, the dissimilar pair must be further apart than the similar one
    assert (np.linalg.norm(data_transformed[1] - data_transformed[0]) >
            np.linalg.norm(data_transformed[2] - data_transformed[0]))
44 / 48
Recap: v0.5.0 (in a few weeks)
scikit-learn compatibility (cross-validation, GridSearchCV...)
"Preprocessor" to reduce memory consumption
Next steps
submit to sklearn-contrib
stochastic optimizers for scaling up
more choice to form pairs/quadruplets from labeled data
general functions like regularizers etc
more testing
more documentation, incl. examples
...
45 / 48
Conclusion
Metric learning: learn similarities from weakly supervised information
Many use cases
open source package metric-learn
v0.5.0: compatibility with scikit-learn
46 / 48
Check it out !
open source
raise issues
submit PRs
any contribution is welcome !
47 / 48
Questions ?
Contact
william.de-vazelhes@inria.fr
48 / 48
