Metric-learn, a Scikit-learn compatible package

October 6, 2018
William de Vazelhes
wdevazelhes
william.de-vazelhes@inria.fr
1 / 48

About me:
William de Vazelhes
Engineer @Inria Lille, Magnet team, since 2017
Work on metric-learn with @bellet and @nvauquie
Joint work with the Inria Parietal team (scikit-learn developers), esp. @ogrisel, @GaelVaroquaux, @agramfort
A few contributions to scikit-learn
2 / 48

Summary
Introduction to Machine Learning with scikit-learn
Introduction to Metric Learning
Presentation of the metric-learn package
3 / 48

Summary: Introduction to Machine Learning with scikit-learn
4 / 48

Definition
Machine learning is a field of computer science that uses statistical
techniques to give computer systems the ability to "learn" (e.g.,
progressively improve performance on a specific task) with data,
without being explicitly programmed. -- Wikipedia
5 / 48

scikit-learn: Machine Learning in Python
used by > 500,000 data scientists daily around the world
30k stars on GitHub
1000+ contributors
A lot of estimators
A lot of machine learning routines
Very detailed documentation
v0.20.0 released just a few days ago
7 / 48

Running example: Face Recognition
We have a dataset of labeled images:
'Smith' 'Cooper'
'Stevens' 'Smith'
'Stevens'
...: ...
We want to classify a new image:
? → 'Cooper'
9 / 48

Load dataset from scikit-learn
Input data: 400 greyscale images of 64 x 64 → 400 samples of 4096 features each
import numpy as np
from sklearn.datasets import fetch_olivetti_faces
dataset = fetch_olivetti_faces()
names = np.array(['Hart', 'Mcmahon', 'Cain', 'Mahoney', 'Long', 'Green', 'Vega', 'H
X, y = dataset.data, names[dataset.target]
print(X.shape, y.shape)
print(X)
print(y)
(400, 4096) (400,)
[[0.30991736 0.3677686 0.41735536 ... 0.15289256 0.16115703 0.1570248 ]
[0.45454547 0.47107437 0.5123967 ... 0.15289256 0.15289256 0.15289256]
...
[0.21487603 0.21900827 0.21900827 ... 0.57438016 0.59090906 0.60330576]
[0.5165289 0.46280992 0.28099173 ... 0.35950413 0.3553719 0.38429752]]
['Hart' 'Hart' 'Hart' 'Hart' 'Hart' 'Hart' 'Hart' 'Hart' 'Hart' 'Hart' 'Mcmahon' 'Mcmahon' '
'Mcmahon' 'Mcmahon' 'Mcmahon' 'Mcmahon' 'Mcmahon' 'Mcmahon' ... 'Mccarty' 'Mccarty' 'Rivers'
'Rivers' 'Rivers' 'Rivers' 'Rivers' 'Rivers']
10 / 48

Split between train/test
Train set: to train the ML algorithm
Test set: to simulate some unseen data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
(300, 4096) (300,)
(100, 4096) (100,)
11 / 48

Train the classifier
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X_train, y_train)
12 / 48

Predict/score on new samples
clf.predict(X_test)
array(['Villa', 'Benitez', 'Benson', 'Petersen', 'Acosta', 'Pace',
'Christian', 'Perkins', 'Green', 'Keller', 'Mahoney', 'Benson',
...
'Benitez', 'Gilmore',
'Hurst', 'Mcmahon', 'Keller', 'Vega', 'Hart', 'Porter'],
dtype='<U11')
clf.score(X_test, y_test)
0.91
13 / 48

Select hyperparameters...
Create validation set for evaluating the models
clf_1 = LogisticRegression(C=0.1)
clf_2 = LogisticRegression(C=1)
X_train_bis, X_validation, y_train_bis, y_validation = train_test_split(X_train,
for clf in [clf_1, clf_2]:
    clf.fit(X_train_bis, y_train_bis)
    print(clf.score(X_validation, y_validation))
0.96
0.9733333333333334
14 / 48

... which is easy with GridSearchCV
from sklearn.model_selection import GridSearchCV
clf = LogisticRegression()
grid = {'C': [0.1, 1, 5], 'penalty': ['l1', 'l2']}
clf = GridSearchCV(clf, grid)
clf.fit(X_train, y_train)
print(clf.best_params_)
print(clf.best_score_)
{'C': 5, 'penalty': 'l2'}
0.9633333333333334
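
A minimal, self-contained sketch of the same grid search. It assumes a synthetic dataset from make_classification in place of the Olivetti faces (so nothing has to be downloaded) and an explicit solver='liblinear', which the slide does not specify but which accepts both penalties:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary problem standing in for the faces data (assumption).
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# liblinear supports both the 'l1' and the 'l2' penalty.
clf = LogisticRegression(solver='liblinear')
grid = {'C': [0.1, 1, 5], 'penalty': ['l1', 'l2']}
search = GridSearchCV(clf, grid, cv=3)
search.fit(X_train, y_train)

print(search.best_params_)           # best combination found by cross-validation
print(search.best_score_)            # mean cross-validated accuracy of that combination
print(search.score(X_test, y_test))  # accuracy of the refitted model on held-out data
```

The test set is only used once, after the search, so that the reported score still simulates unseen data.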
15 / 48

Summary: Introduction to Metric Learning
16 / 48

Face matching for access authorization
Many people in an organisation, but only a few pictures each
Incoming picture: does it match some member ?
Also have a huge database of unlabeled images from a lot of people (from
a faces database)
Mechanical Turk workers labeled pairs of images as "same person"/"different persons"
(labeling each image directly is hard)
17 / 48

Learn a good metric
Learn a metric $d$ that puts similar points closer together and dissimilar points
further apart
18 / 48

Applications of Metric Learning
19 / 48

Loading pairs of images
Dataset: Pairs of similar points and dissimilar points
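import matplotlib.pyplot as plt
from sklearn.datasets import fetch_lfw_pairs
dataset = fetch_lfw_pairs()
pairs = dataset.pairs
# Map the 0/1 targets to -1 (different persons) / +1 (same person)
y = 2 * dataset.target - 1
# Show the two images of the first pair side by side
for i in range(2):
    plt.subplot(1, 2, i+1)
    plt.imshow(pairs[0, i, :, :], cmap='Greys_r')
print(y[0])
1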
20 / 48

Loading pairs of images
pairs = pairs.reshape(pairs.shape[0], 2, -1)
print(pairs)
print(y)
[[[ 73.666664 70.666664 81.666664 ... 152. 159.66667 155. ]
[ 66. 74.333336 84.333336 ... 225.66667 229.66667 233.33333 ]]
[[ 86.333336 113.333336 133.33333 ... 157.66667 87.333336 49.666668]
[109. 92.666664 114.333336 ... 106. 114.333336 122.333336]]
[[ 37.333332 35.333332 34. ... 192.33333 197. 198. ]
[ 24. 28.333334 32. ... 51.333332 52.333332 52. ]]
...
[[ 73. 94.333336 121.333336 ... 226.66667 229. 227.66667 ]
[ 23. 20.333334 21.333334 ... 64. 71. 82.333336]]
[[119. 110.333336 112.666664 ... 244.33333 239.66667 230.33333 ]
[106.333336 94.333336 88.333336 ... 145.33333 130. 102.333336]]
[[ 23.333334 20. 23.333334 ... 190.33333 187.66667 174.66667 ]
[ 34.666668 44.666668 70. ... 146.33333 151. 159. ]]]
[ 1 1 1 ... -1 -1 -1]
21 / 48

Split between train and test
pairs_train, pairs_test, y_train, y_test = train_test_split(pairs, y)
Each row is one pair of samples with its label; the rows are divided between the train set and the test set:
pairs                                    y
[[3.2, 6.8, 9.1], [2.5, 1.8, 2.5]]       1
[[3.1, 6.7, 1.8], [3.2, 6.8, 9.1]]      -1
[[3.5, 4.9, 1.0], [8.5, 7.2, 9.0]]       1
[[4.5, 9.0, 4.2], [3.8, 6.4, 2.6]]       1
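
A short sketch of that split on the toy values from the slide. The test_size and random_state arguments are assumptions added to make the result reproducible; the point is only that train_test_split slices a 3D array of pairs along its first axis, so whole pairs go to one side or the other:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 4 pairs of samples with 3 features each: shape (n_pairs, 2, n_features).
pairs = np.array([[[3.2, 6.8, 9.1], [2.5, 1.8, 2.5]],
                  [[3.1, 6.7, 1.8], [3.2, 6.8, 9.1]],
                  [[3.5, 4.9, 1.0], [8.5, 7.2, 9.0]],
                  [[4.5, 9.0, 4.2], [3.8, 6.4, 2.6]]])
y = np.array([1, -1, 1, 1])

pairs_train, pairs_test, y_train, y_test = train_test_split(
    pairs, y, test_size=0.25, random_state=0)

print(pairs_train.shape, y_train.shape)
print(pairs_test.shape, y_test.shape)
```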
22 / 48

How do you learn on this data?
Example: Mahalanobis Metric for Clustering (MMC)
Parameters to learn: a transformation matrix $L$
That transforms $x_i$ into a new representation $L x_i$
Associated metric: $\|L x_i - L x_j\|$, the Euclidean distance in the new space
Problem to solve:
$$\min_L \sum_{(x_i, x_j) \in S} \|L x_i - L x_j\|^2 \quad \text{s.t.} \quad \sum_{(x_i, x_j) \in D} \|L x_i - L x_j\| \geq 1$$
where $S$ is the set of similar pairs and $D$ the set of dissimilar pairs
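
To make the optimization problem concrete, here is a small NumPy sketch that only evaluates the objective and the constraint for a given matrix L; it does not solve the problem. The helper names mmc_objective and mmc_constraint, the toy points, and the two candidate matrices are all assumptions made for illustration:

```python
import numpy as np

def mmc_objective(L, X, similar):
    """Sum of squared distances ||L x_i - L x_j||^2 over the similar pairs S."""
    diffs = (X[similar[:, 0]] - X[similar[:, 1]]) @ L.T
    return float(np.sum(diffs ** 2))

def mmc_constraint(L, X, dissimilar):
    """Sum of distances ||L x_i - L x_j|| over the dissimilar pairs D (should be >= 1)."""
    diffs = (X[dissimilar[:, 0]] - X[dissimilar[:, 1]]) @ L.T
    return float(np.sum(np.linalg.norm(diffs, axis=1)))

X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0]])
S = np.array([[0, 2]])  # x_0 and x_2 are labeled similar
D = np.array([[0, 1]])  # x_0 and x_1 are labeled dissimilar

L_identity = np.eye(2)              # plain Euclidean distance
L_candidate = np.diag([0.1, 1.0])   # shrinks the first axis

identity_objective = mmc_objective(L_identity, X, S)
candidate_objective = mmc_objective(L_candidate, X, S)
candidate_constraint = mmc_constraint(L_candidate, X, D)
print(identity_objective, candidate_objective, candidate_constraint)
```

Shrinking the axis along which the similar pair differs lowers the objective while the dissimilar pair keeps its distance, which is the kind of trade-off the solver searches for.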
23 / 48

What can you do with this learned metric ?
KNN classification: find the nearest neighbors of some $x_i$ w.r.t. the
learned metric
Clustering: use the learned metric to cluster together similar samples
...
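
A small sketch of the KNN use case. It assumes a hand-picked matrix L (not one learned by any algorithm) and two toy training points; since the metric is the Euclidean distance after applying L, a standard KNeighborsClassifier can simply be fitted on the transformed data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[0.0, 0.0], [10.0, 1.0]])
y_train = np.array(['a', 'b'])
x_query = np.array([[10.0, 0.1]])

# Hand-picked for illustration: strongly down-weights the first feature.
L = np.array([[0.01, 0.0],
              [0.0, 1.0]])

# Nearest neighbor with the plain Euclidean distance.
knn_euclidean = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
pred_euclidean = knn_euclidean.predict(x_query)[0]

# Nearest neighbor w.r.t. the metric ||L x_i - L x_j||: transform, then use Euclidean KNN.
knn_metric = KNeighborsClassifier(n_neighbors=1).fit(X_train @ L.T, y_train)
pred_metric = knn_metric.predict(x_query @ L.T)[0]

print(pred_euclidean, pred_metric)
```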
24 / 48

Summary: Presentation of the metric-learn package
25 / 48

Introduction
created by CJ Carey (@perimosocordiae) and Yuan Tang (@terrytangyuan)
472 stars on GitHub
9 algorithms
documentation
13 contributors:
perimosocordiae 4,601 ++ 3,211 --
terrytangyuan 1,268 ++ 218 --
bhargavvader 897 ++ 26 --
wdevazelhes 706 ++ 213 --
Callidior 635 ++ 38 --
svecon 458 ++ 143 --
dsquareindia 141 ++ 1 --
ab-anssi 102 ++ 38 --
anirudt 6 ++ 0 --
arikpoz 4 ++ 2 --
toto 3 ++ 3 --
shalan 1 ++ 1 --
michaelstewart 1 ++ 1 --
+ other contributions
26 / 48

Introduction
Metric-learn v0.4.0 just released 1 month ago
But not yet compatible with scikit-learn
Rest of the talk: about v0.5.0 (release in a few weeks)
27 / 48

Challenge: make it scikit-learn compatible
28 / 48

Sklearn compatibility
After loading and splitting we had pairs of images, each with a label (1 or -1), divided between a train set and a test set.
Concretely represented by a 3D array of pairs and an array of labels:
pairs                                    y
[[3.2, 6.8, 9.1], [2.5, 1.8, 2.5]]       1
[[3.1, 6.7, 1.8], [3.2, 6.8, 9.1]]      -1
[[3.5, 4.9, 1.0], [8.5, 7.2, 9.0]]       1
[[4.5, 9.0, 4.2], [3.8, 6.4, 2.6]]       1
29 / 48

Scikit-learn routines work with this format !
from metric_learn import MMC
grid = {'alpha': [0.1, 1, 10]}
mmc = MMC()
metric_learner = GridSearchCV(mmc, grid)
metric_learner.fit(pairs_train, y_train)
30 / 48

Scikit-learn routines work with this format !
grid = {'alpha': [0.1, 1, 10]}
mmc = MMC()
metric_learner = GridSearchCV(mmc, grid)
metric_learner.fit(pairs_train, y_train)
But this 3D array is highly redundant: a sample that belongs to several pairs is copied into each of them
31 / 48

Other solution: 2D arrays of indices
First argument of the metric learner is now indices (2D array of indices)
Give also the X array when initializing the metric learner
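Pairs of indices and their labels (rows are divided between the train set and the test set):
indices    y
[0, 3]     1
[4, 0]    -1
[1, 5]     1
[6, 7]     1
X array that the indices point into:
[[3.2, 6.8, 9.1],
 [3.5, 4.9, 1.0],
 [1.5, 2.9, 4.0],
 [2.5, 1.8, 2.5],
 [3.1, 6.7, 1.8],
 [8.5, 7.2, 9.0],
 [4.5, 9.0, 4.2],
 [3.8, 6.4, 2.6]]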
32 / 48

grid = {'alpha': [0.1, 1, 10]}
mmc = MMC(preprocessor=data)
metric_learner = GridSearchCV(mmc, grid)
metric_learner.fit(pairs_train_indices, y_train)
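
The relation between the two representations can be sketched with plain NumPy, using the toy values from the slides. This only illustrates the data layout; it is not a claim about how the package's preprocessor is implemented:

```python
import numpy as np

# The 8 samples, stored once.
X = np.array([[3.2, 6.8, 9.1],
              [3.5, 4.9, 1.0],
              [1.5, 2.9, 4.0],
              [2.5, 1.8, 2.5],
              [3.1, 6.7, 1.8],
              [8.5, 7.2, 9.0],
              [4.5, 9.0, 4.2],
              [3.8, 6.4, 2.6]])
# Each row names the two samples that form a pair.
pair_indices = np.array([[0, 3], [4, 0], [1, 5], [6, 7]])

# Fancy indexing rebuilds the 3D array of pairs: shape (n_pairs, 2, n_features).
pairs = X[pair_indices]
print(pairs.shape)
print(pairs[1])  # second pair: samples 4 and 0

# The index array stores 2 integers per pair instead of 2 * n_features floats.
print(pair_indices.size, pairs.size)
```

Sample 0 appears in two pairs, so the 3D array holds its three features twice while the index form only repeats the integer 0.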
33 / 48

Other example of accepted data:
path_pairs_train = [['img_1.png', 'img_2.png'], ['img_2.png', 'img_4.png'], ['img_2
root = '~/images'
itml = ITML(preprocessor=ImgLoader(root))
itml.fit(path_pairs_train, y_train)
34 / 48

Note
Pairs will be formed batch-wise from indices inside the algorithm:
def fit(self, indices, y):
    weights_update = np.zeros((d, d))
    for indices_batch in yield_batches(indices):
        weights_update += some_computation(preprocessor(indices_batch))
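
A runnable sketch of this batch-wise pattern. The helper yield_batches is hypothetical, and the sum of outer products of pair differences stands in for the unspecified some_computation; both are assumptions chosen for illustration:

```python
import numpy as np

def yield_batches(indices, batch_size):
    """Yield consecutive slices of the index array (hypothetical helper)."""
    for start in range(0, len(indices), batch_size):
        yield indices[start:start + batch_size]

def sum_outer_products(X, indices, batch_size=2):
    """Accumulate sum of (x_i - x_j)(x_i - x_j)^T, forming pairs one batch at a time."""
    d = X.shape[1]
    total = np.zeros((d, d))
    for indices_batch in yield_batches(indices, batch_size):
        pairs = X[indices_batch]           # shape (batch, 2, d): only this batch is materialized
        diffs = pairs[:, 0] - pairs[:, 1]  # shape (batch, d)
        total += diffs.T @ diffs
    return total

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
indices = np.array([[0, 3], [4, 0], [1, 5], [6, 7], [2, 6]])

batched = sum_outer_products(X, indices, batch_size=2)

# Reference: same quantity computed from all pairs at once.
all_diffs = X[indices[:, 0]] - X[indices[:, 1]]
full = all_diffs.T @ all_diffs
print(np.allclose(batched, full))
```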
35 / 48

Algorithms
Fully Supervised:
classification: NCA, LMNN, LFDA, Covariance
regression: MLKR
Weakly Supervised:
pairs: MMC, ITML, SDML
quadruplets: LSML
Every pairs/quadruplets based algorithm comes with a *_Supervised version
that creates pairs/quadruplets on the fly
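
One simple way to build pairs from class labels can be sketched as follows. The function make_pairs_from_labels is a hypothetical helper written for this example, and uniform random sampling is an assumption; the sampling strategy used by the package's *_Supervised classes may differ:

```python
import numpy as np

def make_pairs_from_labels(y, n_pairs, random_state=0):
    """Draw random index pairs, labeled +1 (same class) or -1 (different classes)."""
    rng = np.random.default_rng(random_state)
    n = len(y)
    first = rng.integers(0, n, size=n_pairs)
    # Shift by 1..n-1 (mod n) so that a sample is never paired with itself.
    second = (first + rng.integers(1, n, size=n_pairs)) % n
    pair_indices = np.column_stack([first, second])
    pair_labels = np.where(y[first] == y[second], 1, -1)
    return pair_indices, pair_labels

y = np.array(['Hart', 'Hart', 'Mcmahon', 'Mcmahon', 'Cain', 'Cain'])
pair_indices, pair_labels = make_pairs_from_labels(y, n_pairs=10)
print(pair_indices)
print(pair_labels)
```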
37 / 48

Quadruplets based algorithms
"A is more similar to B than C is to D"
less supervision: relative similarity judgments (you do not explicitly "force" some
similarities to be small or large)
notion of ordering between pairwise similarities
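
A sketch of what such a relative judgment looks like numerically, assuming quadruplets are stored as an array of shape (n_quadruplets, 4, n_features) in the order A, B, C, D and that similarity is measured by the distance after a linear map L. The function predict_quadruplets is a hypothetical helper, not the package's API:

```python
import numpy as np

def predict_quadruplets(L, quadruplets):
    """+1 if A is closer to B than C is to D in the space transformed by L, else -1."""
    transformed = quadruplets @ L.T  # shape (n, 4, d_out)
    d_ab = np.linalg.norm(transformed[:, 0] - transformed[:, 1], axis=1)
    d_cd = np.linalg.norm(transformed[:, 2] - transformed[:, 3], axis=1)
    return np.where(d_ab < d_cd, 1, -1)

quadruplets = np.array([
    [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [3.0, 0.0]],  # d(A, B) = 1 < d(C, D) = 3
    [[0.0, 0.0], [5.0, 0.0], [1.0, 1.0], [1.0, 2.0]],  # d(A, B) = 5 > d(C, D) = 1
])
predictions = predict_quadruplets(np.eye(2), quadruplets)
print(predictions)
```

Only the ordering of the two distances matters here; no distance is compared to an absolute threshold.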
38 / 48

Weakly Supervised Learners
Scoring pairs/quadruplets based algorithms
for all metric learners (even supervised ones):
score_pairs: returns a similarity score
for pairs learners:
predict: +1 or -1 depending on whether the pair is similar or not (uses a threshold)
can be evaluated with scikit-learn scorers such as accuracy and roc_auc
for quadruplets learners:
predict: +1 if A is more similar to B than C is to D, -1 otherwise
can be evaluated with scikit-learn scorers such as accuracy and roc_auc
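
The scoring of pairs can be sketched without the package itself. The functions pair_distances and predict_pairs are hypothetical stand-ins, the identity matrix replaces a learned L, and the threshold value is an assumption; accuracy_score and roc_auc_score are the actual scikit-learn metrics:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def pair_distances(L, pairs):
    """Euclidean distance between the two members of each pair after applying L."""
    diffs = (pairs[:, 0] - pairs[:, 1]) @ L.T
    return np.linalg.norm(diffs, axis=1)

def predict_pairs(L, pairs, threshold):
    """+1 if the two points are closer than the threshold, -1 otherwise."""
    return np.where(pair_distances(L, pairs) < threshold, 1, -1)

pairs = np.array([[[0.0, 0.0], [0.1, 0.0]],
                  [[0.0, 0.0], [3.0, 0.0]],
                  [[1.0, 1.0], [1.0, 1.2]],
                  [[1.0, 1.0], [4.0, 5.0]]])
y_true = np.array([1, -1, 1, -1])
L = np.eye(2)  # stand-in for a learned transformation

y_pred = predict_pairs(L, pairs, threshold=1.0)
accuracy = accuracy_score(y_true, y_pred)
# roc_auc expects a score that grows with similarity, so the distance is negated.
auc = roc_auc_score(y_true, -pair_distances(L, pairs))
print(accuracy, auc)
```

The ROC AUC does not depend on the threshold, which makes it convenient for comparing metrics before a threshold is chosen.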
40 / 48

Mahalanobis metric learning (cf. MMC before)
For now: all algorithms define a Euclidean distance in an embedding space
that is obtained through a linear transformation $L$:
metric: $\|L x_i - L x_j\|$
All have the transform method
They can do dimensionality reduction
mmc.fit(pairs_train, y_train)
mmc.transform(X_test)
# result is an array of shape (X_test.shape[0], dim_output)
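
The link with the Mahalanobis form can be checked numerically: the Euclidean distance after applying L equals the distance defined by the matrix M = L^T L. The sketch below uses a random L with fewer rows than columns (an assumption, to also show the dimensionality reduction) rather than a learned one:

```python
import numpy as np

rng = np.random.default_rng(0)
L = rng.normal(size=(2, 5))  # maps 5 features to 2: dimensionality reduction
x_i, x_j = rng.normal(size=5), rng.normal(size=5)

# Euclidean distance in the embedding space.
embedded_distance = np.linalg.norm(L @ x_i - L @ x_j)

# Same distance written as sqrt((x_i - x_j)^T M (x_i - x_j)) with M = L^T L.
M = L.T @ L  # positive semi-definite, shape (5, 5)
diff = x_i - x_j
mahalanobis_distance = np.sqrt(diff @ M @ diff)

# Transforming a whole dataset: one row per sample, 2 columns after the map.
X = rng.normal(size=(10, 5))
X_embedded = X @ L.T
print(embedded_distance, mahalanobis_distance, X_embedded.shape)
```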
42 / 48

Testing and Continuous Integration
def test_fit_mmc():
    ???
We do not know in advance what we want to test
But fortunately:
We know some properties of the objects we work with
testing the gradient: compare it with a finite-difference approximation
scipy.optimize.check_grad
testing that a transformation is indeed linear: f(ax+by) = a f(x) + b f(y)
...
We can use toy examples
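
Both property checks can be sketched on a simple loss. The loss below (sum of squared distances between similar pairs) and its gradient are assumptions chosen for illustration, not the package's code; scipy.optimize.check_grad returns the norm of the difference between the analytical gradient and a finite-difference approximation:

```python
import numpy as np
from scipy.optimize import check_grad

rng = np.random.default_rng(0)
diffs = rng.normal(size=(6, 3))  # differences x_i - x_j for 6 similar pairs

def loss(flat_L):
    """Sum of squared distances ||L (x_i - x_j)||^2 for a flattened 2x3 matrix L."""
    L = flat_L.reshape(2, 3)
    return np.sum((diffs @ L.T) ** 2)

def loss_grad(flat_L):
    """Analytical gradient 2 L (sum of outer products of the differences), flattened."""
    L = flat_L.reshape(2, 3)
    return (2 * L @ diffs.T @ diffs).ravel()

L0 = rng.normal(size=6)
gradient_error = check_grad(loss, loss_grad, L0)
print(gradient_error)  # should be close to zero

# Linearity check: f(a x + b y) == a f(x) + b f(y) for f(x) = L x.
L = L0.reshape(2, 3)
x, y = rng.normal(size=3), rng.normal(size=3)
a, b = 2.0, -0.5
is_linear = np.allclose(L @ (a * x + b * y), a * (L @ x) + b * (L @ y))
print(is_linear)
```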
43 / 48

Designing toy examples
Simple example that exhibits a property that you can test:
Ex: 3 points in 2D (not collinear), $x_0$ and $x_1$ close but shouldn't be, and $x_0$ and $x_2$ far but shouldn't be
def test_mmc_toy_example():
    data = np.array([[0, 0], [0, 1], [2, 0]])
    pairs = np.array([[0, 1], [0, 2]])  # pairs of indices into data
    y = np.array([-1, 1])  # x_0, x_1 dissimilar; x_0, x_2 similar
    mmc = MMC(preprocessor=data)
    mmc.fit(pairs, y)
    data_transformed = mmc.transform(data)
    # after learning, x_1 should be further from x_0 than x_2 is
    assert (np.linalg.norm(data_transformed[1] - data_transformed[0]) >
            np.linalg.norm(data_transformed[2] - data_transformed[0]))
44 / 48

Recap: v0.5.0 (in a few weeks)
scikit-learn compatibility (cross-validation, GridSearchCV...)
"Preprocessor" to avoid memory consumption
Next steps
submit to sklearn-contrib
stochastic optimizers for scaling up
more choice to form pairs/quadruplets from labeled data
general functions like regularizers etc
more testing
more documentation, incl. examples
...
45 / 48

Conclusion
Metric learning: learn similarities from weakly supervised information
Many use cases
open source package metric-learn
v0.5.0: compatibility with scikit-learn
46 / 48

Check it out !
open source
raise issues
submit PRs
any contribution is welcome !
47 / 48

Questions ?
Contact
william.de-vazelhes@inria.fr
48 / 48
