Learning from (dis)similarity data

Nathalie Vialaneix
nathalie.vialaneix@inra.fr
http://www.nathalievialaneix.eu
MelbURN 2018
July 16th, 2018 - Melbourne, Australia

What are my data like?

A medieval social network [Boulet et al., 2008, Rossi et al., 2013]
corpus of more than 6,000 transactions over 3 centuries, all related to Castelnau Montratier

[Figure: bipartite graph linking individuals (e.g. Ratier (II) Castelnau, Jean Laperarede, Bertrande Audoy) to the transactions in which they are involved]
bipartite network with more than 17,000 nodes (∼ 10,000 individuals)
What can we learn from the French
medieval society?

Career paths [Olteanu and Villa-Vialaneix, 2015]
Survey “Génération 98”: labor market status (9 categories) followed over 94 months for more than 16,000 people who graduated in 1998. [1]
[1] Available thanks to Génération 1998 à 7 ans - 2005, [producer] CEREQ, [diffusion] Centre Maurice Halbwachs (CMH).

How to cluster career paths into
homogeneous groups?

It is all about distance...
The χ² dissimilarity emphasizes identical situations that occur at the same time
Optimal-matching dissimilarities (also called “edit distance” or “Levenshtein distance”) focus more on similarities between the sequences [Needleman and Wunsch, 1970]
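
As a rough illustration of the difference between the two families of dissimilarities, the sketch below compares two toy career paths in base R. The position-by-position mismatch count is only a simplified stand-in for a dissimilarity that compares contemporary situations (it is not the χ² dissimilarity itself), while `adist` (from the base `utils` package) computes a Levenshtein edit distance with unit costs. The state codes and the two sequences are invented for the example.

```r
# Two hypothetical career paths, one state per month:
# "U" = unemployed, "E" = employed (codes invented for this example).
path_a <- "UEUEUE"
path_b <- "EUEUEU"

# Position-wise comparison: number of months in which the two
# situations differ (a crude proxy for "contemporary" dissimilarities).
states_a <- strsplit(path_a, "")[[1]]
states_b <- strsplit(path_b, "")[[1]]
mismatches <- sum(states_a != states_b)

# Edit (Levenshtein) distance: minimal number of insertions, deletions
# and substitutions needed to turn one sequence into the other.
edit_distance <- as.integer(adist(path_a, path_b))

print(c(mismatches = mismatches, edit_distance = edit_distance))
```

The two paths never coincide month by month (6 mismatches), yet one is the other shifted by a single month, so the edit distance is only 2.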

and then I went into NGS data...
and again...
distances are everywhere

a collection of NGS data...
DNA barcoding (Astraptes fulgerator): optimal matching (edit) distances to differentiate species

Hi-C data: pairwise measure (similarity) related to the physical 3D distance between loci in the cell, at genome scale

Metagenomics: the dissimilarity between samples is better captured when the phylogeny between species is taken into account (UniFrac distances)

Relational Self-Organizing Map algorithm

Basics on (standard) stochastic SOM
[Kohonen, 2001]
observations $(x_i)_{i=1,\ldots,n} \subset \mathbb{R}^d$ are assigned to a unit $f(x_i) \in \{1, \ldots, U\}$
the grid is equipped with a “distance” between units, $d(u, u')$, and observations assigned to close units are close in $\mathbb{R}^d$
every unit $u$ corresponds to a prototype $p_u$ (marked ×) in $\mathbb{R}^d$

Iterative learning (assignment step): $x_i$ is picked at random within $(x_k)_k$ and assigned to its best matching unit:
$$f^t(x_i) = \arg\min_u \|x_i - p_u^t\|^2$$

Iterative learning (representation step): all prototypes in neighboring units are updated with a gradient-descent-like step:
$$p_u^{t+1} \longleftarrow p_u^t + \mu(t)\, H^t(d(f(x_i), u))\, (x_i - p_u^t)$$
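
The assignment and representation steps can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the SOMbrero implementation: the 3 × 3 grid, the Gaussian form of the neighbourhood function and the linear decrease of the learning rate and radius are choices made for the example, and `som_step` is a hypothetical helper name.

```r
set.seed(1)

X <- as.matrix(iris[, 1:4])              # n x d numeric observations
grid <- expand.grid(x = 1:3, y = 1:3)    # 3 x 3 square grid, U = 9 units
grid_dist <- as.matrix(dist(grid))       # d(u, u'): distance between units
U <- nrow(grid)
protos <- X[sample(nrow(X), U), ]        # prototypes start on random observations

som_step <- function(protos, x, mu, sigma) {
  # Assignment step: best matching unit (BMU) for observation x
  delta <- sweep(protos, 2, x)           # row u holds p_u - x
  bmu <- which.min(rowSums(delta^2))
  # Representation step: Gaussian neighbourhood centred on the BMU (assumption)
  h <- exp(-grid_dist[bmu, ]^2 / (2 * sigma^2))
  # p_u <- p_u + mu * h_u * (x - p_u); h is recycled down the rows
  list(bmu = bmu, protos = protos - mu * h * delta)
}

n_iter <- 500
for (iter in seq_len(n_iter)) {
  i <- sample(nrow(X), 1)                          # pick an observation at random
  mu <- 0.5 * (1 - iter / (n_iter + 1))            # assumed decreasing learning rate
  sigma <- 2 * (1 - iter / (n_iter + 1)) + 0.1     # assumed shrinking radius
  protos <- som_step(protos, X[i, ], mu, sigma)$protos
}
```

Because the neighbourhood weight of the BMU itself is 1, one step moves the BMU prototype towards the picked observation by a fraction `mu` of their distance; its neighbours on the grid move by a smaller fraction.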

Extension of SOM to data described by a kernel or a dissimilarity
[Olteanu and Villa-Vialaneix, 2015]
Starting point, the standard (numeric) stochastic SOM:
Data: $(x_i)_{i=1,\ldots,n} \in \mathbb{R}^d$
1: Initialization: randomly set $p_1^0, \ldots, p_U^0$ in $\mathbb{R}^d$
2: for $t = 1 \to T$ do
3:   pick at random $i \in \{1, \ldots, n\}$
4:   Assignment: $f^t(x_i) = \arg\min_{u=1,\ldots,U} \|x_i - p_u^t\|^2$
5:   for all $u = 1 \to U$ do (Representation)
6:     $p_u^{t+1} = p_u^t + \mu(t)\, H^t(d(f^t(x_i), u))\, (x_i - p_u^t)$
7:   end for
8: end for

Adapting the algorithm to data $(x_i)_{i=1,\ldots,n} \in \mathcal{X}$, where $\mathcal{X}$ is an arbitrary space described only by a dissimilarity $D$:
Initialization: prototypes can no longer be set in $\mathbb{R}^d$; they are written as (symbolic) convex combinations of the observations, $p_u^0 \sim \sum_{i=1}^n \beta_{ui}^0 x_i$
Assignment: the squared norm $\|x_i - p_u^t\|^2$ is replaced by the dissimilarity between prototype and observation, $f^t(x_i) = \arg\min_{u=1,\ldots,U} D(p_u^t, x_i)$
Representation: the update $p_u^{t+1} = p_u^t + \mu(t)\, H^t(d(f^t(x_i), u))\, (x_i - p_u^t)$ has to be rewritten in terms of the coefficients $\beta_u^t$
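Resulting relational SOM algorithm:
Data: $(x_i)_{i=1,\ldots,n} \in \mathcal{X}$
1: Initialization: $p_u^0 \sim \sum_{i=1}^n \beta_{ui}^0 x_i$
2: for $t = 1 \to T$ do
3:   pick at random $i \in \{1, \ldots, n\}$
4:   Assignment: $f^t(x_i) = \arg\min_{u=1,\ldots,U} \left[ (\beta_u^t)^\top D(., x_i) - \frac{1}{2} (\beta_u^t)^\top D \beta_u^t \right]$
5:   for all $u = 1 \to U$ do (Representation)
6:     $\beta_u^{t+1} = \beta_u^t + \mu(t)\, H^t(d(f^t(x_i), u))\, (\mathbf{1}_i - \beta_u^t)$
7:   end for
8: end for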
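
A minimal base R sketch of one relational SOM iteration is given below, to make the role of the coefficients β concrete. It is an illustration under assumptions, not the SOMbrero code: the dissimilarity is taken to be the squared Euclidean distance on `iris` (so that the result can be checked against the numeric case), the neighbourhood is Gaussian, and `relational_dist` and `rsom_step` are hypothetical helper names.

```r
set.seed(42)

X <- as.matrix(iris[, 1:4])
D <- as.matrix(dist(X))^2                # assumed dissimilarity: squared Euclidean
n <- nrow(D)
grid <- expand.grid(x = 1:3, y = 1:3)    # 3 x 3 grid, U = 9 units
grid_dist <- as.matrix(dist(grid))
U <- nrow(grid)

# beta is U x n: row u holds the convex-combination coefficients of prototype u
beta <- matrix(runif(U * n), nrow = U)
beta <- beta / rowSums(beta)

# (beta_u)' D(., x_i) - 1/2 (beta_u)' D beta_u, computed for all units at once
relational_dist <- function(beta, D, i) {
  as.vector(beta %*% D[, i]) - 0.5 * rowSums((beta %*% D) * beta)
}

rsom_step <- function(beta, D, i, mu, sigma) {
  # Assignment step, using only D and the coefficients
  bmu <- which.min(relational_dist(beta, D, i))
  # Representation step: move every beta_u towards the indicator vector 1_i
  h <- exp(-grid_dist[bmu, ]^2 / (2 * sigma^2))
  indicator <- matrix(as.numeric(seq_len(ncol(beta)) == i),
                      nrow = nrow(beta), ncol = ncol(beta), byrow = TRUE)
  list(bmu = bmu, beta = beta + mu * h * (indicator - beta))
}

res <- rsom_step(beta, D, i = 1, mu = 0.3, sigma = 1)
```

With this choice of D, `relational_dist(beta, D, i)` coincides with the squared Euclidean distance between `X[i, ]` and the explicit prototypes `beta %*% X`, and each row of `beta` remains a convex combination after the update as long as `mu * h` stays in [0, 1].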

Note on drawbacks of RSOM
Two main drawbacks:
For $T \sim \gamma n$ iterations, the complexity of RSOM is $O(\gamma n^3 U)$ (compared to $O(\gamma U d n)$ for the numeric case) [Rossi, 2014]. An exact solution is proposed in [Mariette et al., 2017] to reduce the complexity to $O(\gamma n^2 U)$, with additional storage memory of $O(Un)$.
For the non-Euclidean case, the learning algorithm can be very unstable (saddle points): clip or flip? [Chen et al., 2009]
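
For readers unfamiliar with the "clip or flip" question, the sketch below shows the two spectrum modifications discussed in [Chen et al., 2009] on a toy similarity matrix: "clip" sets the negative eigenvalues to zero, "flip" replaces them by their absolute value. The matrix `S` and the helper name `spectrum_fix` are invented for the example; which correction, if any, is appropriate for RSOM is left open here, as on the slide.

```r
# Toy symmetric similarity matrix that is not positive semi-definite
S <- matrix(c( 1.0, 0.9, -0.8,
               0.9, 1.0,  0.9,
              -0.8, 0.9,  1.0), nrow = 3, byrow = TRUE)

spectrum_fix <- function(S, method = c("clip", "flip")) {
  method <- match.arg(method)
  e <- eigen(S, symmetric = TRUE)
  # clip: negative eigenvalues -> 0; flip: negative eigenvalues -> |value|
  values <- if (method == "clip") pmax(e$values, 0) else abs(e$values)
  e$vectors %*% diag(values, nrow = length(values)) %*% t(e$vectors)
}

S_clip <- spectrum_fix(S, "clip")
S_flip <- spectrum_fix(S, "flip")

print(eigen(S, symmetric = TRUE)$values)       # one eigenvalue is negative
print(eigen(S_clip, symmetric = TRUE)$values)  # no negative eigenvalue left
```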

SOMbrero
[Villa-Vialaneix, 2017]
SOMbrero is an R package implementing stochastic variants of SOM for non-vectorial data
Especially well suited to...
non-expert use and teaching
use with graphs, to obtain simplified representations
first release: March 2013; latest release: Feb. 2018 (version 1.2.3)
depends on R (version ≥ 3.1.0) http://www.r-project.org
and on several packages available on CRAN: wordcloud, igraph, RColorBrewer, scatterplot3d, knitr, shiny
available at https://cran.r-project.org/package=SOMbrero (licence GPL) and can be installed from inside R using
install.packages("SOMbrero")

Training
mysom <- trainSOM(iris[ ,1:4], ...)
Options to train the SOM:
grid: square grid, with arbitrary width and length
distance between units: standard distances as in dist, or "letremy" (Euclidean, then "maximum")
neighborhood relationship: Gaussian or "letremy"
prototypes: initialized randomly, with a PCA, or with random observations from the training sample
preprocessing: centering, scaling to unit variance, or nothing
training: number of iterations, standard or Heskes’s assignment step:
$$f^t(x_i) \leftarrow \arg\min_{u=1,\ldots,U} \sum_{u'=1}^{U} H^t(d(u, u'))\, \|x_i - p_{u'}^{t-1}\|^2$$
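
To make the difference between the standard and Heskes's assignment steps concrete, here is a small base R sketch. It is not the SOMbrero code: the grid, the toy prototypes, the observation `x` and the Gaussian neighbourhood matrix `H` are assumptions made for the example, and `standard_bmu` and `heskes_bmu` are hypothetical helper names.

```r
set.seed(3)

grid <- expand.grid(x = 1:3, y = 1:3)      # 3 x 3 grid, U = 9 units
grid_dist <- as.matrix(dist(grid))         # d(u, u') on the grid
U <- nrow(grid)
protos <- matrix(rnorm(U * 2), nrow = U)   # toy prototypes in R^2
x <- c(0.2, -0.1)                          # one toy observation

# ||x - p_u'||^2 for every unit u'
sq_dist <- rowSums(sweep(protos, 2, x)^2)

# Standard assignment: closest prototype
standard_bmu <- function(sq_dist) which.min(sq_dist)

# Heskes's assignment: minimise the neighbourhood-weighted sum
# sum_u' H[u, u'] * ||x - p_u'||^2 over the units u
heskes_bmu <- function(sq_dist, H) which.min(as.vector(H %*% sq_dist))

H <- exp(-grid_dist^2 / 2)                 # assumed Gaussian neighbourhood, radius 1
c(standard = standard_bmu(sq_dist), heskes = heskes_bmu(sq_dist, H))
```

When the neighbourhood shrinks to the unit itself (`H` equal to the identity matrix), the two rules select the same unit.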

Diagnostic tools
quality(mysom)
topographic error: frequency (over the observations) with which the second-closest prototype is not in the direct neighborhood, on the grid, of the BMU (best matching unit)
quantization error:
$$Q = \frac{1}{n} \sum_{i=1}^{n} \|x_i - p_{f(x_i)}\|^2$$
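
Both criteria can be computed by hand, which may help to interpret the output of `quality()`. The sketch below uses base R only and makes two assumptions that are not taken from the package: the topographic error counts observations whose second-closest prototype is not adjacent to the BMU, and "direct neighborhood" means a grid distance of at most 1. `som_quality` is a hypothetical helper name and the map is a toy example.

```r
som_quality <- function(X, protos, grid_dist) {
  # n x U matrix of squared distances between observations and prototypes
  d2 <- matrix(sapply(seq_len(nrow(protos)),
                      function(u) rowSums(sweep(X, 2, protos[u, ])^2)),
               nrow = nrow(X))
  ranked <- t(apply(d2, 1, order))   # units sorted by increasing distance
  bmu <- ranked[, 1]
  second <- ranked[, 2]
  c(quantization = mean(d2[cbind(seq_len(nrow(X)), bmu)]),
    # assumed: "direct neighborhood" = grid distance of at most 1
    topographic = mean(grid_dist[cbind(bmu, second)] > 1))
}

# Toy map: three units on a line, two observations in R^2
grid_dist <- as.matrix(dist(data.frame(x = 1:3, y = 1)))
X <- rbind(c(0.1, 0), c(1.9, 0))
ordered_protos  <- rbind(c(0, 0), c(1, 0), c(2, 0))  # respects the grid order
shuffled_protos <- rbind(c(0, 0), c(2, 0), c(1, 0))  # units 2 and 3 swapped

q_ordered  <- som_quality(X, ordered_protos, grid_dist)
q_shuffled <- som_quality(X, shuffled_protos, grid_dist)
print(rbind(q_ordered, q_shuffled))
```

Both maps quantize the data equally well, but only the shuffled one has a non-zero topographic error, since for the first observation the two closest prototypes sit at opposite ends of the grid.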

Plots...
plot(mysom,
what = c("observations", "prototypes", "add"),
type = ..., ...)

Super-clustering
mysom.sc <- superClass(mysom)

Start with SOMbrero
3 datasets corresponding to the three types of data that SOMbrero can handle (iris, presidentielles2002 and lesmis, a graph from “Les Misérables”)
comprehensive (HTML) vignettes included in the package and available on the website
Web User Interface (made with shiny) for using the package even if you do not know the R programming language (included in the package with sombreroGUI()). Tested and approved on a historian!

RSOM for mining a medieval social network
with the heat kernel
[Figure: network of individuals and transactions, with nodes labelled by individuals' names (e.g. Ratier (II) Castelnau, Jean Laperarede, Bertrande Audoy)]
[Boulet et al., 2008]
Graph induced by clusters:
has nice relations with space and time
emphasizes leading people
has helped to identify problems in the
database (namesakes)
But: biggest communities are still
very complex

RSOM for typology of Astraptes fulgerator from DNA
barcoding
Edit distances between DNA sequences [Olteanu and Villa-Vialaneix, 2015]
Almost perfect clustering (identifying a possible label error on one sample)
with (in addition) information on relations between species.

RSOM for typology of school-to-work transitions
Edit distance between 12,000 categorical time series

Also in SOMbrero: KORRESP
[Cottrell and Letrémy, 2005]
Data: contingency table $T = (n_{ij})_{ij}$ with $p$ rows and $q$ columns, transformed into a numeric dataset $X$ with $p + q$ rows (the $p$ rows of $T$, then its $q$ columns) and $q + p$ columns (the $q$ columns of $T$, then its $p$ rows):
row profiles (first $p$ rows, first $q$ columns of $X$): $\forall\, i = 1, \ldots, p$ and $\forall\, j = 1, \ldots, q$, $x_{ij} = \frac{n_{ij}}{n_{i.}} \times \frac{n}{n_{.j}}$
column profiles (last $q$ rows, last $p$ columns of $X$)
augmented row profiles (first $p$ rows, last $p$ columns of $X$): $\forall\, i = 1, \ldots, p$ and $\forall\, j = q + 1, \ldots, q + p$, $x_{ij} = x_{k(i)+p,\, j}$ with $k(i) = \arg\max_{k=1,\ldots,q} x_{ik}$
augmented column profiles (last $q$ rows, first $q$ columns of $X$)
assignment uses the reduced profile
representation uses the augmented profile
row profiles and column profiles are processed alternately

Also available in SOMbrero
mysom <- trainSOM(presidentielles2002, type = "korresp")
plot(mysom, what = "obs", type = "names")

SOMbrero
Madalina Olteanu,
Fabrice Rossi, Marie Cottrell,
Laura Bendhaïba and
Julien Boelaert
SOMbrero and mixKernel
Jérôme Mariette
adjclust
Pierre Neuvial, Guillem Rigaill, Christophe Ambroise and
Shubham Chaturvedi

Don’t miss useR! 2019
user2019.r-project.org

Credits for pictures
Slide 2: Linking Open Data cloud diagram 2017, by Andrejs Abele, John P. McCrae, Paul Buitelaar, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/
Slide 3: Picture of Castelnau Montratier from https://commons.wikimedia.org/wiki/File:Place_Gambetta,_Castelnau-Montratier.JPG by Duch.seb (CC BY-SA 3.0)
Slide 4: image based on ENCODE project, by Darryl Leja (NHGRI), Ian Dunham (EBI) and Michael Pazin (NHGRI)
Slide 6: Astraptes picture is from https://www.flickr.com/photos/39139121@N00/2045403823/ by Anne Toal (CC BY-SA 2.0); Hi-C experiment is taken from the article Matharu et al., 2015, DOI:10.1371/journal.pgen.1005640 (CC BY-SA 4.0); metagenomics illustration is taken from the article Sommer et al., 2010, DOI:10.1038/msb.2010.16 (CC BY-NC-SA 3.0)
Slide 12: TADs picture is from the article Fraser et al., 2015, DOI:10.15252/msb.20156492 (CC BY-SA 4.0)

References
Boulet, R., Jouve, B., Rossi, F., and Villa, N. (2008).
Batch kernel SOM and related Laplacian methods for social network analysis.
Neurocomputing, 71(7-9):1257–1273.
Chen, Y., Garcia, E., Gupta, M., Rahimi, A., and Cazzanti, L. (2009).
Similarity-based classification: concepts and algorithms.
Journal of Machine Learning Research, 10:747–776.
Cottrell, M. and Letrémy, P. (2005).
How to use the Kohonen algorithm to simultaneously analyse individuals in a survey.
Neurocomputing, 63:193–207.
Kohonen, T. (2001).
Self-Organizing Maps, 3rd Edition, volume 30.
Springer, Berlin, Heidelberg, New York.
Mariette, J., Rossi, F., Olteanu, M., and Villa-Vialaneix, N. (2017).
Accelerating stochastic kernel SOM.
In Verleysen, M., editor, XXVth European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2017), pages 269–274, Bruges, Belgium. i6doc.
Needleman, S. and Wunsch, C. (1970).
A general method applicable to the search for similarities in the amino acid sequence of two proteins.
Journal of Molecular Biology, 48(3):443–453.
Olteanu, M. and Villa-Vialaneix, N. (2015).
On-line relational and multiple relational SOM.
Neurocomputing, 147:15–30.
Rossi, F. (2014).
How many dissimilarity/kernel self organizing map variants do we need?
In Villmann, T., Schleif, F., Kaden, M., and Lange, M., editors, Advances in Self-Organizing Maps and Learning Vector Quantization (Proceedings of WSOM 2014), volume 295 of Advances in Intelligent Systems and Computing, pages 3–23, Mittweida, Germany. Springer Verlag, Berlin, Heidelberg.
Rossi, F., Villa-Vialaneix, N., and Hautefeuille, F. (2013).
Exploration of a large database of French notarial acts with social network methods.
Digital Medievalist, 9.
Villa-Vialaneix, N. (2017).
Stochastic self-organizing map variants with the R package SOMbrero.
In Lamirel, J., Cottrell, M., and Olteanu, M., editors, 12th International Workshop on Self-Organizing Maps and Learning Vector
Quantization, Clustering and Data Visualization (Proceedings of WSOM 2017), Nancy, France. IEEE.
