MASSICCC: A SaaS Platform for
Clustering and Co-Clustering of Mixed Data
https://massiccc.lille.inria.fr/
F. Laporte
with B. Auder, C. Biernacki, G. Celeux, J. Demont, F. Langrognet, V. Kubicki, C. Poli, J. Renault, S. Iovleff
May 29th 2019, TechTalk, Paris
Outline
1 Introduction
2 Model-based clustering
3 Mixmod in MASSICCC
4 MixtComp in MASSICCC
5 BlockCluster in MASSICCC
6 Conclusion
MASSICCC?
massiccc.lille.inria.fr
SaaS: Software as a Service
MASSICCC: Examples of Applications
Market sales
Cities’ similarities
Predictive maintenance
Health
Data mining
Large dataset
Complex dataset
MASSICCC??
A high-quality, easy-to-use web platform
where mature research clustering (and more) software is transferred
to (non-academic) professionals
Here is the computer you need!
Clustering?
Detect hidden structures in data sets
[Figure: two scatter plots on the first two MCA axes (1st MCA axis vs. 2nd MCA axis); the second plot reveals the clusters: low income, average income, high income]
Large data sets¹
¹ S. Alelyani, J. Tang and H. Liu (2013). Feature Selection for Clustering: A Review. Data Clustering: Algorithms and Applications, 29
An opportunity for detecting weak signal
Today's features: fully mixed/missing data
Notations
Data: n individuals x = {x1, . . . , xn} = {x^O, x^M} in a space X of dimension d
x^O: observed part of the data
x^M: missing part of the data
Aim: estimate the partition z and the number of clusters K
Partition into K clusters G1, . . . , GK: z = (z1, . . . , zn), zi = (zi1, . . . , ziK)
xi ∈ Gk ⇔ zih = I{h=k}
Mixed, missing, uncertain data (example):

Individuals x                                Partition z ⇔ Group
?    | 0.5          | red         | 5        ? ? ?  ⇔  ???
0.3  | 0.1          | green       | 3        ? ? ?  ⇔  ???
0.3  | 0.6          | {red,green} | 3        ? ? ?  ⇔  ???
0.9  | [0.25, 0.45] | red         | ?        ? ? ?  ⇔  ???
cont.| cont.        | categorical | integer
Parametric mixture model
Parametric assumption:
pk(x1) = p(x1; αk)
thus
p(x1) = p(x1; θ) = Σ_{k=1}^{K} πk p(x1; αk)
Mixture parameter:
θ = (π, α) with α = (α1, . . . , αK)
Model: it includes both the family p(·; αk) and the number of groups K
m = {p(x1; θ) : θ ∈ Θ}
The number of free continuous parameters is given by ν = dim(Θ)
Clustering becomes a well-posed problem. . .
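As a toy illustration of the mixture density above (not part of the original slides), here is a direct evaluation of p(x1; θ) = Σk πk p(x1; αk) for a hypothetical two-component univariate Gaussian mixture; all parameter values below are made up.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density N(x; mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_pdf(x, pi, params):
    """p(x; theta) = sum_k pi_k p(x; alpha_k), Gaussian components."""
    return sum(p * gaussian_pdf(x, mu, s) for p, (mu, s) in zip(pi, params))

# Hypothetical two-cluster model (K = 2): theta = (pi, alpha)
pi = [0.4, 0.6]
params = [(-1.0, 0.5), (2.0, 1.0)]  # (mu_k, sigma_k) per cluster
print(mixture_pdf(0.0, pi, params))
```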
The clustering process in mixtures
1 Estimation of θ by θ̂
2 Estimation of the conditional probability that xi ∈ Gk:
tik(θ̂) = p(Zik = 1 | Xi = xi; θ̂) = π̂k p(xi; α̂k) / p(xi; θ̂)
3 Estimation of zi by maximum a posteriori (MAP):
ẑik = I{k = arg max_{h=1,...,K} tih(θ̂)}
4 Model selection: BIC, ICL, . . .
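Steps 2 and 3 can be sketched in a few lines (an illustrative Python sketch with invented parameters, not the platform's code): compute the posterior probabilities tik, then classify by MAP.

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def posterior_probs(x, pi, params):
    """t_ik = pi_k p(x_i; alpha_k) / p(x_i; theta)."""
    num = [p * gaussian_pdf(x, mu, s) for p, (mu, s) in zip(pi, params)]
    total = sum(num)
    return [v / total for v in num]

def map_assign(t):
    """MAP rule: z_i = argmax_k t_ik."""
    return max(range(len(t)), key=lambda k: t[k])

# Made-up two-component model; x = 2.5 is much closer to cluster 1
pi = [0.5, 0.5]
params = [(-1.0, 1.0), (3.0, 1.0)]
t = posterior_probs(2.5, pi, params)
print(t, map_assign(t))
```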
Possible datasets
Continuous data
14 parsimonious models based on the covariance matrices of the clusters (model selection)
Categorical data
Conditional independence
Mixed data
Conditional independence inter-type
Conditional independence intra-type (symmetry between types)
Distributions
Continuous: Gaussian
Categorical: multinomial
Estimation of θ
Maximize the complete-data likelihood over (θ, z):
ℓc(θ; x, z) = Σ_{i=1}^{n} Σ_{k=1}^{K} zik ln{πk p(xi; αk)}   (CEM)
Maximize the observed-data likelihood over θ:
ℓ(θ; x) = Σ_{i=1}^{n} ln p(xi; θ)   (EM)
Principle of EM and CEM
Initialization: θ^0
Iteration no. q:
E-step: estimate the probabilities t^q = {tik(θ^q)}
C-step: classify by setting t^q = MAP({tik(θ^q)})
M-step: maximize θ^{q+1} = arg max_θ ℓc(θ; x, t^q)
Stopping rule: iteration number or criterion stability
Properties
⊕: simplicity, monotonicity, low memory requirement
⊖: local maxima (depends on θ^0), linear convergence (EM)
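The E/M alternation above can be sketched for a univariate two-component Gaussian mixture (a simplified, standalone illustration; the data, initialisation, and fixed iteration count are arbitrary choices, not Mixmod's):

```python
import math, random

def gauss(x, mu, s):
    return math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi))

def em_gmm(xs, K=2, iters=50):
    # Initialization theta^0: means spread over the data range, unit scales
    mus = [min(xs), max(xs)]
    sigmas = [1.0] * K
    pis = [1.0 / K] * K
    for _ in range(iters):
        # E-step: posterior probabilities t_ik under the current theta^q
        T = []
        for x in xs:
            num = [pis[k] * gauss(x, mus[k], sigmas[k]) for k in range(K)]
            tot = sum(num)
            T.append([v / tot for v in num])
        # M-step: maximize the expected complete log-likelihood (closed form)
        for k in range(K):
            nk = sum(t[k] for t in T)
            pis[k] = nk / len(xs)
            mus[k] = sum(t[k] * x for t, x in zip(T, xs)) / nk
            var = sum(t[k] * (x - mus[k]) ** 2 for t, x in zip(T, xs)) / nk
            sigmas[k] = max(math.sqrt(var), 1e-6)
    return pis, mus, sigmas

random.seed(0)
xs = [random.gauss(0, 1) for _ in range(200)] + [random.gauss(5, 1) for _ in range(200)]
pis, mus, sigmas = em_gmm(xs)
print(sorted(mus))  # close to the true means 0 and 5
```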
Prostate cancer data (without missing data)
Individuals: n = 475 patients with prostatic cancer, grouped on clinical criteria into two stages (3 and 4) of the disease
Variables: d = 12 pre-trial variates were measured on each patient: eight continuous variables (age, weight, systolic blood pressure, diastolic blood pressure, serum haemoglobin, size of primary tumour, index of tumour stage and histologic grade, serum prostatic acid phosphatase) and four categorical variables with various numbers of levels (performance rating, cardiovascular disease history, electrocardiogram code, bone metastases)
Model: conditional independence, p(x1; αk) = p(x1^cont; αk^cont) · p(x1^cat; αk^cat)
Mixed data
Why should I use Mixmod?
Advantage(s)
Compares a lot of different models (continuous data)
Analyses mixed data
Disadvantage(s)
Does not handle missing data
Only continuous or categorical data
Full mixed data: conditional independence everywhere²
The aim is to combine continuous, categorical, integer, ordinal, ranking and functional data:
x1 = (x1^cont, x1^cat, x1^int, . . .)
The proposed solution is to mix all types through inter-type conditional independence:
p(x1; αk) = p(x1^cont; αk^cont) × p(x1^cat; αk^cat) × p(x1^int; αk^int) × . . .
In addition, for symmetry between types, intra-type conditional independence
Only need to define the univariate pdf for each variable type!
Continuous: Gaussian
Categorical: multinomial
Integer: Poisson
. . .
² MixtComp software on the MASSICCC platform: https://massiccc.lille.inria.fr/
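Under the inter- and intra-type conditional independence above, a cluster's log-density is simply the sum of univariate log-densities across variables. A minimal sketch (made-up parameters and helper names, not MixtComp's API):

```python
import math

def log_gauss(x, mu, s):
    return -0.5 * ((x - mu) / s) ** 2 - math.log(s * math.sqrt(2 * math.pi))

def log_multinomial(level, probs):
    return math.log(probs[level])

def log_poisson(n, lam):
    return n * math.log(lam) - lam - math.lgamma(n + 1)

def log_density_cluster_k(x_cont, x_cat, x_int, alpha_k):
    """log p(x1; alpha_k) = log p(x^cont) + log p(x^cat) + log p(x^int)."""
    mu, s = alpha_k["gauss"]
    return (log_gauss(x_cont, mu, s)
            + log_multinomial(x_cat, alpha_k["cat"])
            + log_poisson(x_int, alpha_k["pois"]))

# Hypothetical parameters for one cluster: Gaussian, 3-level multinomial, Poisson
alpha = {"gauss": (0.0, 1.0), "cat": [0.2, 0.5, 0.3], "pois": 3.0}
print(log_density_cluster_k(0.5, 1, 4, alpha))
```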
Missing data: MAR assumption and estimation
Assumption on the missingness mechanism
Missing At Random (MAR): the probability that a value is missing does not depend on its own value, given the observed variables.
Observed log-likelihood. . .
ℓ(θ; x^O) = Σ_{i=1}^{n} log Σ_{k=1}^{K} πk p(xi^O; αk) = Σ_{i=1}^{n} log [ Σ_{k=1}^{K} πk ∫ p(xi^O, xi^M; αk) dxi^M ]   (MAR assumption)
SEM algorithm³
A SEM algorithm to estimate θ by maximizing the observed-data log-likelihood
Initialisation: θ^(0)
Iteration no. q:
E-step: compute the conditional distribution p(x^M, z | x^O; θ^(q))
S-step: draw (x^{M(q)}, z^(q)) from p(x^M, z | x^O; θ^(q))
M-step: maximize θ^(q+1) = arg max_θ ln p(x^O, x^{M(q)}, z^(q); θ)
Stopping rule: iteration number
Properties: simpler than EM, with interesting properties!
Avoids a possibly difficult E-step of EM
Classical M-steps
Avoids local maxima
The mean of the sequence (θ^(q)) approximates θ̂
The variance of the sequence (θ^(q)) gives confidence intervals
³ MixtComp software on the MASSICCC platform: https://massiccc.lille.inria.fr/
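A minimal SEM sketch for a univariate Gaussian mixture with a few missing values (an illustration with arbitrary data, settings, and missingness, not the MixtComp implementation): the S-step draws both the labels z and the missing values x^M, and averaging the chain approximates θ̂.

```python
import math, random

def gauss(x, mu, s):
    return math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi))

def sem_gmm(xs, K=2, iters=200, seed=1):
    """SEM for a univariate Gaussian mixture; None marks a missing value."""
    rng = random.Random(seed)
    obs = [x for x in xs if x is not None]
    mus = [min(obs) + (max(obs) - min(obs)) * k / (K - 1) for k in range(K)]
    sigmas, pis = [1.0] * K, [1.0 / K] * K
    trace = []
    for _ in range(iters):
        comp, filled = [], []
        for x in xs:
            if x is None:
                # S-step for a missing value: draw z from the prior, impute x
                z = rng.choices(range(K), weights=pis)[0]
                x = rng.gauss(mus[z], sigmas[z])
            else:
                # S-step for an observed value: draw z from its posterior t_ik
                w = [pis[k] * gauss(x, mus[k], sigmas[k]) for k in range(K)]
                z = rng.choices(range(K), weights=w)[0]
            comp.append(z)
            filled.append(x)
        # M-step: closed-form updates from the completed data (x^O, x^M(q), z^(q))
        for k in range(K):
            xk = [x for x, z in zip(filled, comp) if z == k]
            if not xk:
                continue
            pis[k] = len(xk) / len(xs)
            mus[k] = sum(xk) / len(xk)
            sigmas[k] = max(math.sqrt(sum((x - mus[k]) ** 2 for x in xk) / len(xk)), 1e-3)
        trace.append(list(mus))
    # Averaging the chain after burn-in approximates the MLE
    burn = iters // 2
    return [sum(t[k] for t in trace[burn:]) / (iters - burn) for k in range(K)]

random.seed(2)
data = [random.gauss(0, 1) for _ in range(150)] + [random.gauss(6, 1) for _ in range(150)]
data[::25] = [None] * len(data[::25])  # knock out ~4% of the values
print(sorted(sem_gmm(data)))  # close to the true means 0 and 6
```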
Prostate cancer data (with missing data)⁴
Individuals: 506 patients with prostatic cancer, grouped on clinical criteria into two stages (3 and 4) of the disease
Variables: d = 12 pre-trial variates were measured on each patient: eight continuous variables (age, weight, systolic blood pressure, diastolic blood pressure, serum haemoglobin, size of primary tumour, index of tumour stage and histologic grade, serum prostatic acid phosphatase) and four categorical variables with various numbers of levels (performance rating, cardiovascular disease history, electrocardiogram code, bone metastases)
Some missing data: 62 missing values (≈ 1%)
We ignore the classes (stages of the disease) when performing clustering
Questions
How many clusters?
Which partition?
⁴ Byar DP, Green SB (1980): Bulletin Cancer, Paris 67:477-488
Data upload without preprocessing
Run clustering analysis
Several quick result overviews. . . without post-processing
Variable significance on global partition
+ similarity between variables
Variable “SG” difference between clusters
Variable “BM” difference between clusters
Companies and MixtComp
Modal (RougeGorge)
InriaTech (Alstom, ArcelorMittal, Décathlon, ...)
DiagRAMS Technologies (predictive maintenance)
Why should I use MixtComp?
Advantage(s)
Analyses different kinds of data
Handles missing and partially missing data
Disadvantage(s)
Does not use the correlation structure (even with continuous data)
High-dimensional (HD) data⁵
⁵ S. Alelyani, J. Tang and H. Liu (2013). Feature Selection for Clustering: A Review. Data Clustering: Algorithms and Applications, 29
From clustering to co-clustering
[Govaert, 2011]
Notations
zi: the cluster of row i
wj: the cluster of column j
(zi, wj): the block of the element xij (row i, column j)
z = (z1, . . . , zn): partition of the individuals into K clusters of rows
w = (w1, . . . , wd): partition of the variables into L clusters of columns
(z, w): bi-partition of the whole data set x
The two partition spaces are denoted by Z and W respectively
Restriction
All variables are of the same kind (research in progress for overcoming that. . . )
MLE estimation: EM algorithm
Observed log-likelihood: ℓ(θ; x) = log p(x; θ)
Complete log-likelihood:
ℓc(θ; x, z, w) = log p(x, z, w; θ)
             = Σ_{i,k} zik log πk + Σ_{j,l} wjl log ρl + Σ_{i,j,k,l} zik wjl log p(xij; αkl)
E-step of EM (iteration q):
Q(θ, θ^(q)) = E[ℓc(θ; x, z, w) | x; θ^(q)]
            = Σ_{i,k} t_{ik}^(q) ln πk + Σ_{j,l} s_{jl}^(q) ln ρl + Σ_{i,j,k,l} e_{ijkl}^(q) ln p(xij; αkl)
with t_{ik}^(q) = p(zi = k | x; θ^(q)), s_{jl}^(q) = p(wj = l | x; θ^(q)) and e_{ijkl}^(q) = p(zi = k, wj = l | x; θ^(q))
M-step of EM (iteration q): classical. For instance, in the Bernoulli case:
π_k^(q+1) = Σ_i t_{ik}^(q) / n,   ρ_l^(q+1) = Σ_j s_{jl}^(q) / d,   α_{kl}^(q+1) = Σ_{i,j} e_{ijkl}^(q) xij / Σ_{i,j} e_{ijkl}^(q)
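The Bernoulli M-step above reduces to weighted averages once the joint responsibilities e_ijkl are available. A self-contained sketch (hard 0/1 responsibilities built from a perfectly block-structured toy matrix; illustrative only, not the BlockCluster code):

```python
def lbm_m_step(x, e, K, L):
    """Bernoulli latent block model M-step from joint responsibilities e.

    x: n x d binary matrix; e[i][j][k][l] plays the role of p(zi=k, wj=l | x).
    """
    n, d = len(x), len(x[0])
    # Marginal responsibilities t_ik and s_jl recovered from e
    t = [[sum(e[i][j][k][l] for j in range(d) for l in range(L)) / d
          for k in range(K)] for i in range(n)]
    s = [[sum(e[i][j][k][l] for i in range(n) for k in range(K)) / n
          for l in range(L)] for j in range(d)]
    pi = [sum(t[i][k] for i in range(n)) / n for k in range(K)]
    rho = [sum(s[j][l] for j in range(d)) / d for l in range(L)]
    alpha = [[sum(e[i][j][k][l] * x[i][j] for i in range(n) for j in range(d))
              / sum(e[i][j][k][l] for i in range(n) for j in range(d))
              for l in range(L)] for k in range(K)]
    return pi, rho, alpha

# Toy 4x4 binary matrix with an exact 2x2 block structure
x = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
z, w = [0, 0, 1, 1], [0, 0, 1, 1]  # true row / column clusters
e = [[[[1.0 if (z[i] == k and w[j] == l) else 0.0 for l in range(2)]
       for k in range(2)] for j in range(4)] for i in range(4)]
pi, rho, alpha = lbm_m_step(x, e, 2, 2)
print(pi, rho, alpha)
```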
MLE: intractable E-step
e_{ijkl}^(q) is usually intractable. . .
Consequence of the dependency between the xij's (link between rows and columns)
Involves K^n × L^d computations (number of possible bi-partitions)
Example: if n = d = 20 and K = L = 2, then about 10^12 bi-partitions
Example (cont'd): roughly 35 years with a computer evaluating 1,000 bi-partitions per second
Alternatives to EM
Variational EM (numerical approx.): conditional independence assumption
p(z, w | x; θ) ≈ p(z | x; θ) p(w | x; θ)
SEM-Gibbs (stochastic approx.): replace the E-step by an S-step approximated by Gibbs sampling from
z | x, w; θ and w | x, z; θ
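The combinatorial argument can be checked directly (the figures below are recomputed for this write-up, assuming one bi-partition evaluated per millisecond):

```python
# Number of possible bi-partitions: K^n row labelings times L^d column labelings
n = d = 20
K = L = 2
blocks = K ** n * L ** d
print(blocks)  # 2^40 = 1099511627776, i.e. about 10^12

# At 1,000 evaluations per second, an exhaustive E-step is hopeless:
years = blocks / 1_000 / (3600 * 24 * 365)
print(round(years, 1))  # about 34.9 years
```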
Document clustering (1/2)
Mixture of 1033 medical abstracts and 1398 aeronautics abstracts
Rows: 2431 documents
Columns: the words present (stop words excluded), i.e. 9275 unique words
Data matrix: document × word counts
Poisson model
Document clustering (2/2)
Results with 2×2 blocks:

           Medline  Cranfield
Medline       1033          0
Cranfield        0       1398
Running BlockCluster
Why should I use BlockCluster?
Advantage(s)
Co-clustering (HD data)
Disadvantage(s)
Analyses only one kind of data at a time
Current work
Mixmod
Missing Not At Random data (using a logit distribution) ⇒ missing values impact the probability of missingness
Partnership between CMAP/Inria/Traumabase
Work with C. Biernacki, G. Celeux, J. Josse and Y. Stroppa
MixtComp
Publish an R package on CRAN
With you?
New kinds of data
Complex missingness
2D cluster plots
Use probabilistic modelling as a mathematical guideline
Use the MASSICCC platform for a user-friendly implementation
User-friendly interpretation
“One for all” of clustering
Low computational requirements
Free software
https://massiccc.lille.inria.fr/
Also check the R packages (https://cran.r-project.org/)
Rmixmod: https://cran.r-project.org/web/packages/Rmixmod/index.html
blockcluster: https://cran.r-project.org/web/packages/blockcluster/index.html
MERCI! (Thank you!)