DYS875-006

Reproducible Clusters from Microarray Research:
Whither?
Garge, Nikhil et al.

Gota Morota

Dec. 9, 2009


Seminar

Clustering Gene Expression Profiles

Given: expression profiles for a set of genes or
experiments/individuals/time points
Do: organize profiles into clusters such that
genes in the same cluster are highly similar to each other
genes from different clusters have low similarity to each other

Goal
Understand general characteristics of data and infer something
about a gene based on how it relates to other genes


Validity of Clustering Analysis
Clustering presents challenges because
there is no null hypothesis to test and no right answer
the result of clustering may be method sensitive (distance
metric, clustering algorithm)
there is no direct way to evaluate the validity of a cluster solution

⇓
Measure the replicability of clustering algorithms.
Algorithms that produce classifications with greater replicability
would be considered more valid.

Objective
Determine the replicability and degree of stability of commonly
used non-hierarchical clustering algorithms

Data

Real datasets

Simulated datasets

Table 1: List of microarray datasets considered for the study. Table 1 contains two blocks of datasets. Each dataset is described by its
name, source, and sample size (n). Table 1 shows 37 datasets: the first block lists 19 datasets and the second block lists 18
datasets.

Name of the dataset                                                      Source        Sample size (n)
GDS22                                                                    GEO           80
GDS171                                                                   GEO           30
GDS184                                                                   GEO           30
GDS232                                                                   GEO           46
GDS274                                                                   GEO           80
GDS285                                                                   GEO           20
GDS365                                                                   GEO           66
GDS465                                                                   GEO           90
GDS331                                                                   GEO           70
GDS534                                                                   GEO           74
GDS565                                                                   GEO           48
GDS427                                                                   GEO           24
GDS402                                                                   GEO           12
GDS356                                                                   GEO           14
GDS389                                                                   GEO           16
GDS388                                                                   GEO           18
GDS352                                                                   GEO           12
GDS531                                                                   GEO           172
GDS535                                                                   GEO           12

Name of the dataset                                                      Source        Sample size (n)
Leukemia dataset                                                         [30]          70
Medulloblastoma Data Set                                                 [31]          34
Prostate Cancer dataset                                                  [32]          100
Gaffney Head and Neck data                                               [33]          60
Affymetrix Hu133A Latin Square                                           [34]          42
CNGI design experiment                                                   Unpublished   24
Paired pre and post euglycaemic insulin clamp skeletal muscle biopsies   Unpublished   106
GDS156                                                                   GEO           12
GDS254                                                                   GEO           16
GDS268                                                                   GEO           24
GDS287                                                                   GEO           16
GDS288                                                                   GEO           16
GDS472                                                                   GEO           14
GDS473                                                                   GEO           12
GDS511                                                                   GEO           12
GDS520                                                                   GEO           20
GDS564                                                                   GEO           28
GDS540                                                                   GEO           18

Table 2: List of simulated microarray datasets. Table 2 shows the
details of the simulated datasets. Each of these datasets has a
clustering structure of k = 6 (six clusters) with correlation ρ set to
(0.33)^{1/2}.

Dataset name   Sample size   Number of genes   Clusters
Dataset1       20            1200              6
Dataset2       100           1200              6
Dataset3       200           1200              6
Dataset4       500           1200              6
Dataset5       1000          1200              6
Dataset6       40            1200              6
Dataset7       60            1200              6
Dataset8       80            1200              6
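One way to obtain the structure described for Table 2 (correlation ρ between genes in the same cluster, zero between clusters) is a shared latent profile per cluster. The sketch below is only an illustration under that assumption; the slides do not say how the authors generated their data, and the names `simulate_expression` and `pearson` are hypothetical.

```python
import math
import random

def simulate_expression(n_subjects, n_genes=1200, k=6, rho=math.sqrt(0.33), seed=0):
    """Return (data, labels); data[g][s] is the value of gene g in subject s."""
    rng = random.Random(seed)
    # One latent profile per cluster, shared by all genes of that cluster (assumption).
    factors = [[rng.gauss(0.0, 1.0) for _ in range(n_subjects)] for _ in range(k)]
    labels = [g % k for g in range(n_genes)]
    # Loadings chosen so each gene has unit variance and same-cluster genes correlate at rho.
    shared, unique = math.sqrt(rho), math.sqrt(1.0 - rho)
    data = [
        [shared * factors[c][s] + unique * rng.gauss(0.0, 1.0) for s in range(n_subjects)]
        for c in labels
    ]
    return data, labels

def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

With many subjects, `pearson` of two genes sharing a label should be close to (0.33)^{1/2} ≈ 0.57, and close to zero for genes with different labels.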

Simulated datasets
We simulated 8 datasets with 1200 genes and sample sizes ranging from n = 20 to 1000, where n is the number of
subjects. All simulated datasets were structured for 6 clusters (k = 6) with correlation ρ set to (0.33)^{1/2} for all
pairwise combinations of genes within clusters and zero for all pairwise combinations of genes in different clusters.
In order to validate our methodology, we would predict higher scores when we extract 6 clusters in our fitted solutions.
Simulated datasets also help us understand the stability behaviour for values other than k = 6 (i.e., when we extract
the wrong number of clusters). Table 2 explains the details of the simulated datasets.

Missing values
Filtered out genes which contained at least one missing value.
If we represent microarray data as a matrix with rows representing genes and columns representing chips or samples,
we filtered out all rows which contained at least one null expression or missing value, because we do not know the exact
source(s) of the missing/null value observation. Missing data can be due to array damage, transcription errors, etc.
Conventional algorithms for clustering require complete datasets to run, and extending these clustering routines to
accommodate missing data was beyond the scope of our inquiry.

Standardization
Standardized the expression values to mean zero and unit variance:

Z_{ij} = \frac{Ig_{ij} - \overline{Ig}_{i}}{SD_{g_i}}

where Z_{ij} is the Z score computed for the expression level observed for gene i in sample/subject j, Ig_{ij} is the
corresponding intensity, and \overline{Ig}_{i} and SD_{g_i} are the mean and standard deviation of the intensities of gene i.
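The two preprocessing steps on this slide (dropping genes with any missing value, then standardizing each gene to mean zero and unit variance) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `standardize_genes` is hypothetical, missing values are assumed to be encoded as `None` or NaN, and the sample standard deviation (n − 1 denominator) is an assumption.

```python
import math
import statistics

def standardize_genes(expr):
    """expr[g][s] = intensity of gene g in sample s; returns the z-scored complete rows."""
    standardized = []
    for row in expr:
        # Filter out any gene with at least one missing value.
        if any(v is None or math.isnan(v) for v in row):
            continue
        mean = statistics.mean(row)
        sd = statistics.stdev(row)  # sample SD (assumed); a constant gene would give sd == 0
        standardized.append([(v - mean) / sd for v in row])
    return standardized
```

For example, the rows `[1, 2, 3]` and `[10, 20, 30]` both become `[-1, 0, 1]`, which shows that standardization removes differences in scale between genes and leaves only the shape of each profile.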

Four Algorithms Considered

Four non-hierarchical (partitional) clustering algorithms.
Non-hierarchical clustering requires the number of clusters (k) to be
pre-specified.
K-means (kmeans {stats})
Self-Organizing Maps (SOM) (som {cluster})
Clustering LARge Applications (CLARA) (clara {cluster})
Fuzzy C-means (fanny {cluster})


K-means
1. Choose the number of clusters k
2. Randomly assign items to the k clusters
3. Calculate the new centroid for each of the k clusters
4. Calculate the distance of all items to the k centroids
5. Assign items to the closest centroid
6. Repeat until cluster assignments are stable

K-Means Clustering
• assume our instances are represented by vectors of real values
• put k cluster centers in the same space as instances
• each cluster is represented by a vector f_j
• consider an example in which our vectors have 2 dimensions

[Figure: instances and cluster centers (+) in two dimensions]

Figure 1: K-Means
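The six steps listed on this slide can be sketched in a few lines of standard-library Python. This is an illustrative implementation only; it is not the R `kmeans` routine, and the handling of an empty cluster (re-seeding it with a randomly chosen item) is an assumption the slides do not cover.

```python
import math
import random

def kmeans(items, k, max_iter=100, seed=0):
    """Return (labels, centroids) for a list of equal-length numeric vectors."""
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in items]            # step 2: random assignment
    for _ in range(max_iter):
        centroids = []
        for j in range(k):                                 # step 3: new centroids
            members = [p for p, lab in zip(items, labels) if lab == j]
            if not members:                                # empty cluster: re-seed (assumption)
                members = [rng.choice(items)]
            centroids.append([sum(col) / len(members) for col in zip(*members)])
        new_labels = [                                     # steps 4-5: nearest centroid
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in items
        ]
        if new_labels == labels:                           # step 6: assignments are stable
            break
        labels = new_labels
    return labels, centroids
```

Calling `kmeans(items, 2)` on two well-separated groups of points puts each group in its own cluster; which group receives label 0 depends on the random start.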

K-means

K-Means Clustering
• each iteration involves two steps
  1. assignment of instances to clusters
  2. re-computation of the means

[Figure: two panels, "assignment" and "re-computation of means", showing instances and cluster centers (+)]

Figure 2: K-Means


Other Clustering Methods
Self-Organizing Map (SOM)
Similar to K-means, but centroids are restricted to a
two-dimensional grid
Clustering LARge Applications (CLARA)
Extension of PAM (Partitioning Around Medoids)
that can deal with much larger datasets than PAM
Fuzzy C-means
each gene belongs to every cluster with a
membership degree between 0 and 1
so a gene can effectively be assigned to more than one cluster
for a hard partition, assign the gene to the cluster showing the maximum degree of
membership
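To make the idea of membership degrees concrete, the sketch below computes them for one item given fixed cluster centers, using the textbook fuzzy C-means membership formula with fuzzifier m = 2, and then takes the maximum as the hard assignment. Both the formula choice and the names `fuzzy_memberships` and `hard_assign` are assumptions for illustration; the R routine `fanny` named in the slides may differ in detail.

```python
import math

def fuzzy_memberships(point, centers, m=2.0):
    """Membership of `point` in each cluster: u_j = 1 / sum_l (d_j / d_l)^(2 / (m - 1))."""
    # Clamp distances away from zero so a point sitting on a center does not divide by zero.
    dists = [max(math.dist(point, c), 1e-12) for c in centers]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_j / d_l) ** power for d_l in dists) for d_j in dists]

def hard_assign(point, centers, m=2.0):
    """Index of the cluster with the maximum degree of membership."""
    u = fuzzy_memberships(point, centers, m)
    return max(range(len(u)), key=u.__getitem__)
```

For a point at distance 1 from the first center and 3 from the second, the memberships are 0.9 and 0.1: they sum to one, and the hard assignment picks the first cluster.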


Cluster Stability
Cramér's V²

V^2 = \frac{\chi^2}{N(k - 1)}

where

χ² is the ordinary χ² test statistic for independence in
contingency tables

N is the number of genes to be clustered
k is the number of clusters extracted
Stability score
0: no relationship
1: perfect reproducibility
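The stability score can be computed directly from two label vectors over the same N genes by building the k × k contingency table and applying the formula above. The sketch below assumes labels are integers from 0 to k − 1 and that both clusterings use the same k; the function name `cramers_v2` is hypothetical.

```python
def cramers_v2(labels_a, labels_b, k):
    """Cramér's V² = chi² / (N (k - 1)) for two clusterings of the same N items."""
    n = len(labels_a)
    table = [[0] * k for _ in range(k)]
    for a, b in zip(labels_a, labels_b):
        table[a][b] += 1                       # cross-tabulate the two clusterings
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    chi2 = 0.0
    for i in range(k):
        for j in range(k):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:                   # skip cells of empty rows or columns
                chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2 / (n * (k - 1))
```

Because the statistic depends only on the contingency table, two clusterings that agree up to a renaming of the clusters still score 1, while two unrelated clusterings score 0.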

Approach to Compute Cluster Stability

1. Start with a microarray dataset with S subjects and N genes.
2. Split the dataset into “left” and “right” datasets, each with S/2 subjects.
3. Sub-sample the left dataset and the right dataset into sets of various sample sizes (3 to S/2).
4. For each sample size x (x ranges from 3 to S/2), take the left and the right sub-sampled set of sample size x.
5. Cluster the left set and the right set of sample size x with k (2 to 10) clusters.
6. Compute chi-square (χ²) between the two clustering results.
7. Cluster stability S(x,k) = Cramér's V².
The flowchart also carries the annotation "Repeat 3 times".

Figure 1 in Garge et al.: Algorithm: cluster stability computation. Cluster stability score S(x,k) is computed for every k (number of clusters) and
every pair of sub-sampled sets of sample size x.

Figure 3: Cluster Stability Computation
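The whole procedure for a single (x, k) pair can be sketched end to end as below. This is a rough illustration under several assumptions: K-means with random initial assignment stands in for whichever of the four algorithms is being evaluated, the flowchart's "Repeat 3 times" is interpreted as averaging the score over three sub-sampling repeats, and all function names are hypothetical.

```python
import math
import random

def kmeans_labels(rows, k, rng, max_iter=100):
    """Cluster rows (one per gene) with K-means started from a random assignment."""
    labels = [rng.randrange(k) for _ in rows]
    for _ in range(max_iter):
        centroids = []
        for j in range(k):
            # Empty cluster: re-seed with a random row (assumption).
            members = [r for r, lab in zip(rows, labels) if lab == j] or [rng.choice(rows)]
            centroids.append([sum(col) / len(members) for col in zip(*members)])
        new = [min(range(k), key=lambda j: math.dist(r, centroids[j])) for r in rows]
        if new == labels:
            break
        labels = new
    return labels

def cramers_v2(a, b, k):
    """Cramér's V² between two labelings of the same genes."""
    n = len(a)
    table = [[0] * k for _ in range(k)]
    for i, j in zip(a, b):
        table[i][j] += 1
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    chi2 = sum(
        (table[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
        for i in range(k) for j in range(k) if rows[i] and cols[j]
    )
    return chi2 / (n * (k - 1))

def stability(expr, x, k, repeats=3, seed=0):
    """S(x, k) for expr[g][s]: split subjects in half, sub-sample x from each half,
    cluster the genes in both sub-samples, and score the agreement."""
    rng = random.Random(seed)
    subjects = list(range(len(expr[0])))
    rng.shuffle(subjects)
    half = len(subjects) // 2
    left, right = subjects[:half], subjects[half:2 * half]
    scores = []
    for _ in range(repeats):
        sub_l, sub_r = rng.sample(left, x), rng.sample(right, x)
        lab_l = kmeans_labels([[g[s] for s in sub_l] for g in expr], k, rng)
        lab_r = kmeans_labels([[g[s] for s in sub_r] for g in expr], k, rng)
        scores.append(cramers_v2(lab_l, lab_r, k))
    return sum(scores) / len(scores)
```

Looping `stability` over x = 3 … S/2 and k = 2 … 10 would fill a table of scores with one row per sample size and one column per k. Note that the genes are the items being clustered, so the two halves share the same genes but use disjoint sets of subjects as features.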


Result on Real Datasets – (SOM)

Table 3: Stability results produced on a real dataset of sample size n = 16. We split the dataset into two halves, each containing
8 subjects. The left dataset is resampled 6 times, producing 6 samples of sample sizes 3 to 8, respectively. Similarly, the right
dataset is resampled to produce 6 samples. We measured the strength of the association between the clusters produced on every pair
of samples (one sample from the left and the other from the right dataset, both of the same sample size) using Cramér's V². Columns
in the table represent the number of clusters (k) and rows represent sample sizes. The stability score quantified for k = 10 and
sample size 8 is 0.3699, i.e. there is 37% agreement between the clusters produced (k = 10) on the pair of samples of sample size 8.

              K (CLUSTERS)
SAMPLE SIZE   2        3         4        5        6        7        8        9        10
3             0.5883   0.47091   0.4503   0.4028   0.3809   0.3600   0.3313   0.3107   0.2992
4             0.5799   0.48045   0.4244   0.3894   0.365    0.3469   0.3132   0.297    0.2858
5             0.5738   0.48296   0.4297   0.3982   0.3644   0.3430   0.3195   0.3013   0.2790
6             0.6433   0.54638   0.5142   0.4727   0.4405   0.4066   0.3817   0.3616   0.3396
7             0.6534   0.54821   0.5250   0.4826   0.4462   0.4211   0.3915   0.3679   0.348
8             0.6759   0.58447   0.5520   0.5045   0.4700   0.4592   0.4160   0.3975   0.3699

Figure 4: Stability result on a real dataset of sample size 16

Excerpts from Garge et al. (figure numbers in the excerpts refer to the paper):

CLARA and Fuzzy C-means maintained low stability scores until a sample size of 30 was attained. Stability scores then gradually
increased after this threshold. K-means and SOM showed superior stability scores as compared to CLARA until the sample size
attained n = 30. It is interesting to note that the average stability achieved is not greater than 0.55 for all four clustering
routines even when a sample size of n = 50 is attained.

As we deviate from k = 6, we observed a decline in stability scores. This phenomenon can be clearly observed in CLARA, K-means and
Fuzzy C-means (Figure 5). Hence, scores observed on k = 7 were always higher than those on k = 2, since k = 7 is nearer to k = 6
(Figure 5). Figure 4 shows results on simulated datasets for k = 6. We observed the following differences in stability behaviors
among the four clustering algorithms.

K-means showed high stability at smaller sample sizes as compared to the other methods.

Result on Real Datasets – among different algorithms
[Figure: "Real datasets". Stability coefficient (0 to 0.6) plotted against sample size (3 to 48) for SOM, K-means, Fuzzy C-means and CLARA.]

Figure 3 in Garge et al.: Cluster stability results. Stability scores for various values of k (2 to 10) are computed on all 37 datasets. For each
dataset, we selected the column (k) showing the maximum summation of scores across sample sizes. Finally, all 37 columns
selected on the 37 datasets were merged into one resultant column representing stability scores with respect to sample size
for that clustering routine.

Figure 5: Stability results on real datasets among different algorithms


Result on Simulated Datasets – 1

[Figure: "Simulated datasets for rho = sqrt(0.33) and k=6". Stability coefficient (0 to 1.2) plotted against sample size (3 to 476) for SOM, K-means, Fuzzy C-means and CLARA.]

Figure 4 in Garge et al.: Cluster stability results on simulated datasets for k = 6. Datasets are simulated with a clustering structure of k = 6 (6
clusters). The figure shows high stability scores observed for k = 6 on all four clustering routines.

• K-means, Fuzzy C-means and SOM showed fluctuation in scores even at large sample sizes, whereas CLARA
showed consistent behavior (constant level of scores) at larger sample sizes.
• CLARA maintained 100% stability for larger sample sizes (300–500), whereas SOM and Fuzzy C-means failed to do so.

Figure 6: Stability result on simulated dataset for k = 6

Result on Simulated Datasets – 2

[Figure: four panels, "Clara: rho = sqrt(0.33) & k=2 to 10", "K-means: rho = sqrt(0.33) & k=2 to 10", "Fuzzy C-means: rho = sqrt(0.33) & k=2 to 10" and "SOM: rho = sqrt(0.33) & k=2 to 10". Each panel plots the stability coefficient (0 to 1.2) against sample size (3 to 498), with one curve per value of k.]

Figure 5 in Garge et al.: Cluster stability results on simulated datasets for k = 2 to k = 10. Stability scores for various values of k (2 to 10) are
computed on all 8 simulated datasets. For each dataset, we generate an output table of scores (explained in the Algorithms
section). We merge all 8 output tables into one table, with each cell computed as the average of the corresponding cells
in the 8 tables. Finally, scores are plotted for all k values with respect to sample size. For cleaner visualization, we do not
show stability curves for all k values in Figure 5c and Figure 5d. (a) Scores plotted for CLARA for each k (2–10). (b) Scores plotted
for K-means for each k (2–10). (c) Scores plotted for Fuzzy C-means for each k (2–10). (d) Scores plotted for SOM for each k (2–10).
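The merging step in this caption (cell-wise averaging of the per-dataset score tables) and the column-selection rule quoted for the real datasets (pick the k whose column has the largest sum of scores across sample sizes) are both simple table operations. The sketch below assumes each table is a list of rows, one row per sample size and one column per k; the function names are hypothetical.

```python
def average_tables(tables):
    """Cell-wise mean of equally shaped score tables (rows = sample sizes, columns = k)."""
    n = len(tables)
    n_rows, n_cols = len(tables[0]), len(tables[0][0])
    return [[sum(t[r][c] for t in tables) / n for c in range(n_cols)] for r in range(n_rows)]

def best_k(table, k_values):
    """k whose column has the maximum summation of scores across sample sizes."""
    col_sums = [sum(row[c] for row in table) for c in range(len(k_values))]
    return k_values[max(range(len(k_values)), key=col_sums.__getitem__)]
```

Applied to the simulated datasets, one would expect `best_k` to return 6 if the routine recovers the simulated structure; that expectation comes from the paper's design, not from this sketch.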


Conclusion

microarray datasets may lack natural clustering structure
thereby producing low stability scores on all four methods
the algorithms studied may not be well suited to producing
reliable results
sample sizes typically used in microarray research may be too
small to support derivation of reliable clustering results


Garge, Nikhil et al. 2005. Reproducible Clusters from Microarray Research: Whither? BMC Bioinformatics.

  • 1. DYS875-006 Reproducible Clusters from Microarray Research: Whither? Garge, Nikhil et. al. Gota Morota Dec. 9, 2009 Gota Morota Reproducible Clusters from Microarray Research: Whither?
  • 2. DYS875-006 Seminar Clustering Gene Expression Profiles Given: expression profiles for a set of genes or experiments/individuals/time points Do: organize profiles into clusters such that genes in the same cluster are highly similar to each other genes from different clusters have low similarity to each other Goal Understand general characteristics of data and infer something about a gene based on how it relates to other genes Gota Morota Reproducible Clusters from Microarray Research: Whither?
  • 3. Seminar DYS875-006 Validity of Clustering Analysis Clustering presents challenges because there is no null hypothesis to test and no right answer the result of clustering may be method sensitive (distance metric, clustering algorithm) no way to evaluate the validity of a cluster solution ⇓ Measure replicability of clustering algorithms. Clusters that produce classifications with greater replicability would be considered more valid. Objective Determine the replicability and degree of stability of commonly used non-hierarchical clustering algorithms Gota Morota Reproducible Clusters from Microarray Research: Whither?
  • 4. DYS875-006 Seminar Data !"#$!%&%'(&)*+,%-.!"##$%!!&'())*!"+,'-# Real datasets Simulated datasets !"#$!%&%'(&)*+,%-.!"##$%!!&'())*!"+,'-# Table 1: List of microarray datasets considered for the study. Table 1 contains two columns of datasets. Each dataset is described by its name, source, and sample size (n). Table 1 shows 39 datasets. The first 3 columns list 19 datasets and last three columns describe 18 datasets. Name of the dataset Source Sample size (n) Name of the dataset Source Sample size (n) GDS22 GDS171 GDS184 GDS232 GDS274 GDS285 GDS365 GEO GEO GEO GEO GEO GEO GEO 80 30 30 46 80 20 66 [30] [31] [32] [33] [34] Unpublished Unpublished 70 34 100 60 42 24 106 GDS465 GDS331 GDS534 GDS565 GDS427 GDS402 GDS356 GDS389 GDS388 GDS352 GDS531 GDS535 GEO GEO GEO GEO GEO GEO GEO GEO GEO GEO GEO GEO 90 70 74 48 24 12 14 16 18 12 172 12 Leukemia dataset Medulloblastoma Data Set Prostate Cancer dataset Gaffney Head and Neck data Affymetrix Hu133A Latin Square CNGI design experiment Paired pre and post euglycaemic insulin clamp skeletal muscle biopsies GDS156 GDS254 GDS268 GDS287 GDS288 GDS472 GDS473 GDS511 GDS520 GDS564 GDS540 GEO GEO GEO GEO GEO GEO GEO GEO GEO GEO GEO 12 16 24 16 16 14 12 12 20 28 18 Table 2: List of simulated microarray datasets. Table 2 show the details of simulated datasets. Each of these datasets has clustering structure k = 6 (six clusters) with correlation ! set to (0.33)1/2. 3. Compute th not updated un the data. Dataset Name Sample size Number of genes Clusters 4. Alternate ste ters. Dataset1 Dataset2 Dataset3 Dataset4 Dataset5 Dataset6 Dataset7 Dataset8 20 100 200 500 1000 40 60 80 1200 1200 1200 1200 1200 1200 1200 1200 6 6 6 6 6 6 6 6 We consider methods, whic algorithms for package. K-means In K-means clu ters and rando Filtered out genes values which contained at least one missing value ters. If a gene Missing not available in real datasets. 
We simulated 8 datasets If we represent microarray data as a matrix with rows repwith 1200 genes and sample sizes ranging from n = 20 to cluster, as asse resenting genes and columns representing chips or 1000, where n is the number of subjects. All simulated Igij ! Standardized the we filtered out all rows which contained at leastsam- meanIgizero and unit ples, expression values to one datasets were structured for 6 clusters (k = 6) with correlaPearson's corre Zij = null expression or missing value because we do not know tion ! set to (0.33) for all pairwise combinations of SDgi genes within clusters and zero for all pair wise combinasource(s) for the observation. be assigned to variance to validate our the exactdata can be due missing/null valuetranscription tions of genes in different clusters. In order Missing to array damage, to the closest cl methodology, we would predict higher scores when we errors, etc. Conventional algorithms for clustering require extract 6 clusters in our fitted solutions. Simulated datacomplete datasets to run and extending these clustering Where Zij = Z score computed for expression level recalculated. A sets also help us understand the stability behaviour for routines to accommodate missing data was beyond the values other than k = 6 (i.e., when we extract the wrong scope of our inquiry. troids will no l number of clusters). Table 2 explains the details of simuobserved for gene i in sample/subject j, Ig = intensity Standardization lated datasets. We acknowledge that number of genes in Gota Morota Reproducible Clusters from Microarray Research:ij Whither? K-means cluste 1/2
  • 5. DYS875-006 Seminar Four Algorithms Considered Four non-hierarchical (partitional) clustering algorithms. Non-hierarchical clusterings require the number of clusters (k) be pre-specified. K-means ( kmeans {stats} ) Self Organizing Maps (SOM) (som { cluster }) Clustering LARge Applications (CLARA) (clara { cluster }) Fuzzy C-means (fanny { cluster }) Gota Morota Reproducible Clusters from Microarray Research: Whither?
  • 6. Seminar DYS875-006 K-means 1 2 3 4 5 6 K-Means Clustering Choose the number of k clusters Randomly assign items to the k clusters •! assume our instances are represented by vectors of real Calculate new centroid for each of the k clusters values Calculate the distance of all items to the k centroids •! put k cluster centers in same space as instances ! Assign items to closest centroid •! each cluster is represented by a vector f j Repeat until clusters assignments are stable •! consider an example in which our vectors have 2 dimensions + instances + + cluster center + Figure 1: K-Means Gota Morota Reproducible Clusters from Microarray Research: Whither?
  • 7. DYS875-006 Seminar K-means K-Means Clustering Each iteration involves two steps 1 •! each iteration involves two steps assignment of instances to clusters 2 –! assignment the means re-computation of of instances to clusters –! re-computation of the means + + + + + + + + assignment re-computation of means Figure 2: K-Means Gota Morota Reproducible Clusters from Microarray Research: Whither?
  • 8. Other Clustering Methods
    Self-Organizing Map (SOM)
      Similar to K-means, but centroids are restricted to a two-dimensional grid
    Clustering LARge Applications (CLARA)
      Extension of PAM (Partitioning Around Medoids)
      It can deal with much larger datasets than PAM
    Fuzzy C-means
      Each gene belongs to each cluster to a degree specified by a membership degree (0-1)
      In effect, a gene can be assigned to more than one cluster
      Assign the gene to the cluster showing the maximum degree of membership
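The last Fuzzy C-means step, turning membership degrees into a single cluster label per gene, can be illustrated with a short sketch. The membership values below are made up for illustration, and `harden` is a hypothetical helper name, not part of `fanny {cluster}`.

```python
def harden(memberships):
    """Return, for each gene, the index of the cluster with the largest membership degree."""
    return [max(range(len(row)), key=row.__getitem__) for row in memberships]


# One row per gene, one column per cluster; each row sums to 1.
memberships = [
    [0.70, 0.20, 0.10],
    [0.05, 0.15, 0.80],
    [0.40, 0.45, 0.15],
]
labels = harden(memberships)  # [0, 2, 1]
```

The third gene shows why this step loses information: its memberships of 0.40 and 0.45 are nearly tied, yet it receives a single hard label.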
  • 9. Cluster Stability
    Cramér's V²
      $V^2 = \dfrac{\chi^2}{N(k - 1)}$
    where
      χ² is the ordinary χ² test statistic for independence in contingency tables
      N is the number of genes to be clustered
      k is the number of clusters extracted
    Stability score
      0: no relationship
      1: perfect reproducibility
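As a worked illustration of the formula above, the following sketch computes χ² from the contingency table of two cluster labelings of the same genes and divides by N(k - 1). It assumes both labelings use k clusters; the function name `cramers_v2` is hypothetical.

```python
from collections import Counter


def cramers_v2(labels_a, labels_b, k):
    """Cramér's V² between two labelings of the same N genes, with k clusters each."""
    n = len(labels_a)
    table = Counter(zip(labels_a, labels_b))  # observed cell counts
    row_totals = Counter(labels_a)
    col_totals = Counter(labels_b)
    chi2 = 0.0
    for r in row_totals:
        for c in col_totals:
            expected = row_totals[r] * col_totals[c] / n
            observed = table.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    return chi2 / (n * (k - 1))


# Same partition with permuted cluster numbers: perfect reproducibility.
same = cramers_v2([0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 1], k=3)
# Labelings that are statistically independent: no relationship.
unrelated = cramers_v2([0, 0, 1, 1], [0, 1, 0, 1], k=2)
```

Because the statistic is computed from the contingency table, it does not depend on how the clusters are numbered, which matters when the two clusterings come from independent runs.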
  • 10. Approach to Compute Cluster Stability
    1. Start from a microarray dataset with S subjects and N genes
    2. Split the dataset into "left" and "right" datasets, each with S/2 subjects
    3. Sub-sample the left dataset and the right dataset into sets of various sample sizes x (x ranges from 3 to S/2)
    4. Cluster the left set and the right set of sample size x with k (2 to 10) clusters
    5. Compute chi-square (χ²) between the two clustering results
    6. Cluster stability S(x,k) = Cramér's V²
    Repeat 3 times
    Figure 3: Cluster Stability Computation (Figure 1 of the paper, "Algorithm: cluster stability computation": the cluster stability score S(x,k) is computed for every k (number of clusters) and every pair of sub-sampled sets of sample size x)
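The flow above can be sketched as a small driver function. This is a sketch under stated assumptions: the clustering routine and the scoring function are passed in as callables (`cluster_fn` and `score_fn` are hypothetical parameter names), subjects are split at random, and the "repeat 3 times" step is not modeled because the slide does not say exactly what is repeated.

```python
import random


def stability_table(data, cluster_fn, score_fn, k_values=range(2, 11), seed=0):
    """data: list of gene rows, each a list of S subject values.

    Returns a dict mapping (x, k) to the stability score S(x, k).
    """
    rng = random.Random(seed)
    n_subjects = len(data[0])
    subjects = list(range(n_subjects))
    rng.shuffle(subjects)
    half = n_subjects // 2
    left, right = subjects[:half], subjects[half:2 * half]

    scores = {}
    for x in range(3, half + 1):
        # Sub-sample x subjects from each half.
        left_cols = rng.sample(left, x)
        right_cols = rng.sample(right, x)
        left_set = [[row[j] for j in left_cols] for row in data]
        right_set = [[row[j] for j in right_cols] for row in data]
        for k in k_values:
            # Cluster the genes in each sub-sample, then compare the labelings.
            scores[(x, k)] = score_fn(cluster_fn(left_set, k), cluster_fn(right_set, k), k)
    return scores


# Stand-in callables, for illustration only (not real clustering or scoring).
def toy_cluster(gene_rows, k):
    return [i % k for i in range(len(gene_rows))]


def toy_score(labels_a, labels_b, k):
    return 1.0 if labels_a == labels_b else 0.0


toy_data = [[float(g + s) for s in range(16)] for g in range(6)]
scores = stability_table(toy_data, toy_cluster, toy_score)
```

With S = 16 subjects this yields sample sizes 3 to 8 and k from 2 to 10, i.e. the same 6 by 9 layout as the stability tables reported for the real datasets.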
  • 11. Result on Real Datasets – (SOM)
    Table 3: Stability results produced on a real dataset of sample size n = 16. We split the dataset into two halves, each containing 8 subjects. The left dataset is resampled 6 times, producing 6 samples of sample sizes 3 to 8, respectively. Similarly, the right dataset is resampled to produce 6 samples. We measured the strength of the association between the clusters produced on every pair of samples (one sample from the left and the other from the right dataset, both of the same sample size) using Cramér's V². Columns represent the number of clusters (k) and rows represent sample sizes. The stability score for k = 10 and sample size 8 is 0.3699, i.e. there is 37% agreement between the clusters produced (k = 10) on that pair of samples.

    Sample size   k = 2    k = 3    k = 4   k = 5   k = 6   k = 7   k = 8   k = 9   k = 10
    3             0.5883   0.47091  0.4503  0.4028  0.3809  0.3600  0.3313  0.3107  0.2992
    4             0.5799   0.48045  0.4244  0.3894  0.365   0.3469  0.3132  0.297   0.2858
    5             0.5738   0.48296  0.4297  0.3982  0.3644  0.3430  0.3195  0.3013  0.2790
    6             0.6433   0.54638  0.5142  0.4727  0.4405  0.4066  0.3817  0.3616  0.3396
    7             0.6534   0.54821  0.5250  0.4826  0.4462  0.4211  0.3915  0.3679  0.348
    8             0.6759   0.58447  0.5520  0.5045  0.4700  0.4592  0.4160  0.3975  0.3699

    Figure 4: Stability result on a real dataset of sample size 16

    [...] sample size. CLARA and Fuzzy C-means, however, maintained low stability scores until a sample size of 30 was attained. Stability scores then gradually increased after this threshold. K-means and SOM showed superior stability scores as compared to CLARA until the sample size attained n = 30. It is interesting to note that the average stability achieved is not greater than 0.55 for all four clustering routines even when a sample size of n = 50 is attained.

    [...] we deviate from k = 6, we observed a decline in stability scores. This phenomenon can be clearly observed in CLARA, K-means and Fuzzy C-means (Figure 5). Hence, scores observed on k = 7 were always higher than those on k = 2, since k = 7 is nearer to k = 6 (Figure 5). Figure 4 shows results on simulated datasets for k = 6. We observed the following differences in stability behaviors among the four clustering algorithms.
  • 12. Result on Real Datasets – among different algorithms
    [Plot "Real datasets": stability coefficient (0 to 0.6) versus sample size (3 to 48) for SOM, K-means, Fuzzy C-means and CLARA]
    Figure 3 (paper): Cluster stability results. Stability scores for various values of k (2 to 10) are computed on all 37 datasets. For each dataset, we selected a column (k) showing the maximum summation of scores across sample size. Finally, all 37 columns selected on the 37 datasets were merged into one resultant column representing stability scores with respect to sample size for that clustering routine.
    [...] until sample size reached n = 100. K-means showed high stability at smaller sample sizes as compared to the other methods.
    Figure 5: Stability result on a real dataset of sample size 16
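The column-selection rule in the caption above (pick the k whose scores sum to the largest total across sample sizes) can be illustrated on the Table 3 values reported for SOM on the real dataset with n = 16. This sketch covers only the selection for a single dataset; how the 37 selected columns were merged is not specified here and is not modeled. The name `best_column` is hypothetical.

```python
# Table 3 scores (SOM, one real dataset with n = 16): one list per k,
# ordered by sample size 3, 4, 5, 6, 7, 8.
table3 = {
    2: [0.5883, 0.5799, 0.5738, 0.6433, 0.6534, 0.6759],
    3: [0.47091, 0.48045, 0.48296, 0.54638, 0.54821, 0.58447],
    4: [0.4503, 0.4244, 0.4297, 0.5142, 0.5250, 0.5520],
    5: [0.4028, 0.3894, 0.3982, 0.4727, 0.4826, 0.5045],
    6: [0.3809, 0.365, 0.3644, 0.4405, 0.4462, 0.4700],
    7: [0.3600, 0.3469, 0.3430, 0.4066, 0.4211, 0.4592],
    8: [0.3313, 0.3132, 0.3195, 0.3817, 0.3915, 0.4160],
    9: [0.3107, 0.297, 0.3013, 0.3616, 0.3679, 0.3975],
    10: [0.2992, 0.2858, 0.2790, 0.3396, 0.348, 0.3699],
}


def best_column(table):
    """Return the k whose scores have the largest sum across sample sizes."""
    return max(table, key=lambda k: sum(table[k]))


best_k = best_column(table3)   # 2 for this table
selected = table3[best_k]      # the column carried forward for this dataset
```

For this dataset the scores decrease as k grows at every sample size, so the rule selects k = 2.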
  • 13. Result on Simulated Datasets – 1
    [Plot "Simulated datasets for rho = sqrt(0.33) and k=6": stability coefficient (0 to 1.2) versus sample size (3 to 476) for SOM, K-means, Fuzzy C-means and CLARA]
    Figure 4 (paper): Cluster stability results on simulated datasets for k = 6. Datasets are simulated with a clustering structure k = 6 (6 clusters). The figure shows high stability scores observed for k = 6 on all four clustering routines.
    K-means, Fuzzy C-means and SOM showed fluctuation in scores even at large sample sizes, whereas CLARA showed consistent behavior (constant level of scores) at larger sample sizes.
    CLARA maintained 100% stability for larger sample sizes (300–500), whereas SOM and Fuzzy C-means failed [...]
    Figure 6: Stability result on simulated dataset for k = 6
  • 14. Result on Simulated Datasets – 2
    [Four plots of stability coefficient (0 to 1.2) versus sample size (3 to 498), one curve per k: "Clara: rho = sqrt(0.33) & k=2 to 10", "K-means: rho = sqrt(0.33) & k=2 to 10", "Fuzzy C-means: rho = sqrt(0.33) & k=2 to 10", "SOM: rho = sqrt(0.33) & k=2 to 10"]
    Figure 5 (paper): Cluster stability results on simulated datasets for k = 2 to k = 10. Stability scores for various values of k (2 to 10) are computed on all the 8 simulated datasets. For each dataset, we generate an output table of scores (explained in the Algorithms section). We merge all the 8 output tables into one table, with each cell computed as the average of the corresponding cells in the 8 tables. Finally, scores are plotted for all k values with respect to sample size. For cleaner visualization, we do not show stability curves for all k values in figure 5c and figure 5d. (a) Scores plotted for CLARA for each k (2–10). (b) Scores plotted for K-means for each k (2–10). (c) Scores plotted for Fuzzy C-means for each k (2–10). (d) Scores plotted for SOM for each k (2–10).
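The merging step in the caption above (each cell of the final table is the average of the corresponding cells in the per-dataset tables) can be sketched as follows. The sketch assumes all tables have the same shape (rows are sample sizes, columns are k); how cells were handled when datasets of different sizes produce tables with different numbers of rows is not stated, so that case is not covered. The values are toy numbers and `average_tables` is a hypothetical name.

```python
def average_tables(tables):
    """Cell-wise average of equally shaped score tables (lists of rows)."""
    n_rows, n_cols = len(tables[0]), len(tables[0][0])
    return [
        [sum(t[i][j] for t in tables) / len(tables) for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Two toy output tables: rows are sample sizes, columns are values of k.
table_a = [[0.2, 0.4], [0.6, 0.8]]
table_b = [[0.4, 0.8], [0.8, 1.0]]
merged = average_tables([table_a, table_b])
```

Each column of the merged table then gives one stability curve (score versus sample size) for a single value of k.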
  • 15. Conclusion
    Microarray datasets may lack natural clustering structure, thereby producing low stability scores on all four methods
    The algorithms studied may not be well suited to producing reliable results
    Sample sizes typically used in microarray research may be too small to support derivation of reliable clustering results