Liangqun Lu
2018 - 04 - 25
Outline
● Background on Data Integration
○ Biological regulation
○ Omic data integration objectives
○ Data Integration Challenges
● Unsupervised methods and Application
○ Matrix factorization methods (iCluster+)
○ Bayesian methods (BCC)
○ Network-based methods (SNF)
○ Multiple Kernel Learning and Multi-Step Analysis (rMKL-LPP)
2
Biological regulation
● Central dogma
3
Gene Regulatory Network
Regulatory elements
● Receptors
● Transcription factors
● Inhibitory factors
● Cis- and trans-regulatory elements
Source: https://en.wikipedia.org/wiki/Gene_regulatory_network
4
Rich data
5
Single omic study
● Data from a single omic layer are used to
explain the diagnosis and progression of
complex disorders
● Information is limited
● Different layers of the biological
system are related and
interdependent
6
Omic data integration objectives
● Promote precision medicine from big data
● Multiview investigation of the
completeness and complexity of the
biological system
● Discover hidden biological regularities
● Make use of complementary information
and discover biomarkers for diagnosis,
progression and treatment of human
diseases
7
Data Integration Challenges (Computational Perspective)
● Data integration is broad
● Data heterogeneity
● Data unification
● Data noise and bias
● Data integration and dimensionality reduction
8
9
Unsupervised classification
● Matrix factorization methods (iCluster and iCluster+)
○ Assumption: a common latent variable is shared across data types
● Bayesian methods (Bayesian consensus clustering)
○ Assumption: explicit models of the data distributions and of the
correlations among data sets
● Network-based methods (SNF)
○ Assumption: sample relationships can be strengthened by
complementary multi-omic data
● Multiple Kernel Learning and Multi-Step Analysis (rMKL-LPP)
○ Assumption: patterns lie in a lower-dimensional, integrative
subspace
10
Data Integration for subtype discovery
● Data Source
○ Gene expression; DNA methylation; gene mutation
● Procedures
○ Data fusion -- Clustering -- Evaluation
● Biological interpretation
○ Molecular alterations
○ Survival outcome
○ Response to therapies
11
12
iCluster and iCluster+
13
Procedure
● Data Fusion and K-means model selection
○ EM algorithm to obtain maximum
likelihood estimates
■ E-step provides a simultaneous
dimension reduction
■ M-step is to update the parameter
estimates
● Evaluation
○ Proportion of deviance -- POD (d/n^2)
○ Smaller POD indicates stronger cluster separability
○ Used to determine the cluster number K and the lasso
parameter λ
15
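The POD criterion can be illustrated with a minimal sketch. It assumes, following the d/n^2 notation on the slide, that d is the sum of absolute differences between a sample-by-sample similarity matrix (entries scaled to [0, 1]) and the ideal block structure with ones within clusters and zeros elsewhere. The function name `proportion_of_deviance` and the toy matrices are hypothetical and not taken from the iCluster software.

```python
import numpy as np

def proportion_of_deviance(B, labels):
    """POD = d / n^2, where d sums |B - ideal block structure|.

    B      : (n, n) similarity matrix with entries in [0, 1] (assumed input).
    labels : length-n cluster assignment.
    """
    B = np.asarray(B, dtype=float)
    labels = np.asarray(labels)
    # Ideal structure: 1 if two samples share a cluster, 0 otherwise.
    ideal = (labels[:, None] == labels[None, :]).astype(float)
    d = np.abs(B - ideal).sum()
    return d / B.shape[0] ** 2

# Toy example (hypothetical data): a perfect and a noisy block structure.
labels = np.array([0, 0, 1, 1])
perfect = (labels[:, None] == labels[None, :]).astype(float)
noisy = np.clip(perfect + 0.2 * (1 - 2 * perfect), 0, 1)  # 0.8 inside blocks, 0.2 outside
pod_perfect = proportion_of_deviance(perfect, labels)
pod_noisy = proportion_of_deviance(noisy, labels)
```

In a model-selection loop one would compute this value for each candidate (K, λ) pair and prefer the pair with the smallest POD.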
Application on breast cancer
16
Summaries
● The joint latent variable model is completely scalable to include additional
data types
● iCluster has been applied to discover subtypes in breast cancer and
glioblastoma multiforme (GBM)
● iCluster+ uses a different modeling assumption for each data type: binary,
continuous, categorical, and count (sequencing) data
17
Similarity Network Fusion (SNF)
18
SNF data fusion
1. Calculate sample similarity W in each omic dataset
using (1)
2. Calculate normalized weight matrix P from W using (2)
3. Use K nearest neighbors (KNN) to calculate local
affinity matrix S through the formulas (3) from W. P
carries the full information about the similarity of each
patient to all others whereas S only encodes the
similarity to the K most similar patients for each
patient.
4. Network fusion process: for 2 datasets, P1, S1 and P2,
S2 can be calculated, then iteratively update P1 and P2
for t steps using (4) and (5); for more than 2 datasets,
update the Ps using (5)
5. Obtain the overall fused matrix P by averaging the
updated single Ps
19
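The five fusion steps can be sketched as follows. Formulas (1) to (5) are not reproduced in this transcript, so the sketch makes several simplifying assumptions: a plain Gaussian kernel with one global bandwidth stands in for (1), the K nearest neighbors exclude the sample itself, and the matrices are not renormalized between iterations. All function names are hypothetical; this is not the published SNF implementation.

```python
import numpy as np

def gaussian_affinity(X, sigma=2.0):
    """Step 1 (simplified): sample similarity W from squared Euclidean distances."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def full_kernel(W):
    """Step 2: P has 1/2 on the diagonal; off-diagonal entries of each row sum to 1/2."""
    off = W.copy()
    np.fill_diagonal(off, 0.0)
    P = off / (2.0 * off.sum(axis=1, keepdims=True))
    np.fill_diagonal(P, 0.5)
    return P

def local_affinity(W, k):
    """Step 3: S keeps only each sample's k most similar samples, row-normalized."""
    off = W.copy()
    np.fill_diagonal(off, 0.0)
    idx = np.argsort(-off, axis=1)[:, :k]
    rows = np.arange(W.shape[0])[:, None]
    S = np.zeros_like(off)
    S[rows, idx] = off[rows, idx]
    return S / S.sum(axis=1, keepdims=True)

def snf(views, k=3, t=10):
    """Steps 4-5: diffuse each P through its own S using the other views, then average."""
    Ws = [gaussian_affinity(X) for X in views]
    Ps = [full_kernel(W) for W in Ws]
    Ss = [local_affinity(W, k) for W in Ws]
    m = len(views)  # requires at least two data types
    for _ in range(t):
        updated = []
        for v in range(m):
            others = sum(Ps[u] for u in range(m) if u != v) / (m - 1)
            updated.append(Ss[v] @ others @ Ss[v].T)
        Ps = updated
    return sum(Ps) / m

# Toy example with two hypothetical omic data sets for the same 12 samples.
rng = np.random.default_rng(0)
X1 = rng.normal(size=(12, 5))
X2 = rng.normal(size=(12, 8))
W = gaussian_affinity(X1)
P = full_kernel(W)
S = local_affinity(W, k=3)
fused = snf([X1, X2], k=3, t=5)
```

The fused matrix can then be passed to a graph clustering method such as spectral clustering.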
Spectral Clustering
Input: X (n x n sample similarity matrix) and the number of clusters k
Goal: find subgroups in the graph (ideally, disjoint cliques)
Procedures:
1. Compute the normalized Laplacian L
2. Compute the first k eigenvectors u1, ..., uk of L (those
with the smallest eigenvalues)
3. Let U be the matrix containing eigenvectors u as
columns
4. Form the matrix T from U by normalizing the rows
to norm 1
5. Cluster the points with k-means into clusters C1, ...,
Ck
20
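The procedure above can be sketched with NumPy only. The sketch assumes the symmetric normalized Laplacian L = I - D^(-1/2) X D^(-1/2) and uses a small hand-written k-means with farthest-point initialization; the function name `spectral_clustering` and the toy similarity matrix are hypothetical.

```python
import numpy as np

def spectral_clustering(W, k, n_iter=50):
    W = np.asarray(W, dtype=float)
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Step 1: symmetric normalized Laplacian L = I - D^{-1/2} W D^{-1/2}.
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # Steps 2-3: eigh returns eigenvalues in ascending order, so the first k
    # columns are the eigenvectors with the smallest eigenvalues.
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, :k]
    # Step 4: normalize each row of U to unit length.
    T = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    # Step 5: k-means on the rows of T (farthest-point initialization).
    centers = [T[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((T - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(T[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((T[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = T[labels == j].mean(axis=0)
    return labels

# Toy similarity matrix: two groups of three samples, weakly connected.
W = np.full((6, 6), 0.01)
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
labels = spectral_clustering(W, k=2)
```

On this toy matrix the two nearly disjoint cliques end up in separate clusters.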
Application: GBM subtype discovery
Evaluations:
1. P value from the Cox log-rank test
2. Silhouette score
21
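As an illustration of the second evaluation criterion, a minimal sketch of the mean silhouette score computed from a pairwise distance matrix. It assumes at least two clusters and assigns a score of 0 to singleton clusters by convention; the function name `mean_silhouette` and the one-dimensional toy data are hypothetical.

```python
import numpy as np

def mean_silhouette(D, labels):
    """Mean silhouette over all samples, given an (n, n) distance matrix D."""
    D = np.asarray(D, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    scores = np.zeros(len(labels))
    for i in range(len(labels)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # singleton cluster: silhouette is 0 by convention
        # a: mean distance to the other members of the sample's own cluster.
        a = D[i, own].sum() / (n_own - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

# Toy data: two tight, well-separated groups on a line.
x = np.array([0.0, 0.1, 10.0, 10.1])
D = np.abs(x[:, None] - x[None, :])
good = mean_silhouette(D, [0, 0, 1, 1])
bad = mean_silhouette(D, [0, 1, 0, 1])
```

Values close to 1 indicate compact, well-separated clusters, while negative values suggest that samples sit in the wrong cluster.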
Summaries
● SNF constructs a sample-by-sample similarity network by integrating multiple datasets
● SNF can be extended to include more datasets and applied to other
questions
22
Bayesian Consensus Clustering
● An integrative statistical model that permits a separate clustering of the
objects for each data source.
● These separate clusterings adhere loosely to an overall consensus clustering.
● BCC simultaneously estimates both the consensus clustering and the
source-specific clusterings.
23
Procedures
● Dirichlet mixture model to accommodate multiple data (X)
● Probability of belonging to one cluster
● Estimation
○ Gibbs sampling procedure to approximate the posterior distribution
○ Markov chain Monte Carlo (MCMC) proceeds by iteratively sampling
● Choose K based on highest mean adjusted adherence
24
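The link between consensus and source-specific clusterings can be illustrated with a small simulation. It assumes the dependence structure usually given for BCC: a sample's source-specific label equals its consensus label with probability α (the adherence) and is otherwise drawn uniformly from the other K-1 clusters. It also assumes the adjusted adherence α* = (Kα - 1)/(K - 1), which rescales α to [0, 1]. This is not the Gibbs sampler itself, and the function names are hypothetical.

```python
import numpy as np

def sample_source_labels(consensus, alpha, K, rng):
    """Draw source-specific labels that adhere to the consensus with probability alpha."""
    consensus = np.asarray(consensus)
    keep = rng.random(consensus.size) < alpha
    # A shift in 1..K-1 moves a sample to one of the other K-1 clusters uniformly.
    shift = rng.integers(1, K, size=consensus.size)
    return np.where(keep, consensus, (consensus + shift) % K)

def adjusted_adherence(alpha, K):
    """Rescale alpha so that chance-level adherence (1/K) maps to 0 and 1 maps to 1."""
    return (K * alpha - 1.0) / (K - 1.0)

rng = np.random.default_rng(0)
K = 3
consensus = rng.integers(0, K, size=1000)
labels_tight = sample_source_labels(consensus, 1.0, K, rng)
labels_loose = sample_source_labels(consensus, 0.6, K, rng)
agreement = np.mean(labels_loose == consensus)
```

With α = 1 the source clustering reproduces the consensus exactly; lower values let each data source deviate from it.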
Application on breast cancer
● RNA gene expression (GE) data
for 645 genes.
● DNA methylation (ME) data for
574 probes.
● miRNA expression (miRNA) data
for 423 miRNAs.
● Reverse phase protein array
(RPPA) data for 171 proteins.
25
26
Summaries
1. The BCC model assumes a simple and general dependence between data
sources.
2. BCC models both an overall clustering and a clustering specific to each data
source, with advantages over traditional methods in terms of modeling
uncertainty and the ability to borrow information across sources.
3. BCC is suitable for multisource biomedical data and may also be used
to compare clusterings from different statistical models for a single
homogeneous dataset.
27
Regularized Multiple Kernel Learning Locality Preserving Projections (rMKL-LPP)
● It is an extension of the current multiple kernel learning with dimensional
reduction (MKL-DR) method, where the data are projected into a lower
dimensional and integrative subspace.
● A regularization term is added to avoid overfitting during the optimization
procedure, and it allows using several different kernel types.
● Locality Preserving Projections (LPP) is applied to conserve the
sum of distances for each sample's k-Nearest Neighbors.
28
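The idea of an integrated kernel matrix can be sketched as a weighted sum of several kernel matrices per data type. In rMKL-LPP the weights are learned during the optimization; this sketch simply fixes them, assumes Gaussian kernels with hand-picked bandwidths, and normalizes the weights to sum to 1. The names `gaussian_kernel` and `ensemble_kernel` are hypothetical.

```python
import numpy as np

def gaussian_kernel(X, gamma):
    """Gaussian kernel matrix exp(-gamma * ||x_i - x_j||^2) for the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-gamma * d2)

def ensemble_kernel(kernels, beta):
    """Weighted sum of kernel matrices with non-negative weights scaled to sum to 1."""
    beta = np.asarray(beta, dtype=float)
    if np.any(beta < 0):
        raise ValueError("kernel weights must be non-negative")
    beta = beta / beta.sum()
    return sum(b * K for b, K in zip(beta, kernels))

# Toy example: two hypothetical data types, two bandwidths each.
rng = np.random.default_rng(0)
expression = rng.normal(size=(10, 6))
methylation = rng.normal(size=(10, 4))
kernels = [gaussian_kernel(X, g)
           for X in (expression, methylation)
           for g in (0.05, 0.5)]
K_int = ensemble_kernel(kernels, beta=[1.0, 1.0, 1.0, 1.0])
```

Supplying several bandwidths per data type mirrors the point that the method does not require choosing a single optimal kernel in advance.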
Procedures
● Data fusion
○ rMKL-LPP
○ Optimization
○ integrated kernel matrix
● Clustering
○ K-means
○ Mean silhouette width used to optimize number of clusters
● Evaluation
○ Silhouette score and cross validation (Rand index)
29
Applications in 5 cancers
1. Comparison to state-of-the-art (SNF)
2. Robustness analysis
3. Comparison of clusterings to
established subtypes
4. Clinical implications from clusterings
30
5 cancers
1. glioblastoma multiforme (GBM) --
213 samples
2. breast invasive carcinoma (BIC) --
105 samples
3. kidney renal clear cell carcinoma
(KRCCC) -- 122 samples
4. lung squamous cell carcinoma
(LSCC) -- 106 samples
5. colon adenocarcinoma (COAD) -- 92
samples
Datasets: gene expression, DNA methylation
and miRNA expression data
1. Comparison to state-of-the-art
31
2. Robustness analysis
32
Fig. 2. Robustness of clustering for leave-one-out
datasets measured using Rand index.
Fig. 3. Robustness of clustering for leave-one-out
cross-validation applied to
reduced sized datasets measured using
Rand index.
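The Rand index used in both figures measures the fraction of sample pairs on which two clusterings agree, that is, pairs placed together in both or apart in both. A minimal sketch follows; the function name `rand_index` and the toy labelings are hypothetical.

```python
from itertools import combinations

def rand_index(a, b):
    """Fraction of sample pairs treated the same way by clusterings a and b."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

# Identical partitions with swapped label names still agree on every pair.
same = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Two crossing partitions agree only on the pairs that are split in both.
crossed = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

In the leave-one-out setting, `a` would be the clustering of the whole dataset and `b` the assignment obtained after leaving each patient out once.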
3. Comparison of clusterings to established subtypes
33
4. Clinical implications from clusterings
34
GBM:
● 94 of 213 patients were
treated with
Temozolomide
35
Explain better survival
Summaries
1. rMKL-LPP found subtypes whose log-rank test results compare favorably with
those of the state-of-the-art method (SNF)
2. Using several kernel matrices per data type can improve performance,
removes the burden of selecting the optimal kernel matrix, and gives fairly
stable results
3. Unlike unregularized MKL-DR, rMKL-LPP remains stable even for small
datasets
4. The GBM application shows that rMKL-LPP can capture diverse information
within one clustering
36
References
1. Huang, S., Chaudhary, K. & Garmire, L. X. More Is Better: Recent Progress in Multi-Omics Data
Integration Methods. Front. Genet. 8, 84 (2017).
2. Wang, B. et al. Similarity network fusion for aggregating data types on a genomic scale. Nat.
Methods 11, 333–337 (2014).
3. Shen, R., Olshen, A. B. & Ladanyi, M. Integrative clustering of multiple genomic data types using a
joint latent variable model with application to breast and lung cancer subtype analysis.
Bioinformatics 25, 2906–2912 (2009).
4. Shen, R. et al. Integrative subtype discovery in glioblastoma using iCluster. PLoS One 7, e35236
(2012).
5. Mo, Q. et al. Pattern discovery and cancer gene identification in integrated cancer genomic data.
Proc. Natl. Acad. Sci. U. S. A. 110, 4245–4250 (2013).
6. Speicher, N. K. & Pfeifer, N. Integrating different data types by regularized unsupervised multiple
kernel learning with application to cancer subtype discovery. Bioinformatics 31, i268–75 (2015).
7. Lock, E. F. & Dunson, D. B. Bayesian consensus clustering. Bioinformatics 29, 2610–2616 (2013).
37

Data integration lab_meeting


Editor's Notes

  • #11: The main advantage of Bayesian methods in data integration is that they can make assumptions not only on different types of data sets with various distributions but also on the correlations among data sets.
  • #16: estimating the number of clusters K and the lasso parameter λ.
  • #17: (C) Model selection based on POD measure. A four-cluster sparse solution (λ = 0.2) was chosen.
  • #21: Spectral clustering is suitable for graph clustering
  • #29: It is an extension of the current multiple kernel learning with dimensional reduction (MKL-DR) method MKL-DR: https://pdfs.semanticscholar.org/1cd3/bbae54b217843870fdc771d727b6043225b8.pdf
  • #33: Fig. 2. Robustness of clustering for leave-one-out datasets measured using Rand index. Each patient is left out once in the dimensionality reduction and clustering procedure and afterwards added to the cluster with the closest mean based on the learned projection for this data point, which is given by proj(x_i) = A^T K^(i) β. The resulting cluster assignment is then compared with the clustering of the whole dataset. The error bars represent one standard deviation.
    Fig. 3. Robustness of clustering for leave-one-out cross-validation applied to reduced sized datasets measured using Rand index. For each cancer type, we sampled 20 times half of the patients and applied leave-one-out cross-validation as described in Section 3.4. The error bars represent one standard deviation.
  • #36: The results are very similar to those found by Noushmehr et al. (2010) for their identified G-CIMP positive subtype. In addition, we found the set of underexpressed genes to be highly enriched for processes associated to the immune system and inflammation [cf. Table 3 (column 2)]. Since chronic inflammation is generally related to cancer progression and is thought to play an important role in the construction of the tumor microenvironment (Hanahan and Weinberg, 2011), these downregulations might be a reason for the favorable outcome of patients from this cluster.