Multivariate Analysis and 
Visualization of ProteOmic Data 
Dmitry Grapov, PhD
State of the art facility producing massive 
amounts of biological data… 
>20-30K samples/yr 
>200 studies
Analysis at the ProteOmic Scale and Beyond 
Genomic 
Omic Multi-Omic 
Metabolomic 
Proteomic 
integration
Sample 
Data Analysis and Visualization 
Variable 
Quality Assessment 
• use replicated measurements 
and/or internal standards to 
estimate analytical variance 
Statistical and Multivariate 
• use the experimental design 
to test hypotheses and/or 
identify trends in analytes 
Functional 
• use statistical and multivariate 
results to identify impacted 
biochemical domains 
Network 
• integrate statistical and 
multivariate results with the 
experimental design and 
analyte metadata 
Sample: experimental design 
- organism, sex, age, etc. 
Variable: analyte description and metadata 
- biochemical class, mass spectra, etc.
Data Quality Assessment 
Quality metrics 
•Precision (replicated 
measurements) 
•Accuracy (reference 
samples) 
Common tasks 
•normalization 
•outlier detection 
•missing values 
imputation
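Precision from replicated measurements is often summarized as the percent relative standard deviation (%RSD). The following is a minimal sketch of that calculation; the QC abundances are hypothetical values chosen for illustration, not data from the study.

```python
from statistics import mean, stdev

def percent_rsd(values):
    """Percent relative standard deviation: 100 * sample SD / mean."""
    return 100.0 * stdev(values) / mean(values)

# Hypothetical abundances of one analyte in replicated QC injections.
qc_replicates = [98.0, 102.0, 100.0, 101.0, 99.0]
rsd = percent_rsd(qc_replicates)
print(f"%RSD = {rsd:.2f}")  # prints "%RSD = 1.58"
```

A lower %RSD indicates higher precision for that analyte.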
Batch Effects 
Drift in >400 replicated measurements across >100 analytical batches for a single analyte 
Principal Component 
Analysis (PCA) of all 
analytes, showing QC 
sample scores 
Acquisition batch 
Abundance 
QCs embedded 
among >5,500 
samples (1:10) 
collected over 
1.5 yrs 
If the biological effect 
size is smaller than the 
analytical variance, 
the experiment is likely 
to miss a real effect 
(a false negative)
Analyte specific data quality 
overview 
Sample specific normalization can be used 
to estimate and remove analytical variance 
Raw Data Normalized Data 
Normalizations need to be 
numerically and visually validated 
low precision 
log mean 
%RSD 
high precision 
Samples 
QCs 
Batch Effects
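One simple way to remove batch-level analytical variance and then validate the result numerically is sketched below. It assumes a batch-wise QC-median scaling, which is a stand-in for illustration and not necessarily the normalization used to produce the slides; the run table and variable names are hypothetical.

```python
from statistics import mean, median, stdev

def percent_rsd(values):
    """Percent relative standard deviation: 100 * sample SD / mean."""
    return 100.0 * stdev(values) / mean(values)

# Hypothetical runs for one analyte: (batch, sample type, abundance).
runs = [
    (1, "qc", 100.0), (1, "sample", 120.0), (1, "qc", 102.0),
    (2, "qc", 130.0), (2, "sample", 150.0), (2, "qc", 128.0),
]

qc_all = [x for _, kind, x in runs if kind == "qc"]
target = median(qc_all)  # global QC median used as the reference level

# Per-batch scaling factor: global QC median / batch QC median.
factors = {}
for batch in {b for b, _, _ in runs}:
    batch_qc = [x for b, kind, x in runs if b == batch and kind == "qc"]
    factors[batch] = target / median(batch_qc)

# Apply the batch factor to QCs and samples alike.
normalized = [(b, kind, x * factors[b]) for b, kind, x in runs]

# Numerical validation: QC %RSD should drop after normalization.
rsd_raw = percent_rsd(qc_all)
rsd_norm = percent_rsd([x for _, kind, x in normalized if kind == "qc"])
print(f"QC %RSD raw: {rsd_raw:.1f}, normalized: {rsd_norm:.1f}")
```

The numerical check should be paired with a visual one, for example plotting QC abundance against acquisition batch before and after normalization.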
Outlier Detection 
• 1 variable 
(univariate) 
• 2 variables 
(bivariate) 
• >2 variables 
(multivariate)
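For the univariate case, a common robust screen flags values whose modified z-score (based on the median and the median absolute deviation, MAD) exceeds a cutoff. The sketch below assumes the conventional 3.5 cutoff and uses hypothetical abundances; bivariate and multivariate outliers need other tools, such as scatter plots or PCA scores.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Return indices whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread around the median, so nothing can be flagged
    return [i for i, v in enumerate(values)
            if abs(0.6745 * (v - med) / mad) > threshold]

# Hypothetical abundances for one analyte; index 5 is suspicious.
abundances = [10.1, 9.8, 10.3, 10.0, 9.9, 25.0, 10.2]
flagged = mad_outliers(abundances)
print(flagged)  # prints [5]
```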
bivariate vs. 
multivariate 
(scatter plot) 
outliers? 
mixed up samples 
(PCA scores plot) 
Outlier Detection
Statistical and Multivariate Analyses 
What analytes are different between the 
two groups of samples (Group 1 vs. Group 2)? 
Statistical (t-Test) 
• significant differences lacking rank and context 
Multivariate (O-PLS-DA) 
• ranked differences lacking significance and context 
Statistics + Multivariate + Context = Network Mapping 
• ranked statistically significant differences within a biochemical context
To see the big picture it is necessary to view the data from multiple 
different angles
Statistical Analysis: achieving ‘significance’ 
significance level (α) and power (1-β ) 
effect size (standardized difference in 
means) 
sample size (n) 
Power analyses can be used to 
optimize future experiments 
given preliminary data 
Example: use experimentally 
derived (or literature estimated) 
effect sizes, desired significance level 
(α) and power (1-β) to 
calculate the optimal number of 
samples per group
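The power calculation described above can be sketched with the normal approximation for a two-sided, two-sample comparison of means: n per group ≈ 2(z₁₋α/₂ + z₁₋β)² / d², where d is the standardized effect size. This is an approximation under stated assumptions (equal group sizes and variances); an exact t-test calculation gives slightly larger n for small samples. The function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def samples_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided two-sample comparison."""
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # quantile for the significance level
    z_beta = z.inv_cdf(power)           # quantile for the desired power (1-β)
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

print(samples_per_group(1.0))  # prints 16
print(samples_per_group(0.5))  # prints 63
```

Halving the effect size roughly quadruples the required sample size, which is why preliminary effect-size estimates matter when planning an experiment.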
Statistical Tests 
Poisson normal 
• Should be chosen based on the distribution 
(shape, type) of the data (e.g. normal, negative 
binomial, Poisson) 
• Can be optimized based on data pre-treatment 
(e.g. NSAF; Power Law Global Error 
Model, PLGEM)
False Discovery Rate (FDR) 
• Type I Error: False Positives (α) 
• Type II Error: False Negatives (β) 
• Type I risk = 1 - (1 - p.value)^m 
m = number of variables tested
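The Type I risk formula above can be evaluated directly. This sketch assumes independent tests and treats p.value as the per-test significance threshold (an interpretation, since the slide does not define it); m = 148 is borrowed from the Bonferroni example elsewhere in the deck.

```python
def type_i_risk(p_threshold, m):
    """Chance of at least one false positive across m independent tests."""
    return 1 - (1 - p_threshold) ** m

risk_1 = type_i_risk(0.05, 1)      # a single test: the threshold itself
risk_148 = type_i_risk(0.05, 148)  # 148 tests at the same threshold
print(f"1 test: {risk_1:.3f}, 148 tests: {risk_148:.4f}")
```

With 148 tests at a 0.05 threshold, at least one false positive is close to certain, which motivates the adjustment methods that follow.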
False Discovery Rate Adjustment 
FDR adjusted p-value 
p-value 
Benjamini & 
Hochberg (1995) 
(“BH”) 
•Accepted standard 
Bonferroni 
•Very conservative 
•adjusted p-value = 
p-value x # of tests 
(e.g. 0.005 x 148 = 0.74 )
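Both adjustments are short enough to write out. The sketch below implements the Bonferroni correction and the Benjamini-Hochberg step-up adjustment in plain Python; the p-values are hypothetical, and in practice an established routine such as R's `p.adjust` would normally be used.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """BH adjusted p-values, returned in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

pvals = [0.01, 0.04, 0.03, 0.005]  # hypothetical raw p-values
bh = benjamini_hochberg(pvals)     # approximately [0.02, 0.04, 0.04, 0.02]
bonf = bonferroni(pvals)           # approximately [0.04, 0.16, 0.12, 0.02]
print(bh, bonf)
```

Bonferroni adjusted values are never smaller than the BH values, which reflects how conservative it is.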
Functional Analysis 
Identify changes or enrichment in biochemical domains 
• decrease 
• increase 
Nucl. Acids Res. (2008) 36 (suppl 2): W423-W426. doi: 10.1093/nar/gkn282
Functional Analysis: Enrichment 
Biochemical Pathway Biochemical Ontology
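Enrichment of a pathway or ontology term is commonly scored with a one-sided hypergeometric (over-representation) test. The slides do not name the test used, so this is an assumed, generic sketch; the counts are hypothetical.

```python
from math import comb

def enrichment_pvalue(n_total, n_in_pathway, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected analytes are drawn at random."""
    upper = min(n_in_pathway, n_selected)
    tail = sum(
        comb(n_in_pathway, k) * comb(n_total - n_in_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(n_total, n_selected)

# Hypothetical: 300 measured proteins, 20 annotated to a pathway,
# 30 significantly changed, 8 of which fall in the pathway.
p_enrich = enrichment_pvalue(300, 20, 30, 8)
print(f"enrichment p-value = {p_enrich:.2e}")
```

Because many pathways are tested at once, these p-values also need FDR adjustment.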
Common Multivariate Methods 
Clustering 
Projection 
Networks
Artist: Chuck Close 
Cluster Analysis 
Useful for 
•pattern recognition 
•complexity reduction 
Common Methods 
•Hierarchical 
•Model based 
•Other (k-means, k-NN, PAM, 
fuzzy) 
Linkage k-means 
Distribution Density
Hierarchical Clustering 
Similarity 
x 
x 
x 
x 
Dendrogram 
How does my metadata 
match my data structure?
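The dendrogram is built by repeatedly merging the two most similar clusters. A minimal sketch of agglomerative clustering follows, assuming Euclidean distance and average linkage (one of several possible choices); the sample coordinates are hypothetical.

```python
from math import dist

def hierarchical_clustering(points):
    """Average-linkage clustering; returns merges as (left, right, distance)."""
    clusters = [[i] for i in range(len(points))]
    merges = []

    def linkage(a, b):
        # Mean pairwise Euclidean distance between members of two clusters.
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        d, ia, ib = min(
            (linkage(a, b), ia, ib)
            for ia, a in enumerate(clusters)
            for ib, b in enumerate(clusters) if ia < ib
        )
        merges.append((clusters[ia], clusters[ib], d))
        merged = clusters[ia] + clusters[ib]
        clusters = [c for k, c in enumerate(clusters) if k not in (ia, ib)]
        clusters.append(merged)
    return merges

# Hypothetical samples measured on two analytes: two tight groups.
samples = [(1.0, 1.1), (1.2, 0.9), (5.0, 5.2), (5.1, 4.9)]
merges = hierarchical_clustering(samples)
for left, right, height in merges:
    print(left, right, round(height, 2))
```

Each merge distance is the height of the corresponding branch in the dendrogram; comparing the resulting groups with sample metadata shows whether the metadata matches the data structure.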
Projection Methods 
single analyte all analytes 
The algorithm defines the position of the light source 
Principal Components Analysis (PCA) 
• unsupervised 
• maximize variance (X) 
Partial Least Squares Projection to 
Latent Structures (PLS) 
• supervised 
• maximize covariance (Y ~ X) 
James X. Li, 2009, VisuMap Tech.
Interpreting scores and loadings 
loadings represent how variables 
contribute to sample scores 
variables with the highest loadings have the 
greatest contribution to sample scores 
loadings 
scores 
Scores represent 
dis/similarities in samples 
based on all variables
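The relationship between scores and loadings can be made concrete with a small PCA sketch. It extracts only the first principal component by power iteration on the covariance matrix, a simplification chosen to avoid external libraries; the data matrix is hypothetical and no scaling is applied.

```python
from math import sqrt

def first_principal_component(data, iterations=200):
    """Return (loadings, scores) for PC1 of a samples x variables matrix."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in data]
    # Sample covariance matrix of the variables.
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    loadings = [1.0 / sqrt(p)] * p  # starting direction
    for _ in range(iterations):
        v = [sum(cov[a][b] * loadings[b] for b in range(p)) for a in range(p)]
        norm = sqrt(sum(x * x for x in v))
        loadings = [x / norm for x in v]
    # Scores: projection of each centered sample onto the loadings.
    scores = [sum(r[j] * loadings[j] for j in range(p)) for r in centered]
    return loadings, scores

# Hypothetical data: 5 samples x 3 variables (variable 3 barely varies).
data = [[1.0, 2.1, 0.5], [2.0, 3.9, 0.4], [3.0, 6.2, 0.6],
        [4.0, 8.1, 0.5], [5.0, 9.8, 0.5]]
loadings, scores = first_principal_component(data)
print([round(x, 3) for x in loadings])
print([round(x, 3) for x in scores])
```

Here the second variable has the largest loading because it varies the most, so it contributes most to the separation of samples along the first component.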
Networks 
Biochemical 
•interaction 
• enrichment 
•etc 
Empirical (dependency) 
•correlation 
•partial-correlation 
•clustering 
variable 1 
variable 2 
variable 3
Enrichment Network 
Mapping of parents through children
Interaction Networks
Empirical Networks 
• Correlation based networks (CN) 
(simple, tendency to hairball) 
• GGM or partial correlation based 
networks (advanced, preference 
of direct over indirect 
relationships) 
• *Increase in robustness with 
sample size 
10.1007/978-1-4614-1689-0_17
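The difference between correlation and partial correlation networks can be illustrated with three variables. In this sketch two variables are both driven by a third, so their plain correlation is high while the partial correlation, computed with the standard first-order formula, is near zero. The simulated data and seed are arbitrary assumptions for illustration.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def partial_correlation(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

random.seed(0)
z = [random.gauss(0, 1) for _ in range(500)]  # shared driver (variable 3)
x = [v + random.gauss(0, 0.3) for v in z]     # variable 1 depends on z
y = [v + random.gauss(0, 0.3) for v in z]     # variable 2 depends on z

r_xy = pearson(x, y)
r_xy_given_z = partial_correlation(x, y, z)
print(f"correlation: {r_xy:.2f}, partial correlation: {r_xy_given_z:.2f}")
```

A correlation network would draw an edge between variables 1 and 2, while a partial-correlation network would keep only their edges to variable 3, which is how it reduces the hairball.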
Proteomic Case Study: Diabetes Markers 
• Small sample size (control = 12, GDM = 6); covariates (time of sample collection) 
• >600 measured colostrum proteins; ~300 NSAF normalized proteins retained 
• Multivariate classification with O-PLS-DA used to identify variables to test using 
PLGEM with correction for FDR 
• Partial-correlation protein-protein interaction network analysis
DeviumWeb 
https://github.com/dgrapov/DeviumWeb 
• visualization 
• statistics 
• clustering 
• PCA 
• O-PLS
Software and Resources 
• DeviumWeb: dynamic multivariate data analysis and visualization platform 
  url: https://github.com/dgrapov/DeviumWeb 
• imDEV: Microsoft Excel add-in for multivariate analysis 
  url: http://sourceforge.net/projects/imdev/ 
• MetaMapR: network analysis tools for metabolomics 
  url: https://github.com/dgrapov/MetaMapR 
• TeachingDemos: tutorials and demonstrations 
  url: http://sourceforge.net/projects/teachingdemos/?source=directory 
  url: https://github.com/dgrapov/TeachingDemos 
• Data analysis case studies and examples 
  url: http://imdevsoftware.wordpress.com/
Questions? 
dgrapov@ucdavis.edu 
This research was supported in part by NIH 1 U24 DK097154
