The sbv IMPROVER species translation challenge
Sometimes you can trust a rat
Sahand Hormoz, Adel Dayarian
KITP, UC Santa Barbara
Gyan Bhanot
Rutgers Univ.
Michael Biehl
University of Groningen
Johann Bernoulli Institute
www.cs.rug.nl/biehl
m.biehl@rug.nl
sbv IMPROVER species translation challenge
systems biology verification (sbv) combined with the Industrial Methodology for Process Verification in Research (IMPROVER)
IBM Research, Yorktown Heights
Philip Morris International Research and Development
www.sbvimprover.com
protein phosphorylation
reversible protein phosphorylation
addition or removal of a phosphate group
alters shape and function of proteins
protein phosphorylation
chemical stimuli → reversible protein phosphorylation → gene expression
addition or removal of a phosphate group alters shape and function of proteins
chemical stimuli act on a complex network (only an incomplete snapshot is known);
measured: phosphorylation status and gene expression changes (Δ)
www.sbvimprover.com
challenge data
• normal bronchial epithelial cells, derived from human and rat
• 52 different chemical stimuli (26 in set A + 26 in set B), additional controls
• phosphorylation status measured after 5 minutes and 25 minutes
• gene expression measured after 6 hours

phosphorylation:
• rather low noise levels
• subtract control, take the median of replicates
• activation as defined by the challenge organizers: abs(P) > 3 at 5 min or at 25 min
• ~10% positive examples

gene expression (microarray):
• noisy data
• correct for saturation effects
• N = 20110 genes (human), N = 13841 genes (rat)
challenge set-up and goals (www.sbvimprover.com)
1  intra-species prediction of phosphorylation from gene expression
2  predict the response in human using data available for rat cells
3  predict the gene expression response across species
intra-species phosphorylation prediction
sub-challenge 1
combination of two approaches:
• voter method
gene selection based on mutual information
• machine learning analysis
Principal Components representation +
Linear Discriminant Analysis
• weighted combination
based on Leave-One-Out cross validation
voter method
binarize data by thresholding
gene expression: G=1 if p < 0.01 (p-value for differential expression)
phosphorylation : P=1 if abs(P) > 3 (@5min. or @25 min.)
for all pairs of genes and proteins:
calculate the separate and joint entropies H(G), H(P), H(G,P)
from the frequencies over the stimuli, and from these the mutual information
I(G;P) = H(G) + H(P) - H(G,P)
assumption: a high I(G;P) indicates that the gene is predictive for the
corresponding protein status
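As an illustration only (not the authors' code), the mutual information of a binarized gene/protein pair can be estimated from the joint frequencies over the stimuli roughly as follows; the toy indicator vectors G and P below are made up.

```python
import numpy as np

def mutual_information(g, p):
    """I(G;P) = H(G) + H(P) - H(G,P) for two binary indicator vectors,
    estimated from their (joint) frequencies over the stimuli."""
    g, p = np.asarray(g, bool), np.asarray(p, bool)
    mi = 0.0
    for gv in (False, True):
        for pv in (False, True):
            p_joint = np.mean((g == gv) & (p == pv))
            p_g, p_p = np.mean(g == gv), np.mean(p == pv)
            if p_joint > 0:                       # 0 * log 0 = 0 by convention
                mi += p_joint * np.log2(p_joint / (p_g * p_p))
    return mi

# toy indicators over 26 stimuli:
# G = 1 if differential expression has p < 0.01, P = 1 if |P| > 3 at 5 or 25 min
G = np.array([1,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0], bool)
P = np.array([1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0], bool)
print(mutual_information(G, P))   # high I -> gene considered predictive for the protein
```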
example:
SYNPR level predictive of AKT1 activation
green = significant phosphorylation
red = significant gene expression
SYNPR under-expressed → AKT1 phosphorylated

voter method
for each protein:
- determine a set of most predictive genes (varying number, ~30-70)
- vote according to the presence of significant gene expressions
relative frequency of positive votes determines a certainty score in [0,1]
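A minimal sketch of the voting step just described. How a vote is cast from the sign of a significant expression change is an assumption of this sketch (the slides only state that the selected genes vote according to the presence of significant expression); the gene names other than SYNPR are hypothetical.

```python
import numpy as np

def voter_certainty(signed_significance, predictive_genes, activation_sign):
    """Certainty score in [0,1] for one protein and one stimulus:
    relative frequency of positive votes among the selected genes.
    signed_significance : dict gene -> +1 / -1 / 0 for this stimulus
                          (significantly up / down / not significant)
    predictive_genes    : genes selected for this protein (high mutual information)
    activation_sign     : dict gene -> +1 / -1, the expression direction that was
                          associated with protein activation in the training data
    """
    votes = [signed_significance.get(g, 0) == activation_sign[g] for g in predictive_genes]
    return float(np.mean(votes))

# hypothetical example in the spirit of "SYNPR under-expressed -> AKT1 phosphorylated"
genes = ["SYNPR", "GENE_X", "GENE_Y"]
signs = {"SYNPR": -1, "GENE_X": +1, "GENE_Y": +1}
print(voter_certainty({"SYNPR": -1, "GENE_X": +1}, genes, signs))   # 2 of 3 genes vote -> 0.67
```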
Leave-One-Out (L-1-O) validation:
consider mutual information only over 25 stimuli, predict the 26th
performance estimate with respect to predicting novel data
voter method prediction
• voting schemes obtained from the examples in A, applied to the 26 new
  stimuli of data set B: 416 predictions w.r.t. data set B
• certainties in [0,1], averaged over the 26 L-1-O runs
(heat map of the certainties, proteins 1-16 × stimuli 27-52)
machine learning approach
low-dimensional representation of gene expression data
• omit all genes with zero variation or only insignificant (p > 0.05)
  expression values over all 26 training stimuli (13841 → 6033 genes)
• Principal Component Analysis (PCA) (pcascat, www.mloss.org, c/o Marc Strickert)
  - error-free representation of all data is possible with at most 52 PCs
  - here: use only the k ≤ 22 leading PCs (removes small variations due to noise)
• Linear Discriminant Analysis (LDA) (Matlab, Statistics: classify)
- identifies discriminative directions in k-dim. space
based on within-class and between-class variation
- probabilistic output provided, interpreted as certainty score
- if all training examples negative, score 0 is assigned
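The original analysis used pcascat and Matlab's classify; the sketch below reproduces the same PCA + LDA pipeline with scikit-learn purely for illustration (toy data and variable names are made up).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def fit_pca_lda(X, y, k):
    """Project the expression profiles X (stimuli x genes) onto the k leading
    principal components and fit LDA on the projections.  Returns a function
    that maps new profiles to a certainty score in [0, 1]."""
    pca = PCA(n_components=k).fit(X)
    if not np.any(y):                                  # all training examples negative:
        return lambda X_new: np.zeros(len(X_new))      # assign score 0 (as on the slide)
    lda = LinearDiscriminantAnalysis().fit(pca.transform(X), y)
    return lambda X_new: lda.predict_proba(pca.transform(X_new))[:, 1]

# toy data: 26 training stimuli x 6033 filtered genes, one target protein
rng = np.random.default_rng(0)
X = rng.normal(size=(26, 6033))
y = np.zeros(26, dtype=bool)
y[[3, 8, 17]] = True                                   # ~10% positive examples
score = fit_pca_lda(X, y, k=10)
print(score(rng.normal(size=(2, 6033))))               # certainties for two new stimuli
```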
machine learning approach
• Leave-One-Out procedure with varying number k of PC projections
for each of the 16 target proteins
for k=1:22
- repeat 26 times: LDA based on 25 stimuli, predict the 26th
yields probabilistic prediction 0 ≤ c(k) ≤ 1 (crisp threshold 0.5)
- compute the Matthews Correlation Coefficient (0 ≤ mcc ≤ 1)
- determine the number of false positives (fp), true positives (tp),
false negatives (fn), true negatives (tn)
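A sketch of the loop just described, reusing fit_pca_lda from the previous snippet. Clipping negative MCC values to 0 (so that 0 ≤ mcc ≤ 1, as on the slide) and the weighted average quoted in the final comment are plausible readings, not a verbatim reproduction of the original scripts.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

def loo_mcc_per_k(X, y, k_max=22):
    """For each k = 1..k_max: leave-one-out over the 26 stimuli with the
    PCA+LDA scorer (fit_pca_lda from the sketch above), record the held-out
    certainties c(k) and the MCC of the crisp predictions (threshold 0.5)."""
    n = len(y)
    C = np.zeros((k_max, n))
    mcc = np.zeros(k_max)
    for k in range(1, k_max + 1):
        for i in range(n):
            train = np.arange(n) != i
            C[k - 1, i] = fit_pca_lda(X[train], y[train], k)(X[i:i + 1])[0]
        mcc[k - 1] = max(matthews_corrcoef(y, C[k - 1] > 0.5), 0.0)   # clip to [0, 1]
    return C, mcc

# one plausible form of the protein-specific, mcc-weighted certainty (next slide):
#   c_final = sum_k mcc(k) * c(k) / sum_k mcc(k)
```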
machine learning approach
• perform a protein-specific, mcc-weighted average over k to obtain the certainties
• prediction: apply to the test set (B)
(heat maps of the resulting certainties and of the binarized predictions,
proteins × stimuli 27-52)
machine learning approach
• for fair comparison with voter method:
Nested Leave-One-Out procedure
for each protein, repeat 26 times:
L-1-O using 24 out of 25 stimuli, varying k
mcc-weighted prediction for the 26th stimulus
• averaged certainties as weighted means
(unweighted mean if both mcc=0)
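A sketch of the nested procedure, building on the two previous snippets; the fallback to an unweighted mean when all mcc(k) vanish follows the remark above.

```python
import numpy as np

def nested_loo_prediction(X, y, k_max=22):
    """Nested Leave-One-Out: hold out each of the 26 stimuli once; on the
    remaining 25, run the inner L-1-O (loo_mcc_per_k) to obtain mcc(k), then
    make an mcc-weighted prediction for the held-out stimulus."""
    n = len(y)
    pred = np.zeros(n)
    for i in range(n):
        train = np.arange(n) != i
        _, mcc = loo_mcc_per_k(X[train], y[train], k_max)          # inner L-1-O: 24 of 25
        c_k = np.array([fit_pca_lda(X[train], y[train], k)(X[i:i + 1])[0]
                        for k in range(1, k_max + 1)])
        pred[i] = np.average(c_k, weights=mcc) if mcc.sum() > 0 else c_k.mean()
    return pred
```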
combined prediction
(weighted combination of the voter and LDA certainties; heat map of the
combined certainties, proteins 1-16 × stimuli 27-52)
Scores and ranks of the 21 participating teams (© 2013 sbv IMPROVER, PMI and IBM)

AUPR: area under the precision-recall curve
Pearson: Pearson correlation between predictions and the binarized gold standard
BAC: balanced accuracy

Team      AUPR  Pearson  BAC
Team_75   0.38  0.72     0.72
Team_49   0.42  0.71     0.69
Team_50   0.38  0.72     0.68
Team_93   0.37  0.70     0.61
Team_111  0.35  0.64     0.67
Team_61   0.35  0.68     0.60
Team_89   0.31  0.65     0.65
Team_112  0.29  0.63     0.66
Team_116  0.27  0.62     0.59
Team_64   0.23  0.59     0.58
Team_90   0.24  0.59     0.56
Team_100  0.23  0.60     0.56
Team_78   0.28  0.56     0.55
Team_72   0.15  0.55     0.58
Team_105  0.19  0.56     0.53
Team_82   0.14  0.55     0.55
Team_106  0.13  0.53     0.55
Team_71   0.14  0.49     0.45
Team_52   0.13  0.49     0.46
Team_84   0.10  0.48     0.49
Team_99   0.07  0.43     0.50

(bar chart of the sum of ranks over AUPR, Pearson and BAC per team, better rank
to the left; 3 teams are separated from the rest, statistically significant at FDR < 0.05)

the two individual approaches, evaluated on the same data:
LDA      0.34  0.71  0.67
voting   0.40  0.67  0.65

→ the combination improved the performance!
inter-species phosphorylation prediction
sub-challenge 2
sub-challenge 2 set-up (www.sbvimprover.com)
restrict ourselves to the use of phosphorylation data only
reasoning: the immediate response to stimuli should be comparable between species
data
for stimuli 1-26 (set A), both the rat phosphorylation data (ratP) and the
human phosphorylation data (humP) are known; for stimuli 27-52 (set B), only
ratP is available and the task is to predict, for each of the 16 proteins,
whether |humP| > 3.
assume similar activation in both species: “human ≈ rat”
naïve prediction
prediction score, corresponding to threshold 3 for activation
- precise (monotonic!) form is irrelevant for ROC, PR etc.
- threshold 0.5 for crisp classification
- here: scaling factor yields values well-spread in [0,1]
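The exact scoring formula is not reproduced here; a minimal illustrative choice consistent with the bullet points above (monotonic in |ratP|, crossing 0.5 at the activation threshold 3, spread over [0,1] by a scale factor) could look like this:

```python
import numpy as np

def naive_certainty(rat_p, scale=1.0):
    """Illustrative monotonic score in [0, 1] that equals 0.5 exactly at
    |ratP| = 3 (the activation threshold); the logistic form and the scale
    factor are choices of this sketch, not the original formula."""
    return 1.0 / (1.0 + np.exp(-scale * (np.abs(rat_p) - 3.0)))

print(naive_certainty(np.array([0.4, 3.0, 7.5])))   # below 0.5, exactly 0.5, above 0.5
```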
naïve prediction: ROC curve (sensitivity vs. 1-specificity) with respect to the
full panel (416 predictions) of |humP| > 3, AUC ≈ 0.83
naïve prediction: color-coded certainty for |humP| > 3 in data set B
(proteins 1-16 × stimuli 27-52)
machine learning approach
for stimuli 1-26 (set A), use the 16-dim. vectors of rat phosphorylation values
(ratP) as inputs and the known human activations (|humP| > 3 ?) as targets;
this gives 16 separate binary classification problems, one per protein.
for stimuli 27-52 (set B), predict |humP| > 3 from ratP.
LVQ prediction
LVQ1, one prototype per class
nearest-prototype classification: a new sample is assigned the class of the
closest prototype (here: 16-dim. data)
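A compact LVQ1 sketch for 16-dimensional phosphorylation vectors; the toy data, the learning rate, the number of epochs and the class-mean initialization are choices of this sketch.

```python
import numpy as np

def train_lvq1(X, y, eta=0.05, epochs=100, seed=0):
    """LVQ1 with one prototype per class: the nearest (winning) prototype is
    moved towards a correctly classified sample and away from it otherwise."""
    rng = np.random.default_rng(seed)
    w = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])         # class-mean initialization
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            j = int(np.argmin(np.linalg.norm(w - X[i], axis=1)))   # winner
            sign = 1.0 if j == y[i] else -1.0
            w[j] += sign * eta * (X[i] - w[j])
    return w

def nearest_prototype(w, x):
    """Nearest-prototype classification: class of the closest prototype."""
    return int(np.argmin(np.linalg.norm(w - x, axis=1)))

# toy usage: 26 stimuli, 16-dim. rat phosphorylation vectors, one target protein
rng = np.random.default_rng(1)
X = rng.normal(size=(26, 16))
y = np.array([1] * 5 + [0] * 21)
X[y == 1] += 2.0                      # make the positive class separable in the toy data
w = train_lvq1(X, y)
print(nearest_prototype(w, X[0]), y[0])
```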
LVQ prediction
prediction score / certainty for activation:
- the precise (monotonic!) form is irrelevant for ROC, PR etc.
- crisp classification at threshold 0.5
- here: a scaling factor yields a range of values similar to the naïve prediction
validation: 26 Leave-One-Out training processes:
split data set A into 25 training samples / 1 test sample
(if the training set is all negative: accept the naïve prediction)
prediction: ensemble average of the certainties over the 26 LVQ systems
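A per-protein sketch of the ensemble just described, reusing train_lvq1 and naive_certainty from the earlier snippets; turning prototype distances into a certainty via a relative-distance score is an assumption of this sketch.

```python
import numpy as np

def lvq_ensemble_certainty(rat_A, act_A, rat_B, p):
    """Certainties for one protein p on the set-B stimuli: average over the 26
    L-1-O LVQ1 systems trained on set A; an all-negative training set falls
    back to the naïve prediction, as described above."""
    n = len(act_A)
    scores = []
    for i in range(n):
        keep = np.arange(n) != i
        if not act_A[keep].any():                         # all training examples negative
            scores.append(naive_certainty(rat_B[:, p]))   # accept the naïve prediction
            continue
        w = train_lvq1(rat_A[keep], act_A[keep].astype(int))
        d = np.linalg.norm(rat_B[:, None, :] - w[None, :, :], axis=2)   # distances to w0, w1
        scores.append(d[:, 0] / (d[:, 0] + d[:, 1]))      # closer to the positive prototype -> higher
    return np.mean(scores, axis=0)
```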
LVQ prediction: ROC curve (sensitivity vs. 1-specificity) with respect to the
full panel (416 predictions) of |humP| > 3, obtained in the Leave-One-Out
validation scheme, AUC ≈ 0.88
(for comparison, naïve prediction on the same panel: AUC ≈ 0.83)
combined prediction: weighted average of the naïve and LVQ certainties, weighted
by the protein-specific performance (AUROC)
(heat maps of the certainties, proteins 1-16 × stimuli 27-52)
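A sketch of this combination step; using the raw per-protein AUROC values obtained on the validation data of set A as weights is an assumption of this sketch.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def combine_by_auroc(act_A, naive_A, lvq_A, naive_B, lvq_B):
    """Per-protein weighted average of the naïve and the LVQ certainties,
    weighted by their AUROC on set A (requires both classes in act_A)."""
    w_naive = roc_auc_score(act_A, naive_A)
    w_lvq = roc_auc_score(act_A, lvq_A)
    return (w_naive * np.asarray(naive_B) + w_lvq * np.asarray(lvq_B)) / (w_naive + w_lvq)
```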
combined prediction: color-coded certainty for |humP| > 3 in data set B
(proteins 1-16 × stimuli 27-52)
sub-challenge 2 results (AUPR / Pearson / BAC / rank):
naïve (rat)  0.45  0.74  0.79   1
LVQ          0.37  0.69  0.76   3
→ naïve scheme: best individual prediction
  (the L-1-O advantage of LVQ was not confirmed in the test set)
→ combination improves the performance!
→ confirmed in the "wisdom of the crowd" analysis
Classifier Methods for SC2 (© 2013 sbv IMPROVER, PMI and IBM)
(team: classifier; feature selection; rank)
Team_50:  Learning Vector Quantization (LVQ1) + naïve approach; NA; rank 1
Team_111: neural network (13489 inputs, 1000 hidden sigmoid units, 32 outputs); rank 2
Team_49:  LDA; rank proteins by moderated t-test p-values, threshold, cross-validate; rank 3
Team_61:  linear fit; PCA; rank 4
Team_52:  least absolute regression model (LBE); NA; rank 5
Team_93:  random forest; predict the activation matrix of 7 proteins, use it for the remaining 9; rank 6
Team_89:  SVM with radial basis kernel and RF; Biogrid, STRING; rank 7
inter-species
pathway perturbation
prediction
sub-challenge 3
additional data / domain knowledge
1) mapping of rat genes to human orthologs
   HGNC Comparison of Ortholog Predictions (HCOP), www.genenames.org/cgi-bin/hcop.pl
2) annotation of gene sets representing known pathways and function:
   246 gene sets from the C2 CP collection (Broad Institute),
   www.broadinstitute.org/gsea/msigdb/genesets.jsp?collection=CP
3) gene set enrichment analysis, www.broadinstitute.org/gsea/index.jsp
   NES: normalized enrichment score, representing expression
   FDR: false discovery rate, i.e. statistical significance; threshold: FDR < 0.25
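A sketch of how the inputs for the next step could be assembled; the file and column names are hypothetical, and HCOP and the Broad GSEA tool are assumed to have been run separately with their results exported as tab-separated tables.

```python
import pandas as pd

# 1) map rat genes to human orthologs before running GSEA with the human C2 CP sets
orthologs = pd.read_csv("hcop_rat_human.tsv", sep="\t")       # columns: rat_symbol, human_symbol
rat2hum = dict(zip(orthologs["rat_symbol"], orthologs["human_symbol"]))
rat_expr = pd.read_csv("rat_expression_setA.tsv", sep="\t", index_col=0)   # genes x stimuli
rat_expr = rat_expr.rename(index=rat2hum)

# 3) collect the GSEA output: NES as features, binarized human FDR (< 0.25) as targets
gsea_rat = pd.read_csv("gsea_rat_setA.tsv", sep="\t")         # columns: gene_set, stimulus, NES, FDR
gsea_hum = pd.read_csv("gsea_human_setA.tsv", sep="\t")
X = gsea_rat.pivot(index="stimulus", columns="gene_set", values="NES")     # 26 x 246
Y = gsea_hum.pivot(index="stimulus", columns="gene_set", values="FDR") < 0.25
```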
gene sets with FDR < 0.25 over the stimuli of set A, rat vs. human:
frequent observation: negative correlations between significant rat and human
gene sets (biology? data (pre-)processing?)
machine learning approach
training data: 26 stimuli in rat data set A, as 246-dim. vectors of rat NES;
targets: binarized human FDR (< 0.25 ?), i.e. 246 classification problems
• PCA for dimension and noise reduction:
  rat gene set data A and B represented by k (≤ 52) projections
• LDA: linear classifier using the k projections as features (probabilistic output)
• Leave-One-Out validation: determine the optimal k from data set A
• use k = 8 to make predictions for data set B (averaged over the 26 L-1-O runs)
human gene set prediction: final prediction, color-coded certainties
(gene sets × stimuli 27-52)
Team scores and ranks (© 2013 sbv IMPROVER, PMI and IBM)

Team      AUPR  Pearson  BAC   rank
Team_50   0.19  0.59     0.54  1
Team_133  0.12  0.54     0.54  2
Team_49   0.12  0.53     0.53  3
Team_52   0.10  0.52     0.54  4
Team_131  0.11  0.50     0.52  5
Team_105  0.11  0.52     0.51  6
Team_111  0.06  0.41     0.43  7

(bar chart of the sum of ranks over AUPR, Pearson and BAC per team, better rank
to the left; significant at FDR ≤ 0.01)

Aggregation of results, "The Wisdom of Crowds": bar chart of the sum of ranks
for the individual teams and for aggregated predictions (top_k, all_teams),
shown in the order Team_50, top_2, top_4, top_3, top_5, Team_133, Team_49,
top_6, Team_52, all_teams, Team_131, Team_105, Team_111; the best individual
team and the all-teams aggregation are highlighted in the chart.
summary
• sc-1: intra-species prediction of phosphorylation
  gene expression is predictive for the phosphorylation status
• sc-3: inter-species prediction of gene sets
  weakly predictive; presence of negative correlations
  between rat and human genes and gene sets
• sc-2: inter-species prediction of phosphorylation
  rat phosphorylation is predictive for the human cell response
outlook
• more sophisticated learning schemes / classifiers
e.g. feature weighting schemes, Matrix Relevance LVQ
• ‘joint’ predictions of protein or gene set tableaus
e.g. predict 1 protein from 16 + 15 values in set A
two-step procedure for set B
• include gene expression in sub-challenge 2
• investigate difficult-to-predict proteins / gene sets
• infer and enhance network models from experimental data
ongoing, new challenge (runs until February 2014):
Network Verification Challenge (NVC)
www.sbvimprover.com
take-home messages
• teamwork works (and Skype is great)
• in case of doubt: PCA
• the smaller the data set, the simpler the method
• committees can be useful!
• if you have won the rat race, you might be a rat