1
Application of a Selective Gaussian
Naïve Bayes Model for Diffuse-Large
B-Cell Lymphoma Classification
A. Cano, J. García, A. Masegosa, S. Moral
Dpt. Computer Science and Artificial Intelligence
University of Granada. Spain.
E-mail: {acu,fjgc,andrew,smc}@decsai.ugr.es
2
Diffuse Large B-Cell Lymphoma
Diffuse Large B-Cell Lymphoma (DLBCL), the most common subtype of non-Hodgkin's lymphoma, has long been enigmatic in that 40 per cent of patients can be cured by combination chemotherapy, whereas the remainder succumb to the disease.
Alizadeh et al (2000) discovered, using gene expression profiling, that DLBCL actually comprises two different diseases that are indistinguishable by current diagnostic methods. One subtype of DLBCL, termed germinal center B-like DLBCL (GCB), has a high survival index, whereas the other, termed activated B-cell like (ABC), has a low survival index.
The gene expression profiling was carried out with a specialized cDNA microarray, 'the Lymphochip', which allows the expression of thousands of genes that are preferentially expressed in lymphoid cells to be quantified in parallel.
3
cDNA Microarray
This is a typical image of fluorescent hybridization on a cDNA microarray. Each row represents a separate cDNA clone on the microarray and each column a separate mRNA sample. In this case, there are 96 samples and 4096 cDNA clones.
The ratios are a measure of relative gene expression in each experimental sample. As indicated, the scale extends from fluorescence ratios of 0.25 to 4 (-2 to +2 in log base 2 units). Grey indicates missing or excluded data.
In these data sets, there are many missing or excluded values.
4
Diffuse Large B-Cell Lymphoma
Classification
After Alizadeh et al (2000), there are three important approaches to the classification of Diffuse Large B-Cell Lymphoma:
– Rosenwald et al (2002): built a new database with more cases (274). Alizadeh et al had only 42 cases of Diffuse Large B-Cell Lymphoma.
– Wright et al (2003): found 27 genes to classify GCB versus ABC.
– Lossos et al (2004): found only 6 genes to estimate the survival index of a patient.
5
Rosenwald et al (2002)
Andreas Rosenwald et al. 2002. The use of molecular profiling to
predict survival after chemotherapy for diffuse large-B-cell
lymphoma. New England Journal of Medicine, 346:1937–1947,
June.
Biopsy samples of diffuse large-B-cell lymphoma from 240 patients were examined for gene expression with the aid of DNA microarrays.
Using hierarchical clustering, a new subclass of DLBCL, called Type III, with an intermediate probability of survival, was found.
They construct a predictor of overall survival after chemotherapy based on a linear combination of four signatures. A signature is a biological group of genes.
6
Wright et al (2003)
Wright et al. 2003. A gene expression-based method to diagnose
clinically distinct subgroups of diffuse large b cell lymphoma.
Proceedings of National Academy of Sciences of the United States of
America, 100:9991–9996, August.
A Bayesian predictor is proposed that estimates the probability of membership in one of the two cancer subgroups (GCB or ABC), using the data set of Rosenwald et al (2002).
Gene Expression Data: http://llmpp.nih.gov/DLBCLpredictor
– 8503 genes.
– 134 cases of GCB, 83 cases of ABC and 57 cases of Type III.
7
Wright et al. (2003)
DLBCL subgroup predictor:
– Linear Predictor Score:
LPS(X) = Σj aj Xj,   with X = (X1, X2, ..., Xn)
– Only the k genes with the most significant t statistic were used to form the LPS; the optimal k was determined by a leave-one-out method. A model including 27 genes had the lowest average error rate.
– Probability of membership in the GCB subgroup:
P(GCB | X) = N(LPS(X); μ1, σ1) / [ N(LPS(X); μ1, σ1) + N(LPS(X); μ2, σ2) ]
where N(x; μ, σ) represents a Normal density function with mean μ and standard deviation σ.
Training set: 67 GCB + 42 ABC. Validation set: 67 GCB + 41 ABC + 57 Type III.
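A rough Python illustration of this predictor: the LPS is a weighted sum of gene expression values, and the subgroup probability follows from two Normal densities. All weights, means and deviations below are hypothetical placeholders, not the values fitted by Wright et al.

```python
import numpy as np
from scipy.stats import norm

def lps(x, a):
    """Linear Predictor Score: weighted sum of the expression values of the selected genes."""
    return float(np.dot(a, x))

def p_gcb(x, a, mu1, sigma1, mu2, sigma2):
    """P(GCB | X) = N(LPS(X); mu1, sigma1) / (N(LPS(X); mu1, sigma1) + N(LPS(X); mu2, sigma2))."""
    s = lps(x, a)
    n1 = norm.pdf(s, loc=mu1, scale=sigma1)  # density of the LPS under the GCB subgroup
    n2 = norm.pdf(s, loc=mu2, scale=sigma2)  # density of the LPS under the ABC subgroup
    return n1 / (n1 + n2)

# Hypothetical example with 3 genes (the real model uses 27 genes and fitted parameters).
x = np.array([0.8, -1.2, 0.3])   # log-ratio expression values of the selected genes
a = np.array([2.1, -1.5, 0.9])   # per-gene weights (t statistics in Wright et al)
prob = p_gcb(x, a, mu1=1.0, sigma1=0.8, mu2=-1.0, sigma2=0.9)
label = "GCB" if prob >= 0.9 else ("ABC" if prob <= 0.1 else "unclassified")  # 90% cutoff
print(prob, label)
```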
8
Wright et al (2003)
This predictor chooses a cutoff of 90% certainty. The samples for which there was <90% probability of being in either subgroup are termed 'unclassified'.
Results:
- It leaves 9.3% of the samples unclassified.
- If the predictor classifies a sample, it is correct 97.0% of the time.
9
Lossos et al (2004)
In this paper, the authors studied 36 genes that had been
reported by experts to predict survival in DLBCL.
In a univariate analysis, genes were ranked on the basis of their
ability to predict survival. With this ranking, they developed a
multivariate model that was only based on the expression of six
genes.
Finally, they proved that this model was sufficient to predict the
survival of a patient with DLBCL.
Mortality-predictor score =
−0.0273 × LMO2 − 0.2103 × BCL6 − 0.1878 × FN1 + 0.0346 × CCND2 + 0.1888 × SCYA3 + 0.5527 × BCL2
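The mortality-predictor score is simply a linear combination of the six gene expression values; a minimal sketch (the expression values in the example are made up):

```python
# Coefficients from the slide; the gene expression values below are hypothetical.
COEFS = {"LMO2": -0.0273, "BCL6": -0.2103, "FN1": -0.1878,
         "CCND2": 0.0346, "SCYA3": 0.1888, "BCL2": 0.5527}

def mortality_score(expr):
    """Mortality-predictor score: weighted sum of the six gene expression values."""
    return sum(coef * expr[gene] for gene, coef in COEFS.items())

print(mortality_score({"LMO2": 1.2, "BCL6": 0.4, "FN1": -0.3,
                       "CCND2": 0.9, "SCYA3": -0.1, "BCL2": 1.5}))
```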
10
Our Approach:
Selective Gaussian Naïve Bayes
It is a modified wrapper method to construct an optimal Naïve
Bayes classifier with a minimum number of predictive genes.
The main steps of the algorithm are:
– Step 1: Anova Phase. A filter method based on one-way analysis of variance (ANOVA): selection of the most significant genes that are not correlated with each other.
– Step 2: Wrapper Phase. Application of a wrapper search method to reduce the gene subset selected by the Anova Phase. It uses a Gaussian Naïve Bayes classifier. The wrapper algorithm is applied M times, so the output of this phase is a group of M feature subsets.
– Step 3: Abduction Phase. A unique gene subset is selected using the group of candidates from the Wrapper Phase.
11
Anova Phase
We propose a filter method to significantly reduce the high number of genes (8503) before applying a costly process such as a wrapper search.
Firstly, the genes are ranked by their F-statistic. A gene with a high F-statistic has different expression values for each subtype of DLBCL; that is to say, it is a good gene for discriminating between the subtypes of DLBCL.
In addition, the correlation between genes is also considered. The correlation is calculated for each pair of genes in the two subclasses; when two genes are correlated in either subclass, the one with the lower F-statistic is eliminated.
12
Anova Phase: ‘The Algorithm’ (1)
1. The genes are ranked by their F-statistic: {G1, ..., Gn}.
2. The gene G1 is selected.
(Diagram: the space of genes, with the selected gene G1 highlighted.)
13
Anova Phase: ‘The Algorithm’ (2)
3. The genes correlated with G1 in either subclass are identified and removed. G1 becomes a final selected gene.
(Diagram: the space of genes, showing the cluster of genes correlated with G1 being removed and G1 marked as a final selected gene.)
14
Anova Phase: ‘The Algorithm’ (3)
4. The next gene in the ranking that has not been removed, Gj, is selected.
5. The genes correlated with Gj are identified and removed too.
(Diagram: the space of genes, showing the newly selected gene Gj with its cluster of correlated genes, and the previously selected final gene.)
15
Anova Phase: ‘The Algorithm’ (4)
6. Steps 4 and 5 are repeated until every gene has been either removed or included in the final subset of selected genes.
(Diagram: the space of genes, containing only the final selected genes.)
This is the reduced final subset of genes selected by the Anova Phase. The initial data set, projected onto this subset of genes, will be the working data set for the Wrapper Phase.
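A minimal sketch of the Anova-phase filter, assuming the per-gene F statistics are already computed and simplifying "correlated" to a plain Pearson correlation above a fixed threshold (the paper instead uses the lower limit of a per-subclass confidence interval, as described later):

```python
import numpy as np

def anova_filter(X, f_stats, corr_threshold=0.5):
    """Greedy filter: walk the genes in decreasing F-statistic order, keep a gene,
    and drop every remaining gene positively correlated with it above the threshold.
    X: (n_samples, n_genes) expression matrix; f_stats: (n_genes,) F statistics."""
    order = np.argsort(f_stats)[::-1]           # genes ranked by F-statistic (step 1)
    corr = np.corrcoef(X, rowvar=False)          # gene-gene correlation matrix
    removed = np.zeros(X.shape[1], dtype=bool)
    selected = []
    for g in order:
        if removed[g]:
            continue
        selected.append(g)                       # Gj becomes a final selected gene
        removed |= corr[g] > corr_threshold      # remove its positively correlated cluster
        removed[g] = True
    return selected

# Hypothetical toy data: 20 samples, 50 genes, random F statistics.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))
print(anova_filter(X, rng.random(50)))
```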
16
Anova Phase
There are several important properties of this filter method:
– The most informative genes that are not positively correlated with each other are selected.
– Genes that are negatively correlated with each other are not eliminated, since negative correlation does not introduce redundancy into the final data set.
– A gene with a low score is not eliminated if it is not positively correlated with the other genes. This does not happen in traditional filter methods based on gene ranking, where the least informative genes are eliminated directly.
17
Wrapper Phase
In this phase a wrapper algorithm is applied to find a reduced gene subset from the one obtained in the Anova Phase.
The underlying classification model is a Naïve Bayes classifier with continuous variables. We assume these variables follow a Gaussian distribution given the class.
This phase uses a K-fold cross-validation (KFC) methodology. It is used not only to estimate the classifier accuracy, but also to obtain a group of candidate feature subsets for the final classification subset.
18
Wrapper Phase
As indicated by the KFC methodology, the data set D is randomly partitioned into K disjoint subsets, {D1, ..., DK}, and the training algorithm is applied K times, once to each subset Tk = D − Dk.
So, if a wrapper algorithm is applied to each Tk subset, an optimal feature subset Gk is obtained for each fold. At the end of the procedure, the Wrapper Phase has produced K feature subsets.
In this way, if this process is repeated m times (each time with a new random partition of the data set D into K subsets), we obtain M = K × m feature subsets.
19
5-fold cross methodology in the Wrapper Phase (diagram):
– The data set is randomly partitioned into 5 subsets; a training data set and a testing data set are defined with the 5-fold cross methodology.
– A wrapper algorithm is applied to the training data set, and an optimal gene subset is obtained (e.g. {G1, G4, G5, G7, G10}).
– A new training data set is defined with the 5-fold cross methodology, the wrapper algorithm is applied again, and another optimal gene subset is obtained (e.g. {G1, G4, G5, G6, G9}).
20
Wrapper Phase
At the end, 5 gene subsets are obtained in this Wrapper Phase:
– {G1, G4, G5, G7, G10}
– {G1, G4, G5, G6, G9}
– {G2, G4, G5, G7, G11}
– {G1, G4, G2, G8, G9}
– {G1, G4, G2, G6, G9}
If this process is repeated (with a new random partition of the data set into 5 subsets), 5 new gene subsets are obtained. Therefore, if the process is repeated m times, we obtain M = 5 × m distinct gene subsets.
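A sketch of the outer loop that produces the M = K × m candidate subsets, assuming a `wrapper_select(train_X, train_y)` function such as the forward-selection sketch given later:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def candidate_subsets(X, y, wrapper_select, K=5, m=3, seed=0):
    """Repeat an m-times K-fold split; run the wrapper on each training fold,
    collecting M = K * m candidate gene subsets."""
    subsets = []
    for rep in range(m):
        skf = StratifiedKFold(n_splits=K, shuffle=True, random_state=seed + rep)
        for train_idx, _ in skf.split(X, y):
            subsets.append(wrapper_select(X[train_idx], y[train_idx]))
    return subsets  # list of M gene-index subsets

# Toy usage with a dummy wrapper that just returns the 3 highest-variance genes.
rng = np.random.default_rng(1)
X, y = rng.normal(size=(40, 10)), np.array([0, 1] * 20)
dummy = lambda Xtr, ytr: list(np.argsort(Xtr.var(axis=0))[-3:])
print(len(candidate_subsets(X, y, dummy, K=5, m=3)))  # -> 15
```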
21
Gaussian Naïve Bayes
Classifier
It is a simple classifier introduced by Langley et al (1992).
This classifier model assumes the following hypotheses:
– Each gene is independent of the other genes given the class.
– Each gene is considered a continuous random variable that follows a Normal distribution given the class.
An important advantage of this classifier is that it can work with missing values in the data set.
(Diagram: Naïve Bayes classifier structure, with the class node C pointing to the gene nodes G1, G2, G3 and G4.)
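A minimal Gaussian Naïve Bayes sketch that tolerates missing values (encoded as NaN) by skipping those genes in the product of per-gene densities. It illustrates the model assumptions above; it is not the authors' implementation:

```python
import numpy as np
from scipy.stats import norm

class GaussianNaiveBayes:
    def fit(self, X, y):
        """Estimate a per-class prior and, for each gene, a per-class mean and
        standard deviation, ignoring NaN entries."""
        self.classes_ = np.unique(y)
        self.priors_ = {c: np.mean(y == c) for c in self.classes_}
        self.mu_ = {c: np.nanmean(X[y == c], axis=0) for c in self.classes_}
        self.sigma_ = {c: np.nanstd(X[y == c], axis=0) + 1e-6 for c in self.classes_}
        return self

    def predict_proba(self, x):
        """Posterior over classes for one sample; genes with NaN are skipped."""
        obs = ~np.isnan(x)
        logp = {c: np.log(self.priors_[c]) +
                   norm.logpdf(x[obs], self.mu_[c][obs], self.sigma_[c][obs]).sum()
                for c in self.classes_}
        m = max(logp.values())
        w = {c: np.exp(v - m) for c, v in logp.items()}
        z = sum(w.values())
        return {c: w[c] / z for c in self.classes_}

# Toy usage: two classes, three genes, one missing value in the query sample.
X = np.array([[1.0, 2.0, 0.5], [1.2, 1.8, 0.4], [-1.0, -2.0, 0.1], [-0.8, -1.9, 0.2]])
y = np.array(["GCB", "GCB", "ABC", "ABC"])
print(GaussianNaiveBayes().fit(X, y).predict_proba(np.array([1.1, np.nan, 0.45])))
```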
22
The Wrapper Algorithm
A wrapper algorithm is a stage-wise algorithm: in each stage of the algorithm, a unique feature subset is obtained.
A simple wrapper search algorithm is implemented with the following parameters:
– Search Strategy: sequential selection.
– Initial Subset: the empty set.
– Evaluation Function: accuracy of the Gaussian Naïve Bayes classifier built from the feature subset selected at this stage. This accuracy is again calculated using K-fold cross-validation.
The wrapper algorithm begins with an empty set of features and, at each stage, sequentially adds the feature that maximizes the evaluation function.
When several features maximize the evaluation function, the one with the highest F-statistic is selected.
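A sketch of this sequential forward selection, using K-fold cross-validated accuracy of a Gaussian Naïve Bayes model as the evaluation function (scikit-learn is used here for brevity) and breaking ties on accuracy by the F statistic, as on the slide. The simple "stop when no candidate improves the score" rule below is a simplification; the actual stop condition is discussed on the next slides.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

def wrapper_select(X, y, f_stats, K=10):
    """Greedy forward selection: start from the empty set and repeatedly add the
    gene that maximizes K-fold cross-validated accuracy of a Gaussian NB."""
    selected, best_score = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = [(cross_val_score(GaussianNB(), X[:, selected + [g]], y, cv=K).mean(),
                   f_stats[g], g) for g in remaining]
        score, _, g = max(scores)        # ties on accuracy broken by the larger F statistic
        if score <= best_score:          # simplified stop rule: no improvement
            break
        selected.append(g)
        remaining.remove(g)
        best_score = score
    return selected
```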
23
Wrapper Algorithm:
Stop Condition
Due to the low number of cases in the training data set (100 cases in our case), it is more probable than in traditional data sets with a large number of cases that two similar subsets of features have the same accuracy (within 1% in our case). So it is reasonable to expect the evolution of the error rate to be very discontinuous and to decrease slowly.
The typical error rate evolution is shown in the following graphic:
(Plot: error rate versus stage number, for stages 1 to 15.)
24
Wrapper Algorithm:
Stop Condition
General stop conditions:
– Stop if ∆r ≥ 0 (stop if there is no improvement):
• Early stopping. It is hard to get an improvement at every stage. (In the preceding case, this condition stops at stage 4.)
– Stop if ∆r > 0 (stop if there is a deterioration):
• There is an overfitting problem, because many unnecessary features are introduced. (In the preceding case, this condition stops at stage 12.)
The parameters are: #(P), the stage of the algorithm; rp, the current error rate; ∆r = rp − rp-1, the error rate increment.
The heuristic stop condition implemented:
(Plots: error rate r and increment ∆r versus stage number, showing where the heuristic condition evaluates to True or False. The heuristic combines two criteria: avoid overfitting OR avoid early stopping.)
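The two general stop conditions translate directly into checks on the error-rate increment ∆r = rp − rp-1. A minimal sketch of both; the heuristic actually implemented combines the two ideas (avoid overfitting OR avoid early stopping) and is only shown graphically on the slide:

```python
def stop_no_improvement(error_history):
    """Stop if the last stage did not improve the error rate (delta_r >= 0): risks early stopping."""
    return len(error_history) >= 2 and error_history[-1] - error_history[-2] >= 0

def stop_on_deterioration(error_history):
    """Stop only when the error rate gets worse (delta_r > 0): risks overfitting."""
    return len(error_history) >= 2 and error_history[-1] - error_history[-2] > 0

# On the slide's example, the first rule stops at stage 4 and the second at stage 12.
```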
25
Abduction Phase
In the previous Wrapper Phase, M feature subsets were obtained as candidates for the final classification subset.
The procedure to select a unique feature subset is based on abductive inference over a Bayesian network (BN) that encodes the joint probability distribution of a feature subset being selected by the wrapper algorithm.
Firstly, a BN is learned from the M feature subsets selected in the Wrapper Phase, {F1, ..., FM}, following this procedure:
– Let Phi = {G1, ..., Gp} be the set of all features appearing in the subsets F1, ..., FM. For each Gi, we define a new discrete variable Yi with two states, {absent, present}.
– Learn a BN with the K2 learning algorithm.
26
Abduction Phase: An Example
Let
– F1={G1,G3,G7,G8} F2={G2,G5,G7,G8}
– F3={G1,G3,G2,G5} F4={G1,G9,G10} F5={G1,G3,G7,G8,G10}
So Phi={G1,G2,G3,G5,G6,G7,G8,G9,G10}, and we create the corresponding set of variables {Y1,Y2,Y3,Y5,Y6,Y7,Y8,Y9,Y10}.
The cases are then built from these subsets (one case per subset, with each Yi set to 'present' or 'absent').
With this data set, a BN is learnt by the K2 learning algorithm. Secondly, an abduction algorithm is applied over the BN to obtain its two most probable configurations.
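The absent/present cases can be built from the example subsets with a simple binary encoding. A sketch using the five subsets listed above (here Phi is derived as the union of the subsets; the K2 structure learning and the abductive inference steps are not shown):

```python
subsets = [
    {"G1", "G3", "G7", "G8"},          # F1
    {"G2", "G5", "G7", "G8"},          # F2
    {"G1", "G3", "G2", "G5"},          # F3
    {"G1", "G9", "G10"},               # F4
    {"G1", "G3", "G7", "G8", "G10"},   # F5
]
genes = sorted(set().union(*subsets), key=lambda g: int(g[1:]))  # Phi, in gene order
# One case per wrapper run: Yi = 'present' if gene Gi belongs to that subset.
cases = [{g: ("present" if g in F else "absent") for g in genes} for F in subsets]
for case in cases:
    print(case)
```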
27
Abduction Phase
Secondly, an abduction algorithm is applied to obtain the T most probable configurations: {y*1, ..., y*T}.
For each configuration y*i = (y1, ..., yp), we get a new feature subset Hi containing the variables in the 'present' state.
At the end, we have the T subsets most probable to have been selected in the Wrapper Phase, {H1, ..., HT}.
The final classification subset is the one that minimizes the average of the −logarithm of the likelihood of the class over the complete training data set, using a GNB model and a leave-one-out methodology.
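A sketch of this final selection step: each candidate subset is scored by the leave-one-out average negative log-likelihood of the true class under a Gaussian Naïve Bayes model, and the subset with the lowest score is kept. scikit-learn's GaussianNB stands in here for the paper's GNB model:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import LeaveOneOut

def loo_neg_log_likelihood(X, y, subset):
    """Average -log P(true class | x) over leave-one-out folds, using only `subset` genes."""
    nll = []
    for train, test in LeaveOneOut().split(X):
        model = GaussianNB().fit(X[np.ix_(train, subset)], y[train])
        proba = model.predict_proba(X[np.ix_(test, subset)])[0]
        cls = list(model.classes_).index(y[test][0])
        nll.append(-np.log(proba[cls] + 1e-12))   # guard against log(0)
    return float(np.mean(nll))

def choose_final_subset(X, y, candidates):
    """Return the candidate subset (list of gene indices) with the smallest LOO -log likelihood."""
    return min(candidates, key=lambda H: loo_neg_log_likelihood(X, y, H))
```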
28
Abduction Phase: An Example
So, for example, we obtain these two configurations as the most probable ones:
From these two configurations, we get the two feature subsets most probable to have been selected by the wrapper algorithm. In this case they are:
- H1 = {G1,G3,G7,G8}
- H2 = {G1,G2,G3,G5,G7,G8}
29
DLBCL Classification
The data set used in this paper is the one in Rosenwald et al (2002). This data set contains 8503 genes and the following cases:
– 134 cases of GCB.
– 83 cases of ABC.
– 57 cases of Type III.
The data set was randomly divided into a training and a testing group in the following way:
– Training Data Set: 67 GCB cases + 42 ABC cases.
– Testing Data Set: 67 GCB + 42 ABC + 57 Type III cases.
The classifier accuracy estimation is obtained through the evaluation of the classifier on 10 different divisions of this data.
30
DLBCL Classification
The parameters for the implementation of the three phases are:
– A gene X is considered correlated with a gene Y if the lower limit of the confidence interval for their correlation coefficient is greater than 0.15 (see the sketch after this list).
– The KFC validation procedures were carried out with K = 10.
– M = 3 × 10 = 30 candidate feature subsets were obtained in the Wrapper Phase.
– The 20 most probable explanations were evaluated in the Abduction Phase.
– Samples with a probability lower than 80% of belonging to either subgroup were termed 'Unclassified'.
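The correlation criterion in the first parameter above can be checked with a Fisher z confidence interval for the Pearson correlation. A minimal sketch for a single pair of expression vectors (the paper applies the criterion within each subclass), assuming a 95% interval and the 0.15 threshold:

```python
import numpy as np
from scipy.stats import pearsonr, norm

def correlated(x, y, lower_limit=0.15, confidence=0.95):
    """Declare two genes correlated if the lower bound of the Fisher-z confidence
    interval for their Pearson correlation exceeds `lower_limit`."""
    r, _ = pearsonr(x, y)
    n = len(x)
    z = np.arctanh(r)                                    # Fisher z-transform of r
    half_width = norm.ppf(0.5 + confidence / 2) / np.sqrt(n - 3)
    lower = np.tanh(z - half_width)                      # back-transform the lower bound
    return lower > lower_limit
```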
31
Results I
 Anova Phase: (95% confidence intervals)
– Size (number of genes): [74.3, 83.1]
– Train accuracy rate (%): [96.8, 98.6]
– Test accuracy rate (%): [92.8, 95.4]
– Test −log likelihood: [0.38, 0.68]
– Type III test −log likelihood: 'Infinity'
(Table: model prediction versus DLBCL subgroup for the training and validation sets.)
32
Results I: Conclusions
The Anova Phase selects only 1% of all the genes (78 genes out of 8503), and the Gaussian Naïve Bayes classifier achieves 94.1% accuracy with these genes.
In this case, there are several Type III cases to which the classifier assigns extreme probabilities.
The classifier leaves only 2.5% of the cases unclassified. If the predictor classifies a sample, it is correct in 95.4% of the cases.
33
Results II
 Anova Phase + Wrapper (search) Phase: (95% confidence intervals)
– Size (number of genes): [6.17, 7.82]
– Train accuracy rate (%): [95.2, 98.0]
– Test accuracy rate (%): [88.83, 91.9]
– Test −log likelihood: [0.25, 0.37]
– Type III test −log likelihood: [4.12, 5.08]
(Table: model prediction versus DLBCL subgroup for the training and validation sets.)
34
Results II: Conclusions
The number of genes is reduced to about 10%, from 78 to 7 genes.
The average −log likelihood is reduced to half of that obtained in the Anova Phase. Therefore, this is a better classifier.
For the Type III class, the predictor does not assign full probability of belonging to a class to any of its samples.
The classifier leaves 9.1% of the cases unclassified. If the predictor classifies a sample, it is correct in 93.2% of the cases.
In Wright et al., 9.2% of the cases are unclassified and, if the predictor classifies a sample, it is correct in 96.9% of the cases. But their evaluation is done on a single partition of the data set, so this percentage is not very reliable. In addition, some evaluations of our classifier achieved better accuracy.
On the other hand, we substantially reduced the number of genes in all the cases. We obtain similar results with 7 genes, compared with the 27 of Wright et al.
35
Conclusions
We obtain a simple classification method that provides good results.
We have developed a new feature subset selection method that is very robust for data sets with many more features than cases.
The use of an abduction process provides several candidates that summarize the distinct runs of the wrapper algorithm.
Three genes (LMO2, BCL6 and CCND2) are selected several times by our classification model and are among the 6 genes selected by Lossos et al., using medical information, to predict survival in DLBCL.
36
Future work
 Develop more sophisticated models:
– Include replacement variables to manage missing data.
– Consider multidimensional Gaussian distributions.
– Improve the MTE Gaussian Naïve Bayes model.
 Apply this model to other data sets such as breast cancer, colon cancer, ...
 Compare with other models with discrete variables.