1
Application of a Selective Gaussian
Naïve Bayes Model for Diffuse-Large
B-Cell Lymphoma Classification
A. Cano, J. García, A. Masegosa, S. Moral
Dpt. Computer Science and Artificial Intelligence
University of Granada. Spain.
E-mail: {acu,fjgc,andrew,smc}@decsai.ugr.es
2
Diffuse Large B-Cell Lymphoma
Diffuse Large B-Cell Lymphoma (DLBCL), the most common subtype of non-Hodgkin's lymphoma, has long been enigmatic in that 40 per cent of patients can be cured by combination chemotherapy, whereas the remainder succumb to the disease.
Alizadeh et al (2000) discovered, using gene expression profiling, that DLBCL actually comprises two different diseases that are indistinguishable by current diagnostic methods. One subtype of DLBCL, termed germinal center B-like DLBCL (GCB), has a high survival index, whereas the other, termed activated B-cell like (ABC), has a low survival index.
The gene expression profiling was carried out with a specialized cDNA microarray, 'the Lymphochip', which allows the expression of thousands of genes that are preferentially expressed in lymphoid cells to be quantified in parallel.
3
cDNA Microarray
This is a typical image of fluorescent hybridization on a cDNA microarray. Each row represents a separate cDNA clone on the microarray and each column a separate mRNA sample. In this case, there are 96 samples and 4096 cDNA clones.
The ratios are a measure of relative gene expression in each experimental sample. As indicated, the scale extends from fluorescence ratios of 0.25 to 4 (-2 to +2 in log base 2 units). Grey indicates missing or excluded data.
In these data sets, there are many missing or excluded values.
4
Diffuse Large B-Cell Lymphoma
Classification
After Alizadeh et al (2000), there are three important approaches to the classification of Diffuse Large B-Cell Lymphoma:
– Rosenwald et al (2002): built a new database with more cases (274). Alizadeh et al had only 42 cases of Diffuse Large B-Cell Lymphoma.
– Wright et al (2003): found 27 genes to classify GCB versus ABC.
– Lossos et al (2004): found only 6 genes to estimate the survival index of a patient.
5
Rosenwald et al (2002)
Andreas Rosenwald et al. 2002. The use of molecular profiling to
predict survival after chemotherapy for diffuse large-B-cell
lymphoma. New England Journal of Medicine, 346:1937–1947,
June.
Biopsy samples of diffuse large-B-cell lymphoma from 240 patients were examined for gene expression with the aid of DNA microarrays.
Using hierarchical clustering, a new subclass of DLBCL, called Type III, with an intermediate probability of survival, was found.
They construct a predictor of overall survival after chemotherapy based on a linear combination of four signatures. A signature is a biological group of genes.
6
Wright et al (2003)
Wright et al. 2003. A gene expression-based method to diagnose
clinically distinct subgroups of diffuse large b cell lymphoma.
Proceedings of National Academy of Sciences of the United States of
America, 100:9991–9996, August.
A Bayesian predictor is proposed that estimates the probability of membership in one of the two cancer subgroups (GCB or ABC), using the data set of Rosenwald et al (2002).
Gene Expression Data: http://llmpp.nih.gov/DLBCLpredictor
– 8503 genes.
– 134 cases of GCB, 83 cases of ABC and 57 cases of Type III.
7
Wright et al. (2003)
DLBCL subgroup predictor:
– Linear Predictor Score:
LPS(X) = Σj aj Xj,   with X = (X1, X2, ..., Xn)
– Only the k genes with the most significant t statistic were used to form the LPS; the optimal k was determined by a leave-one-out method. A model including 27 genes had the lowest average error rate.
– Probability of membership in the GCB subgroup:
P(GCB | X) = N(LPS(X); μ1, σ1) / [ N(LPS(X); μ1, σ1) + N(LPS(X); μ2, σ2) ]
where N(x; μ, σ) represents a Normal density function with mean μ and standard deviation σ.
Training set: 67 GCB + 42 ABC. Validation set: 67 GCB + 41 ABC + 57 Type III.
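A rough Python illustration of this predictor: the LPS is a weighted sum of gene expression values, and the subgroup probability follows from two Normal densities. All weights, means and deviations below are hypothetical placeholders, not the values fitted by Wright et al.

```python
import numpy as np
from scipy.stats import norm

def lps(x, a):
    """Linear Predictor Score: weighted sum of the expression values of the selected genes."""
    return float(np.dot(a, x))

def p_gcb(x, a, mu1, sigma1, mu2, sigma2):
    """P(GCB | X) = N(LPS(X); mu1, sigma1) / (N(LPS(X); mu1, sigma1) + N(LPS(X); mu2, sigma2))."""
    s = lps(x, a)
    n1 = norm.pdf(s, loc=mu1, scale=sigma1)  # density of the LPS under the GCB subgroup
    n2 = norm.pdf(s, loc=mu2, scale=sigma2)  # density of the LPS under the ABC subgroup
    return n1 / (n1 + n2)

# Hypothetical example with 3 genes (the real model uses 27 genes and fitted parameters).
x = np.array([0.8, -1.2, 0.3])   # log-ratio expression values of the selected genes
a = np.array([2.1, -1.5, 0.9])   # per-gene weights (t statistics in Wright et al)
prob = p_gcb(x, a, mu1=1.0, sigma1=0.8, mu2=-1.0, sigma2=0.9)
label = "GCB" if prob >= 0.9 else ("ABC" if prob <= 0.1 else "unclassified")  # 90% cutoff
print(prob, label)
```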
8
Wright et al (2003)
This predictor chooses a cutoff of 90% certainty. The samples for which there was <90% probability of being in either subgroup are termed 'unclassified'.
Results:
- It leaves 9.3% of the samples unclassified.
- If the predictor classifies a sample, it is correct 97.0% of the time.
9
Lossos et al (2004)
In this paper, the authors studied 36 genes that had been
reported by experts to predict survival in DLBCL.
In a univariate analysis, genes were ranked on the basis of their
ability to predict survival. With this ranking, they developed a
multivariate model that was only based on the expression of six
genes.
Finally, they proved that this model was sufficient to predict the
survival of a patient with DLBCL.
Mortality-predictor score =
−0.0273 × LMO2 − 0.2103 × BCL6 − 0.1878 × FN1 + 0.0346 × CCND2 + 0.1888 × SCYA3 + 0.5527 × BCL2
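The mortality-predictor score is simply a linear combination of the six gene expression values; a minimal sketch (the expression values in the example are made up):

```python
# Coefficients from the slide; the gene expression values below are hypothetical.
COEFS = {"LMO2": -0.0273, "BCL6": -0.2103, "FN1": -0.1878,
         "CCND2": 0.0346, "SCYA3": 0.1888, "BCL2": 0.5527}

def mortality_score(expr):
    """Mortality-predictor score: weighted sum of the six gene expression values."""
    return sum(coef * expr[gene] for gene, coef in COEFS.items())

print(mortality_score({"LMO2": 1.2, "BCL6": 0.4, "FN1": -0.3,
                       "CCND2": 0.9, "SCYA3": -0.1, "BCL2": 1.5}))
```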
10
Our Approach:
Selective Gaussian Naïve Bayes
It is a modified wrapper method to construct an optimal Naïve
Bayes classifier with a minimum number of predictive genes.
The main steps of the algorithm are:
– Step 1: Anova Phase. A filter method based on one-way analysis of variance (ANOVA): selection of the most significant genes that are not correlated with each other.
– Step 2: Wrapper Phase. Application of a wrapper search method to reduce the gene subset selected by the Anova Phase. It uses a Gaussian Naïve Bayes classifier. The wrapper algorithm is applied M times, so the output of this phase is a group of M feature subsets.
– Step 3: Abduction Phase. A unique gene subset is selected using the group of candidates from the Wrapper Phase.
11
Anova Phase
We propose a filter method to significantly reduce the high number of genes (8503) before applying a costly process such as a wrapper search.
Firstly, the genes are ranked by their F-statistic. A gene with a high F-statistic has different expression values for each subtype of DLBCL; that is to say, it is a good gene for discriminating between the subtypes of DLBCL.
In addition, the correlation between genes is also considered. The correlation is calculated for each pair of genes in the two subclasses; when two genes are correlated in either subclass, the one with the lower F-statistic is eliminated.
12
Anova Phase: ‘The Algorithm’ (1)
1. The genes are ranked by their F-statistic: {G1, ..., Gn}.
2. The gene G1 is selected.
(Diagram: the space of genes, with the selected gene G1 highlighted.)
13
Anova Phase: ‘The Algorithm’ (2)
3. The genes correlated with G1 in either subclass are identified and removed. G1 becomes a final selected gene.
(Diagram: the space of genes, showing the cluster of genes correlated with G1 being removed and G1 marked as a final selected gene.)
14
Anova Phase: ‘The Algorithm’ (3)
4. The next gene in the ranking that has not been removed, Gj, is selected.
5. The genes correlated with Gj are identified and removed too.
(Diagram: the space of genes, showing the newly selected gene Gj with its cluster of correlated genes, and the previously selected final gene.)
15
Anova Phase: ‘The Algorithm’ (4)
6. Steps 4 and 5 are repeated until every gene has been either removed or included in the final subset of selected genes.
(Diagram: the space of genes, containing only the final selected genes.)
This is the reduced final subset of genes selected by the Anova Phase. The initial data set, projected onto this subset of genes, will be the working data set for the Wrapper Phase.
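A minimal sketch of the Anova-phase filter, assuming the per-gene F statistics are already computed and simplifying "correlated" to a plain Pearson correlation above a fixed threshold (the paper instead uses the lower limit of a per-subclass confidence interval, as described later):

```python
import numpy as np

def anova_filter(X, f_stats, corr_threshold=0.5):
    """Greedy filter: walk the genes in decreasing F-statistic order, keep a gene,
    and drop every remaining gene positively correlated with it above the threshold.
    X: (n_samples, n_genes) expression matrix; f_stats: (n_genes,) F statistics."""
    order = np.argsort(f_stats)[::-1]           # genes ranked by F-statistic (step 1)
    corr = np.corrcoef(X, rowvar=False)          # gene-gene correlation matrix
    removed = np.zeros(X.shape[1], dtype=bool)
    selected = []
    for g in order:
        if removed[g]:
            continue
        selected.append(g)                       # Gj becomes a final selected gene
        removed |= corr[g] > corr_threshold      # remove its positively correlated cluster
        removed[g] = True
    return selected

# Hypothetical toy data: 20 samples, 50 genes, random F statistics.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))
print(anova_filter(X, rng.random(50)))
```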
16
Anova Phase
There are several important properties of this filter method:
– The most informative genes that are not positively correlated with each other are selected.
– Genes that are negatively correlated with each other are not eliminated, since negative correlation does not introduce redundancy into the final data set.
– A gene with a low score is not eliminated if it is not positively correlated with the other genes. This does not happen in traditional filter methods based on gene ranking, where the least informative genes are eliminated directly.
17
Wrapper Phase
In this phase a wrapper algorithm is applied to find a reduced gene subset from the one obtained in the Anova Phase.
The underlying classification model is a Naïve Bayes classifier with continuous variables. We assume these variables follow a Gaussian distribution given the class.
This phase uses a K-fold cross-validation (KFC) methodology. It is used not only to estimate the classifier accuracy, but also to obtain a group of candidate feature subsets for the final classification subset.
18
Wrapper Phase
As indicated by the KFC methodology, the data set D is randomly partitioned into K disjoint subsets, {D1, ..., DK}, and the training algorithm is applied K times, once to each subset Tk = D − Dk.
So, if a wrapper algorithm is applied to each Tk subset, an optimal feature subset Gk is obtained for each fold. At the end of the procedure, the Wrapper Phase has produced K feature subsets.
In this way, if this process is repeated m times (each time with a new random partition of the data set D into K subsets), we obtain M = K × m feature subsets.
19
5-fold cross methodology in the Wrapper Phase (diagram):
– The data set is randomly partitioned into 5 subsets; a training data set and a testing data set are defined with the 5-fold cross methodology.
– A wrapper algorithm is applied to the training data set, and an optimal gene subset is obtained (e.g. {G1, G4, G5, G7, G10}).
– A new training data set is defined with the 5-fold cross methodology, the wrapper algorithm is applied again, and another optimal gene subset is obtained (e.g. {G1, G4, G5, G6, G9}).
20
Wrapper Phase
At the end, 5 gene subsets are obtained in this Wrapper Phase:
– {G1, G4, G5, G7, G10}
– {G1, G4, G5, G6, G9}
– {G2, G4, G5, G7, G11}
– {G1, G4, G2, G8, G9}
– {G1, G4, G2, G6, G9}
If this process is repeated (with a new random partition of the data set into 5 subsets), 5 new gene subsets are obtained. Therefore, if the process is repeated m times, we obtain M = 5 × m distinct gene subsets.
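A sketch of the outer loop that produces the M = K × m candidate subsets, assuming a `wrapper_select(train_X, train_y)` function such as the forward-selection sketch given later:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def candidate_subsets(X, y, wrapper_select, K=5, m=3, seed=0):
    """Repeat an m-times K-fold split; run the wrapper on each training fold,
    collecting M = K * m candidate gene subsets."""
    subsets = []
    for rep in range(m):
        skf = StratifiedKFold(n_splits=K, shuffle=True, random_state=seed + rep)
        for train_idx, _ in skf.split(X, y):
            subsets.append(wrapper_select(X[train_idx], y[train_idx]))
    return subsets  # list of M gene-index subsets

# Toy usage with a dummy wrapper that just returns the 3 highest-variance genes.
rng = np.random.default_rng(1)
X, y = rng.normal(size=(40, 10)), np.array([0, 1] * 20)
dummy = lambda Xtr, ytr: list(np.argsort(Xtr.var(axis=0))[-3:])
print(len(candidate_subsets(X, y, dummy, K=5, m=3)))  # -> 15
```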
21
Gaussian Naïve Bayes
Classifier
It is a simple classifier introduced by Langley et al (1992).
This classifier model assumes the following hypotheses:
– Each gene is independent of the other genes given the class.
– Each gene is considered a continuous random variable that follows a Normal distribution given the class.
An important advantage of this classifier is that it can work with missing values in the data set.
(Diagram: Naïve Bayes classifier structure, with the class node C pointing to the gene nodes G1, G2, G3 and G4.)
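A minimal Gaussian Naïve Bayes sketch that tolerates missing values (encoded as NaN) by skipping those genes in the product of per-gene densities. It illustrates the model assumptions above; it is not the authors' implementation:

```python
import numpy as np
from scipy.stats import norm

class GaussianNaiveBayes:
    def fit(self, X, y):
        """Estimate a per-class prior and, for each gene, a per-class mean and
        standard deviation, ignoring NaN entries."""
        self.classes_ = np.unique(y)
        self.priors_ = {c: np.mean(y == c) for c in self.classes_}
        self.mu_ = {c: np.nanmean(X[y == c], axis=0) for c in self.classes_}
        self.sigma_ = {c: np.nanstd(X[y == c], axis=0) + 1e-6 for c in self.classes_}
        return self

    def predict_proba(self, x):
        """Posterior over classes for one sample; genes with NaN are skipped."""
        obs = ~np.isnan(x)
        logp = {c: np.log(self.priors_[c]) +
                   norm.logpdf(x[obs], self.mu_[c][obs], self.sigma_[c][obs]).sum()
                for c in self.classes_}
        m = max(logp.values())
        w = {c: np.exp(v - m) for c, v in logp.items()}
        z = sum(w.values())
        return {c: w[c] / z for c in self.classes_}

# Toy usage: two classes, three genes, one missing value in the query sample.
X = np.array([[1.0, 2.0, 0.5], [1.2, 1.8, 0.4], [-1.0, -2.0, 0.1], [-0.8, -1.9, 0.2]])
y = np.array(["GCB", "GCB", "ABC", "ABC"])
print(GaussianNaiveBayes().fit(X, y).predict_proba(np.array([1.1, np.nan, 0.45])))
```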
22
The Wrapper Algorithm
A wrapper algorithm is a stage-wise algorithm: in each stage of the algorithm, a unique feature subset is obtained.
A simple wrapper search algorithm is implemented with the following parameters:
– Search Strategy: sequential selection.
– Initial Subset: the empty set.
– Evaluation Function: accuracy of the Gaussian Naïve Bayes classifier built from the feature subset selected at this stage. This accuracy is again calculated using K-fold cross-validation.
The wrapper algorithm begins with an empty set of features and, at each stage, sequentially adds the feature that maximizes the evaluation function.
When several features maximize the evaluation function, the one with the highest F-statistic is selected.
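A sketch of this sequential forward selection, using K-fold cross-validated accuracy of a Gaussian Naïve Bayes model as the evaluation function (scikit-learn is used here for brevity) and breaking ties on accuracy by the F statistic, as on the slide. The simple "stop when no candidate improves the score" rule below is a simplification; the actual stop condition is discussed on the next slides.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

def wrapper_select(X, y, f_stats, K=10):
    """Greedy forward selection: start from the empty set and repeatedly add the
    gene that maximizes K-fold cross-validated accuracy of a Gaussian NB."""
    selected, best_score = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = [(cross_val_score(GaussianNB(), X[:, selected + [g]], y, cv=K).mean(),
                   f_stats[g], g) for g in remaining]
        score, _, g = max(scores)        # ties on accuracy broken by the larger F statistic
        if score <= best_score:          # simplified stop rule: no improvement
            break
        selected.append(g)
        remaining.remove(g)
        best_score = score
    return selected
```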
23
Wrapper Algorithm:
Stop Condition
Due to the low number of cases in the training data set (100 cases in our case), it is more probable than in traditional data sets with a large number of cases that two similar subsets of features have the same accuracy (within 1% in our case). So it is reasonable to expect the evolution of the error rate to be very discontinuous and to decrease slowly.
The typical error rate evolution is shown in the following graphic:
(Plot: error rate versus stage number, for stages 1 to 15.)
24
Wrapper Algorithm:
Stop Condition
General stop conditions:
– Stop if ∆r ≥ 0 (stop if there is no improvement):
• Early stopping. It is hard to get an improvement at every stage. (In the preceding case, this condition stops at stage 4.)
– Stop if ∆r > 0 (stop if there is a deterioration):
• There is an overfitting problem, because many unnecessary features are introduced. (In the preceding case, this condition stops at stage 12.)
The parameters are: #(P), the stage of the algorithm; rp, the current error rate; ∆r = rp − rp-1, the error rate increment.
The heuristic stop condition implemented:
(Plots: error rate r and increment ∆r versus stage number, showing where the heuristic condition evaluates to True or False. The heuristic combines two criteria: avoid overfitting OR avoid early stopping.)
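The two general stop conditions translate directly into checks on the error-rate increment ∆r = rp − rp-1. A minimal sketch of both; the heuristic actually implemented combines the two ideas (avoid overfitting OR avoid early stopping) and is only shown graphically on the slide:

```python
def stop_no_improvement(error_history):
    """Stop if the last stage did not improve the error rate (delta_r >= 0): risks early stopping."""
    return len(error_history) >= 2 and error_history[-1] - error_history[-2] >= 0

def stop_on_deterioration(error_history):
    """Stop only when the error rate gets worse (delta_r > 0): risks overfitting."""
    return len(error_history) >= 2 and error_history[-1] - error_history[-2] > 0

# On the slide's example, the first rule stops at stage 4 and the second at stage 12.
```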
25
Abduction Phase
In the previous Wrapper Phase, M feature subsets were obtained as candidates for the final classification subset.
The procedure to select a unique feature subset is based on abductive inference over a Bayesian network (BN) that encodes the joint probability distribution of a feature subset being selected by the wrapper algorithm.
Firstly, a BN is learned from the M feature subsets selected in the Wrapper Phase, {F1, ..., FM}, following this procedure:
– Let Phi = {G1, ..., Gp} be the set of all features appearing in the subsets F1, ..., FM. For each Gi, we define a new discrete variable Yi with two states, {absent, present}.
– Learn a BN with the K2 learning algorithm.
26
Abduction Phase: An Example
Let
– F1={G1,G3,G7,G8} F2={G2,G5,G7,G8}
– F3={G1,G3,G2,G5} F4={G1,G9,G10} F5={G1,G3,G7,G8,G10}
So Phi={G1,G2,G3,G5,G6,G7,G8,G9,G10}, and we create the corresponding set of variables {Y1,Y2,Y3,Y5,Y6,Y7,Y8,Y9,Y10}.
The cases are then built from these subsets (one case per subset, with each Yi set to 'present' or 'absent').
With this data set, a BN is learnt by the K2 learning algorithm. Secondly, an abduction algorithm is applied over the BN to obtain its two most probable configurations.
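The absent/present cases can be built from the example subsets with a simple binary encoding. A sketch using the five subsets listed above (here Phi is derived as the union of the subsets; the K2 structure learning and the abductive inference steps are not shown):

```python
subsets = [
    {"G1", "G3", "G7", "G8"},          # F1
    {"G2", "G5", "G7", "G8"},          # F2
    {"G1", "G3", "G2", "G5"},          # F3
    {"G1", "G9", "G10"},               # F4
    {"G1", "G3", "G7", "G8", "G10"},   # F5
]
genes = sorted(set().union(*subsets), key=lambda g: int(g[1:]))  # Phi, in gene order
# One case per wrapper run: Yi = 'present' if gene Gi belongs to that subset.
cases = [{g: ("present" if g in F else "absent") for g in genes} for F in subsets]
for case in cases:
    print(case)
```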
27
Abduction Phase
Secondly, an abduction algorithm is applied to obtain the T most probable configurations: {y*1, ..., y*T}.
For each configuration y*i = (y1, ..., yp), we get a new feature subset Hi containing the variables in the 'present' state.
At the end, we have the T subsets most probable to have been selected in the Wrapper Phase, {H1, ..., HT}.
The final classification subset is the one that minimizes the average of the −logarithm of the likelihood of the class over the complete training data set, using a GNB model and a leave-one-out methodology.
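A sketch of this final selection step: each candidate subset is scored by the leave-one-out average negative log-likelihood of the true class under a Gaussian Naïve Bayes model, and the subset with the lowest score is kept. scikit-learn's GaussianNB stands in here for the paper's GNB model:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import LeaveOneOut

def loo_neg_log_likelihood(X, y, subset):
    """Average -log P(true class | x) over leave-one-out folds, using only `subset` genes."""
    nll = []
    for train, test in LeaveOneOut().split(X):
        model = GaussianNB().fit(X[np.ix_(train, subset)], y[train])
        proba = model.predict_proba(X[np.ix_(test, subset)])[0]
        cls = list(model.classes_).index(y[test][0])
        nll.append(-np.log(proba[cls] + 1e-12))   # guard against log(0)
    return float(np.mean(nll))

def choose_final_subset(X, y, candidates):
    """Return the candidate subset (list of gene indices) with the smallest LOO -log likelihood."""
    return min(candidates, key=lambda H: loo_neg_log_likelihood(X, y, H))
```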
28
Abduction Phase: An Example
So, for example, we obtain these two configurations as the most probable ones:
From these two configurations, we get the two feature subsets most probable to have been selected by the wrapper algorithm. In this case they are:
- H1 = {G1,G3,G7,G8}
- H2 = {G1,G2,G3,G5,G7,G8}
29
DLBCL Classification
The data set used in this paper is the one in Rosenwald et al (2002). This data set contains 8503 genes and the following cases:
– 134 cases of GCB.
– 83 cases of ABC.
– 57 cases of Type III.
The data set was randomly divided into a training and a testing group in the following way:
– Training Data Set: 67 GCB cases + 42 ABC cases.
– Testing Data Set: 67 GCB + 42 ABC + 57 Type III cases.
The classifier accuracy estimation is obtained through the evaluation of the classifier on 10 different divisions of this data.
30
DLBCL Classification
The parameters for the implementation of the three phases are:
– A gene X is considered correlated with a gene Y if the lower limit of the confidence interval for their correlation coefficient is greater than 0.15 (see the sketch after this list).
– The KFC validation procedures were carried out with K = 10.
– M = 3 × 10 = 30 candidate feature subsets were obtained in the Wrapper Phase.
– The 20 most probable explanations were evaluated in the Abduction Phase.
– Samples with a probability lower than 80% of belonging to either subgroup were termed 'Unclassified'.
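The correlation criterion in the first parameter above can be checked with a Fisher z confidence interval for the Pearson correlation. A minimal sketch for a single pair of expression vectors (the paper applies the criterion within each subclass), assuming a 95% interval and the 0.15 threshold:

```python
import numpy as np
from scipy.stats import pearsonr, norm

def correlated(x, y, lower_limit=0.15, confidence=0.95):
    """Declare two genes correlated if the lower bound of the Fisher-z confidence
    interval for their Pearson correlation exceeds `lower_limit`."""
    r, _ = pearsonr(x, y)
    n = len(x)
    z = np.arctanh(r)                                    # Fisher z-transform of r
    half_width = norm.ppf(0.5 + confidence / 2) / np.sqrt(n - 3)
    lower = np.tanh(z - half_width)                      # back-transform the lower bound
    return lower > lower_limit
```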
31
Results I
 Anova Phase: (95% confidence intervals)
– Size (number of genes): [74.3, 83.1]
– Train accuracy rate (%): [96.8, 98.6]
– Test accuracy rate (%): [92.8, 95.4]
– Test −log likelihood: [0.38, 0.68]
– Type III test −log likelihood: 'Infinity'
(Table: model prediction versus DLBCL subgroup for the training and validation sets.)
32
Results I: Conclusions
The Anova Phase selects only 1% of all the genes (78 genes out of 8503), and the Gaussian Naïve Bayes classifier achieves 94.1% accuracy with these genes.
In this case, there are several Type III cases to which the classifier assigns extreme probabilities.
The classifier leaves only 2.5% of the cases unclassified. If the predictor classifies a sample, it is correct in 95.4% of the cases.
33
Results II
 Anova Phase + Wrapper (search) Phase: (95% confidence intervals)
– Size (number of genes): [6.17, 7.82]
– Train accuracy rate (%): [95.2, 98.0]
– Test accuracy rate (%): [88.83, 91.9]
– Test −log likelihood: [0.25, 0.37]
– Type III test −log likelihood: [4.12, 5.08]
(Table: model prediction versus DLBCL subgroup for the training and validation sets.)
34
Results II: Conclusions
The number of genes is reduced to about 10%, from 78 to 7 genes.
The average −log likelihood is reduced to half of that obtained in the Anova Phase. Therefore, this is a better classifier.
For the Type III class, the predictor does not assign full probability of belonging to a class to any of its samples.
The classifier leaves 9.1% of the cases unclassified. If the predictor classifies a sample, it is correct in 93.2% of the cases.
In Wright et al., 9.2% of the cases are unclassified and, if the predictor classifies a sample, it is correct in 96.9% of the cases. But their evaluation is done on a single partition of the data set, so this percentage is not very reliable. In addition, some evaluations of our classifier achieved better accuracy.
On the other hand, we substantially reduced the number of genes in all the cases. We obtain similar results with 7 genes, compared with the 27 of Wright et al.
35
Conclusions
We obtain a simple classification method that provides good results.
We have developed a new feature subset selection method that is very robust for data sets with many more features than cases.
The use of an abduction process provides several candidates that summarize the distinct runs of the wrapper algorithm.
Three genes (LMO2, BCL6 and CCND2) are selected several times by our classification model and are among the 6 genes selected by Lossos et al., using medical information, to predict survival in DLBCL.
36
Future work
 Develop more sophisticated models:
– Include replacement variables to manage missing data.
– Consider multidimensional Gaussian distributions.
– Improve the MTE Gaussian Naïve Bayes model.
 Apply this model to other data sets such as breast cancer, colon cancer, ...
 Compare with other models with discrete variables.