Advanced Computing Laboratory
Electrical and Computer Engineering
Seoul National University
Taehoon Lee and Sungroh Yoon
Boosted Categorical Restricted Boltzmann Machine
for Computational Prediction of Splice Junctions
• Motivation
• Preliminary
• Boosted contrastive divergence
• Categorical restricted Boltzmann machine
• Experiment results
• Conclusion
Outline
2/25
• Deep Neural Networks (DNNs) show human-level performance on many
recognition tasks.
• We focus on class-imbalanced prediction.
• Insufficient samples to represent the true distribution of a class.
• Q. How can we learn minor but important features using neural networks?
• We propose a new RBM training method called boosted CD.
• We also devise a regularization term for the sparsity of DNA sequences.
Motivation
[Figure: negative and positive examples; query images that are easy to misclassify]
3/25
• Genetic information flows through the gene expression process.
• DNA: a sequence of four types of nucleotides (A, G, T, C).
• Gene: a segment of DNA (the basic unit of heredity).
(Splice) Junction Prediction: An Extremely Class-Imbalanced Problem
[Figure: gene expression (DNA → RNA → protein); the exon/intron structure of a gene; an example DNA sequence in which GT dinucleotides mark true and false boundaries; of the 76M candidate GT (or AG) sites, only about 160K (0.21%) are true splice sites; a toy scan for candidate GT sites follows below]
4/25
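To make the notion of candidate sites concrete: every GT (or AG) dinucleotide is a potential boundary, and a fixed-length window around it becomes one example for the classifier. The sketch below is only an illustration under that assumption, not the authors' data pipeline; the window size and scanning rule are hypothetical.

```python
# Toy scan (illustration only): treat every GT dinucleotide as a candidate donor
# boundary and extract a fixed-length window of sequence context around it.
def candidate_donor_windows(seq, flank=10):
    windows = []
    for i in range(len(seq) - 1):
        if seq[i:i + 2] == "GT":                       # candidate donor site
            window = seq[max(0, i - flank):i + flank]  # context around the site
            windows.append((i, window))
    return windows

seq = "ACGTCGACTGCTACGTAGCAGCGATACGTACCGATCATCACTATCATCGAGGTACGATCG"
print(candidate_donor_windows(seq)[:3])  # only a tiny fraction of such sites are true junctions
```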
• Two approaches:
• Machine learning-based:
• ANN (Stormo et al., 1982; Noordewier et al., 1990; Brunak et al., 1991),
• SVM (Degroeve et al., 2005; Huang et al., 2006; Sonnenburg et al., 2007),
• HMM (Reese et al., 1997; Pertea et al., 2001; Baten et al., 2006).
• Sequence alignment-based:
• TopHat (Trapnell et al., 2010), MapSplice (Wang et al., 2010),
RUM (Grant et al., 2011).
Previous Work on Junction Prediction
We want to construct a learning model that can boost prediction
performance in a way complementary to alignment-based methods.
We propose a learning model based on (multilayer) RBMs
and its training scheme.
5/25
• Training methods of RBM (compared in the table below).
• RBM for categorical values:
• Softmax input units (Salakhutdinov et al., ICML 2007).
• Class-imbalance problems:
• Refer to a review by Galar et al. (IEEE Trans. SMC 2012).
Related Methodologies

| Method | Description | Training cost | Noise handling | Class-imbalance handling |
| CD (Hinton, Neural Comp. 2002) | Standard and widely used | - | - | - |
| Persistent CD (Tieleman, ICML 2008) | Use of a single Markov chain | | | |
| Parallel tempering (Cho et al., IJCNN 2010) | Simultaneous generation of Markov chains | | | |

6/25
Main Contributions
• A new RBM training method called boosted CD
• A new penalty term to handle the sparsity of DNA sequences
• Significant boosts in splicing prediction performance
• Robustness to high-dimensional class-imbalanced data
• The ability to detect subtle non-canonical splicing signals
7/25
• Motivation
• Preliminary
• Boosted contrastive divergence
• Categorical restricted Boltzmann machine
• Experiment results
• Conclusion
Outline
8/25
• RBM is a type of logistic belief network whose structure is a bipartite graph.
• Nodes:
• Input layer: visible units 𝒗 ∈ {0, 1}^D
• Hidden layer: hidden units 𝒉 ∈ {0, 1}^F
• Probability of a configuration (𝒗, 𝒉):
• P(𝒗, 𝒉) = exp(−E(𝒗, 𝒉)) / Z, where Z is the partition function
• E(𝒗, 𝒉) = −𝒃ᵀ𝒗 − 𝒄ᵀ𝒉 − 𝒗ᵀW𝒉
• Each node is a stochastic binary unit:
• p(h_j = 1 | 𝒗) = σ(c_j + Σ_i v_i w_ij) and p(v_i = 1 | 𝒉) = σ(b_i + Σ_j w_ij h_j)
• The hidden activation p(𝒉 | 𝒗) can be used as a feature (a minimal numerical sketch follows below).
Restricted Boltzmann Machines
9/25
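To make the definitions above concrete, here is a minimal NumPy sketch (not taken from the slides; the toy sizes D = 8 and F = 4 and all parameter values are assumptions) of the energy function and the conditional distributions of a binary RBM.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

D, F = 8, 4                                 # visible / hidden layer sizes (toy values)
rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((D, F))      # visible-hidden weights
b = np.zeros(D)                             # visible biases
c = np.zeros(F)                             # hidden biases

def energy(v, h):
    # E(v, h) = -b'v - c'h - v'Wh
    return -(b @ v) - (c @ h) - (v @ W @ h)

def p_h_given_v(v):
    # p(h_j = 1 | v) = sigmoid(c_j + sum_i v_i w_ij)
    return sigmoid(c + v @ W)

def p_v_given_h(h):
    # p(v_i = 1 | h) = sigmoid(b_i + sum_j w_ij h_j)
    return sigmoid(b + W @ h)

v = rng.integers(0, 2, size=D).astype(float)          # a random binary visible vector
h = (rng.random(F) < p_h_given_v(v)).astype(float)    # sample the hidden units
print(energy(v, h), p_h_given_v(v))                   # hidden activations can serve as features
```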
• Training weights to minimize the negative log-likelihood of the data.
• Run the MCMC chain 𝒗^(0), 𝒗^(1), …, 𝒗^(k) for 𝑘 steps.
• The CD-𝑘 update after seeing example 𝒗 approximates the model expectation by the k-step Markov chain:
• ΔW ∝ 𝒗^(0)(𝒉^(0))ᵀ − 𝒗^(k)(𝒉^(k))ᵀ (and analogously for the biases)
Contrastive Divergence (CD) for Training RBMs
[Figure: Gibbs chain 𝒗^(0) = 𝒗 → 𝒉^(0) → 𝒗^(1) → 𝒉^(1) → ⋯ → 𝒗^(k) → 𝒉^(k); a minimal code sketch follows below]
10/25
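A minimal NumPy sketch of the CD-k gradient estimate for a mini-batch (an illustration of the standard procedure, not the authors' code; mean-field probabilities are used for the reconstructions, a common practical choice).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k(V0, W, b, c, k=1, rng=np.random.default_rng(0)):
    """CD-k gradient estimate for a mini-batch V0 of shape (N, D)."""
    H0 = sigmoid(c + V0 @ W)                       # positive-phase hidden probabilities
    Vk, Hk = V0, H0
    for _ in range(k):                             # k steps of Gibbs sampling
        Hs = (rng.random(Hk.shape) < Hk).astype(float)   # sample h^(t)
        Vk = sigmoid(b + Hs @ W.T)                 # reconstruction v^(t+1)
        Hk = sigmoid(c + Vk @ W)                   # h^(t+1)
    N = V0.shape[0]
    dW = (V0.T @ H0 - Vk.T @ Hk) / N               # <v h>_data - <v h>_k
    db = (V0 - Vk).mean(axis=0)
    dc = (H0 - Hk).mean(axis=0)
    return dW, db, dc

# usage: W += lr * dW; b += lr * db; c += lr * dc, looped over mini-batches
```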
• Motivation
• Preliminary
• Boosted contrastive divergence
• Categorical restricted Boltzmann machine
• Experiment results
• Conclusion
Outline
11/25
Overview of Proposed Methodology
12/25
• Boosting is a meta-algorithm which converts weak learners to strong ones.
• Most boosting algorithms consist of iteratively learning weak classifiers with
respect to a distribution and adding them to a final strong classifier.
• The main variation among boosting algorithms:
• The method of weighting training data points and hypotheses (a generic reweighting sketch follows below).
• AdaBoost, LPBoost, TotalBoost, …
What Boosting Is
(from lecture notes, UC Irvine CS 271, Fall 2007)
13/25
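As a generic illustration of such reweighting (an AdaBoost-style step, not the splice-junction method of these slides), the update below up-weights the samples that the current weak learner misclassifies.

```python
import numpy as np

def adaboost_weight_update(w, y_true, y_pred):
    """One AdaBoost-style reweighting step; labels are in {-1, +1}."""
    miss = (y_true != y_pred).astype(float)
    err = np.sum(w * miss) / np.sum(w)                 # weighted error of the weak learner
    alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))
    w = w * np.exp(alpha * (2.0 * miss - 1.0))         # up-weight misclassified samples
    return w / w.sum(), alpha                          # normalized weights, learner weight

w = np.ones(6) / 6
y_true = np.array([1, 1, -1, -1, 1, -1])
y_pred = np.array([1, -1, -1, -1, 1, 1])               # two mistakes
w, alpha = adaboost_weight_update(w, y_true, y_pred)
```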
• Contrastive divergence training is looped over all mini-batches and is known to be stable.
• However, for a class-imbalanced distribution, we need to assign higher weights to rare samples so that the Gibbs chains can jump to unseen examples.
Boosted Contrastive Divergence (1/2)
[Figure: a data distribution with hardly observed regions; assign higher weights to rare samples and lower weights to ordinary samples]
14/25
• If we assign the same weight to all the data, the performance of Gibbs sampling degrades in the regions that are hardly observed.
• Therefore, whenever we sample, we re-weight each observation by the energy of its reconstruction, 𝐸(𝒗_n^(k), 𝒉_n^(k)) (a minimal sketch of this reweighting follows below).
Boosted Contrastive Divergence (2/2)
[Figure: relative locations of samples and the corresponding Markov chains by CD, by PT, and by the proposed method, with the hardly observed regions marked]
15/25
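A minimal sketch of the reweighting idea. Assumptions not stated on this slide: the weights are taken proportional to a softmax over the reconstruction energies within a mini-batch, and they scale the usual CD statistics; the exact weighting and normalization in the paper may differ. Samples whose reconstructions have high energy (poorly modeled, typically the rare ones) then contribute more to the update.

```python
import numpy as np

def reconstruction_energy(V, H, W, b, c):
    # E(v, h) = -b'v - c'h - v'Wh, evaluated row-wise for a mini-batch
    return -(V @ b) - (H @ c) - np.einsum('nd,df,nf->n', V, W, H)

def boosted_weights(Vk, Hk, W, b, c):
    """Assumed scheme: softmax over reconstruction energies, so poorly
    reconstructed (rare) samples receive larger weights."""
    e = reconstruction_energy(Vk, Hk, W, b, c)
    e = e - e.max()                                    # numerical stability
    u = np.exp(e)
    return u / u.sum()

def weighted_cd_statistics(V0, H0, Vk, Hk, u):
    # weighted CD gradient: sum_n u_n * (v_n^(0) h_n^(0)' - v_n^(k) h_n^(k)')
    dW = (V0 * u[:, None]).T @ H0 - (Vk * u[:, None]).T @ Hk
    db = u @ (V0 - Vk)
    dc = u @ (H0 - Hk)
    return dW, db, dc
```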
• For biological sequences, 1-hot encoding is widely used (Baldi & Brunak, 2001).
• A, C, G, and T are encoded by 1000, 0100, 0010, and 0001, respectively.
• In the encoded binary vectors, 75% of the elements are zero.
• To resolve the sparsity of 1-hot encoding vectors, we devise a new regularization technique that incorporates prior knowledge on the sparsity (a minimal sketch follows below).
Categorical Gradient
[Figure: the objective with the added sparsity term, the gradient term derived from it, and reconstructions with and w/o the sparsity term]
16/25
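A minimal sketch of the 1-hot encoding and of one possible form of such a penalty: a quadratic term that pushes each 4-unit group of the reconstruction to sum to one, i.e. toward a valid categorical distribution over {A, C, G, T}. The exact penalty and its gradient in the paper's categorical RBM may differ; the form and the coefficient lam below are assumptions for illustration.

```python
import numpy as np

NUC = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def one_hot(seq):
    """Encode a DNA string as a flat 1-hot vector of length 4 * len(seq)."""
    x = np.zeros((len(seq), 4))
    for i, ch in enumerate(seq):
        x[i, NUC[ch]] = 1.0          # exactly one of the four units is on per position
    return x.ravel()

def categorical_penalty_grad(v_recon, lam=0.1):
    """Assumed penalty: lam/2 * sum_pos (sum of the 4 reconstructed probs - 1)^2.
    Returns the penalty value and its gradient w.r.t. the reconstruction."""
    groups = v_recon.reshape(-1, 4)
    residual = groups.sum(axis=1) - 1.0              # deviation from "sums to one"
    penalty = 0.5 * lam * np.sum(residual ** 2)
    grad = lam * np.repeat(residual, 4)              # same gradient for each unit in a group
    return penalty, grad

v = one_hot("GTAAGT")                                # a donor-like 6-mer, 24-dimensional
penalty, grad = categorical_penalty_grad(np.clip(v + 0.1, 0.0, 1.0))
```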
Proposed Training Algorithm
[Algorithm: pseudocode combining the boosted CD reweighting with the categorical gradient]
17/25
• Motivation
• Preliminary
• Boosted contrastive divergence
• Categorical restricted Boltzmann machine
• Experiment results
• Conclusion
Outline
18/25
• Data preparation:
• Real human DNA sequences with known boundary information.
• GWH dataset: 2-class (boundary or not).
• UCSC dataset: 3-class (acceptor, donor, or non-boundary).
Results
Three sets of experiments: effects of the categorical gradient, effects of boosting, and effects on the splicing prediction.
[Example sequence: CGTAGCAGCGATACGTACCGATCGTCACTATCATCGAGGTACGAGAGATCGATCGGCAACG, annotated with true acceptor 1, true donor 1, true acceptor 2, a non-canonical true donor, false donor 1, and false acceptor 1]
19/25
• The proposed method shows the best performance in terms of reconstruction error for both training and testing.
• Compared to the softmax approach, the proposed regularized RBM achieves lower error by slightly sacrificing the probability-sum constraint.
Results: Effects of Categorical Gradient
Data: chromosome 19 in GWH-donor
Sequence length: 200 nt (800 dimensions)
# of iterations: 500
Learning rate: 0.1
L2 decay: 0.001
[Plot: training and testing reconstruction error curves, with one curve annotated "over-fitted" and the best curve marked]
20/25
• To simulate a class-imbalanced situation, we randomly dropped samples with different drop rates for different classes (a sketch follows below).
Results: Effects of Boosting

| Method | Description | Training cost | Noise handling | Class-imbalance handling |
| CD (Hinton, Neural Comp. 2002) | Standard and widely used | - | - | - |
| Persistent CD (Tieleman, ICML 2008) | Use of a single Markov chain | | | |
| Parallel tempering (Cho et al., IJCNN 2010) | Simultaneous generation of Markov chains | | | |
| Proposed boosted CD | Reweighting samples | | | |
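A small sketch of the described subsampling (the drop rates and class labels below are hypothetical, chosen only to illustrate the procedure): each sample of class c is kept with a class-specific probability, which makes one class artificially rare.

```python
import numpy as np

def subsample_by_class(X, y, keep_rate, rng=np.random.default_rng(0)):
    """Keep each sample of class c with probability keep_rate[c]."""
    keep = rng.random(len(y)) < np.array([keep_rate[c] for c in y])
    return X[keep], y[keep]

# hypothetical setting: keep all of class 0, keep only 5% of class 1 (the rare class)
X = np.random.default_rng(1).standard_normal((1000, 8))
y = np.array([0] * 900 + [1] * 100)
Xs, ys = subsample_by_class(X, y, keep_rate={0: 1.0, 1: 0.05})
print(np.bincount(ys))          # class 1 becomes much rarer than in the original data
```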
Results: Improved Performance and Robustness
[Figures: 2-class classification performance, 3-class classification performance, runtime, insensitivity to sequence lengths, and robustness to negative samples]
22/25
• (Important biological finding) Non-canonical splicing can arise if:
• Introns contain GCA or NAA sequences at their boundaries.
• Exons include contiguous A's around the boundaries.
Results: Identification of Non-Canonical Splice Sites
We used 162,951 examples excluding canonical splice sites.
[Figure: learned motifs around exon/intron boundaries]
23/25
• We proposed a new RBM training method, called boosted CD with categorical gradients, that improves conventional CD for class-imbalanced data.
• Significant boosts in splicing prediction in terms of accuracy and runtime.
• Increased robustness to high-dimensional class-imbalanced data.
• The proposed scheme can detect subtle non-canonical splicing signals that often cannot be identified by traditional methods.
• Future work: additional validation using various class-imbalanced datasets.
24/25
Conclusion
• Our lab members
• Financial support
• ICML 2015 travel scholarship
Acknowledgements
June 2, 2015
25/25
• The proposed DBN showed xx% higher performance in terms of the F1-score.
• RNNs are appropriate for sequence modeling. However, splicing signals often lie far from the boundaries, which makes it hard for an RNN to maintain the splicing information.
Backup: Comparison with Recurrent Neural Networks (RNNs)
To be placed
Backup/25