Advanced Computing Laboratory
Electrical and Computer Engineering
Seoul National University
Taehoon Lee and Sungroh Yoon
Boosted Categorical Restricted Boltzmann Machine
for Computational Prediction of Splice Junctions
• Motivation
• Preliminary
• Boosted contrastive divergence
• Categorical restricted Boltzmann machine
• Experiment results
• Conclusion
Outline
2/25
• Deep Neural Networks (DNNs) show human-level performance on many
recognition tasks.
• We focus on class-imbalanced prediction.
• Insufficient samples to represent the true distribution of a class.
• Q. How can we learn minor but important features using neural networks?
• We propose a new RBM training method called boosted CD.
• We also devise a regularization term for the sparsity of DNA sequences.
Motivation
[Figure: negative and positive examples; query images that are easy to misclassify]
3/25
• Genetic information flows through the gene expression process.
• DNA: a sequence of four types of nucleotides (A, G, T, C).
• Gene: a segment of DNA (the basic unit of heredity).
(Splice) Junction Prediction: An Extremely Class-Imbalanced Problem
[Figure: gene expression (DNA → RNA → protein); the exon/intron structure of a gene; an example DNA sequence in which GT dinucleotides mark true and false boundaries; of the 76M candidate GT (or AG) sites, only about 160K (0.21%) are true splice sites; a toy scan for candidate GT sites follows below]
4/25
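To make the notion of candidate sites concrete: every GT (or AG) dinucleotide is a potential boundary, and a fixed-length window around it becomes one example for the classifier. The sketch below is only an illustration under that assumption, not the authors' data pipeline; the window size and scanning rule are hypothetical.

```python
# Toy scan (illustration only): treat every GT dinucleotide as a candidate donor
# boundary and extract a fixed-length window of sequence context around it.
def candidate_donor_windows(seq, flank=10):
    windows = []
    for i in range(len(seq) - 1):
        if seq[i:i + 2] == "GT":                       # candidate donor site
            window = seq[max(0, i - flank):i + flank]  # context around the site
            windows.append((i, window))
    return windows

seq = "ACGTCGACTGCTACGTAGCAGCGATACGTACCGATCATCACTATCATCGAGGTACGATCG"
print(candidate_donor_windows(seq)[:3])  # only a tiny fraction of such sites are true junctions
```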
• Two approaches:
• Machine learning-based:
• ANN (Stormo et al., 1982; Noordewier et al., 1990; Brunak et al., 1991),
• SVM (Degroeve et al., 2005; Huang et al., 2006; Sonnenburg et al., 2007),
• HMM (Reese et al., 1997; Pertea et al., 2001; Baten et al., 2006).
• Sequence alignment-based:
• TopHat (Trapnell et al., 2010), MapSplice (Wang et al., 2010),
RUM (Grant et al., 2011).
Previous Work on Junction Prediction
We want to construct a learning model that can boost prediction
performance in a way complementary to alignment-based methods.
We propose a learning model based on (multilayer) RBMs
and its training scheme.
5/25
• Training methods of RBM (compared in the table below).
• RBM for categorical values:
• Softmax input units (Salakhutdinov et al., ICML 2007).
• Class-imbalance problems:
• Refer to a review by Galar et al. (IEEE Trans. SMC 2012).
Related Methodologies

| Method | Description | Training cost | Noise handling | Class-imbalance handling |
| CD (Hinton, Neural Comp. 2002) | Standard and widely used | - | - | - |
| Persistent CD (Tieleman, ICML 2008) | Use of a single Markov chain | | | |
| Parallel tempering (Cho et al., IJCNN 2010) | Simultaneous generation of Markov chains | | | |

6/25
Main Contributions
• A new RBM training method called boosted CD
• A new penalty term to handle the sparsity of DNA sequences
• Significant boosts in splicing prediction performance
• Robustness to high-dimensional class-imbalanced data
• The ability to detect subtle non-canonical splicing signals
7/25
• Motivation
• Preliminary
• Boosted contrastive divergence
• Categorical restricted Boltzmann machine
• Experiment results
• Conclusion
Outline
8/25
• RBM is a type of logistic belief network whose structure is a bipartite graph.
• Nodes:
• Input layer: visible units 𝒗 ∈ {0, 1}^D
• Hidden layer: hidden units 𝒉 ∈ {0, 1}^F
• Probability of a configuration (𝒗, 𝒉):
• P(𝒗, 𝒉) = exp(−E(𝒗, 𝒉)) / Z, where Z is the partition function
• E(𝒗, 𝒉) = −𝒃ᵀ𝒗 − 𝒄ᵀ𝒉 − 𝒗ᵀW𝒉
• Each node is a stochastic binary unit:
• p(h_j = 1 | 𝒗) = σ(c_j + Σ_i v_i w_ij) and p(v_i = 1 | 𝒉) = σ(b_i + Σ_j w_ij h_j)
• The hidden activation p(𝒉 | 𝒗) can be used as a feature (a minimal numerical sketch follows below).
Restricted Boltzmann Machines
9/25
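To make the definitions above concrete, here is a minimal NumPy sketch (not taken from the slides; the toy sizes D = 8 and F = 4 and all parameter values are assumptions) of the energy function and the conditional distributions of a binary RBM.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

D, F = 8, 4                                 # visible / hidden layer sizes (toy values)
rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((D, F))      # visible-hidden weights
b = np.zeros(D)                             # visible biases
c = np.zeros(F)                             # hidden biases

def energy(v, h):
    # E(v, h) = -b'v - c'h - v'Wh
    return -(b @ v) - (c @ h) - (v @ W @ h)

def p_h_given_v(v):
    # p(h_j = 1 | v) = sigmoid(c_j + sum_i v_i w_ij)
    return sigmoid(c + v @ W)

def p_v_given_h(h):
    # p(v_i = 1 | h) = sigmoid(b_i + sum_j w_ij h_j)
    return sigmoid(b + W @ h)

v = rng.integers(0, 2, size=D).astype(float)          # a random binary visible vector
h = (rng.random(F) < p_h_given_v(v)).astype(float)    # sample the hidden units
print(energy(v, h), p_h_given_v(v))                   # hidden activations can serve as features
```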
• Training weights to minimize the negative log-likelihood of the data.
• Run the MCMC chain 𝒗^(0), 𝒗^(1), …, 𝒗^(k) for 𝑘 steps.
• The CD-𝑘 update after seeing example 𝒗 approximates the model expectation by the k-step Markov chain:
• ΔW ∝ 𝒗^(0)(𝒉^(0))ᵀ − 𝒗^(k)(𝒉^(k))ᵀ (and analogously for the biases)
Contrastive Divergence (CD) for Training RBMs
[Figure: Gibbs chain 𝒗^(0) = 𝒗 → 𝒉^(0) → 𝒗^(1) → 𝒉^(1) → ⋯ → 𝒗^(k) → 𝒉^(k); a minimal code sketch follows below]
10/25
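A minimal NumPy sketch of the CD-k gradient estimate for a mini-batch (an illustration of the standard procedure, not the authors' code; mean-field probabilities are used for the reconstructions, a common practical choice).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k(V0, W, b, c, k=1, rng=np.random.default_rng(0)):
    """CD-k gradient estimate for a mini-batch V0 of shape (N, D)."""
    H0 = sigmoid(c + V0 @ W)                       # positive-phase hidden probabilities
    Vk, Hk = V0, H0
    for _ in range(k):                             # k steps of Gibbs sampling
        Hs = (rng.random(Hk.shape) < Hk).astype(float)   # sample h^(t)
        Vk = sigmoid(b + Hs @ W.T)                 # reconstruction v^(t+1)
        Hk = sigmoid(c + Vk @ W)                   # h^(t+1)
    N = V0.shape[0]
    dW = (V0.T @ H0 - Vk.T @ Hk) / N               # <v h>_data - <v h>_k
    db = (V0 - Vk).mean(axis=0)
    dc = (H0 - Hk).mean(axis=0)
    return dW, db, dc

# usage: W += lr * dW; b += lr * db; c += lr * dc, looped over mini-batches
```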
• Motivation
• Preliminary
• Boosted contrastive divergence
• Categorical restricted Boltzmann machine
• Experiment results
• Conclusion
Outline
11/25
Overview of Proposed Methodology
12/25
• Boosting is a meta-algorithm which converts weak learners to strong ones.
• Most boosting algorithms consist of iteratively learning weak classifiers with
respect to a distribution and adding them to a final strong classifier.
• The main variation among boosting algorithms:
• The method of weighting training data points and hypotheses (a generic reweighting sketch follows below).
• AdaBoost, LPBoost, TotalBoost, …
What Boosting Is
(from lecture notes, UC Irvine CS 271, Fall 2007)
13/25
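As a generic illustration of such reweighting (an AdaBoost-style step, not the splice-junction method of these slides), the update below up-weights the samples that the current weak learner misclassifies.

```python
import numpy as np

def adaboost_weight_update(w, y_true, y_pred):
    """One AdaBoost-style reweighting step; labels are in {-1, +1}."""
    miss = (y_true != y_pred).astype(float)
    err = np.sum(w * miss) / np.sum(w)                 # weighted error of the weak learner
    alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))
    w = w * np.exp(alpha * (2.0 * miss - 1.0))         # up-weight misclassified samples
    return w / w.sum(), alpha                          # normalized weights, learner weight

w = np.ones(6) / 6
y_true = np.array([1, 1, -1, -1, 1, -1])
y_pred = np.array([1, -1, -1, -1, 1, 1])               # two mistakes
w, alpha = adaboost_weight_update(w, y_true, y_pred)
```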
• Contrastive divergence training is looped over all mini-batches and is known to be stable.
• However, for a class-imbalanced distribution, we need to assign higher weights to rare samples so that the Gibbs chains can jump to unseen examples.
Boosted Contrastive Divergence (1/2)
[Figure: a data distribution with hardly observed regions; assign higher weights to rare samples and lower weights to ordinary samples]
14/25
• If we assign the same weight to all the data, the performance of Gibbs sampling degrades in the regions that are hardly observed.
• Therefore, whenever we sample, we re-weight each observation by the energy of its reconstruction, 𝐸(𝒗_n^(k), 𝒉_n^(k)) (a minimal sketch of this reweighting follows below).
Boosted Contrastive Divergence (2/2)
[Figure: relative locations of samples and the corresponding Markov chains by CD, by PT, and by the proposed method, with the hardly observed regions marked]
15/25
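A minimal sketch of the reweighting idea. Assumptions not stated on this slide: the weights are taken proportional to a softmax over the reconstruction energies within a mini-batch, and they scale the usual CD statistics; the exact weighting and normalization in the paper may differ. Samples whose reconstructions have high energy (poorly modeled, typically the rare ones) then contribute more to the update.

```python
import numpy as np

def reconstruction_energy(V, H, W, b, c):
    # E(v, h) = -b'v - c'h - v'Wh, evaluated row-wise for a mini-batch
    return -(V @ b) - (H @ c) - np.einsum('nd,df,nf->n', V, W, H)

def boosted_weights(Vk, Hk, W, b, c):
    """Assumed scheme: softmax over reconstruction energies, so poorly
    reconstructed (rare) samples receive larger weights."""
    e = reconstruction_energy(Vk, Hk, W, b, c)
    e = e - e.max()                                    # numerical stability
    u = np.exp(e)
    return u / u.sum()

def weighted_cd_statistics(V0, H0, Vk, Hk, u):
    # weighted CD gradient: sum_n u_n * (v_n^(0) h_n^(0)' - v_n^(k) h_n^(k)')
    dW = (V0 * u[:, None]).T @ H0 - (Vk * u[:, None]).T @ Hk
    db = u @ (V0 - Vk)
    dc = u @ (H0 - Hk)
    return dW, db, dc
```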
• For biological sequences, 1-hot encoding is widely used (Baldi & Brunak, 2001).
• A, C, G, and T are encoded by 1000, 0100, 0010, and 0001, respectively.
• In the encoded binary vectors, 75% of the elements are zero.
• To resolve the sparsity of 1-hot encoding vectors, we devise a new regularization technique that incorporates prior knowledge on the sparsity (a minimal sketch follows below).
Categorical Gradient
[Figure: the objective with the added sparsity term, the gradient term derived from it, and reconstructions with and w/o the sparsity term]
16/25
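A minimal sketch of the 1-hot encoding and of one possible form of such a penalty: a quadratic term that pushes each 4-unit group of the reconstruction to sum to one, i.e. toward a valid categorical distribution over {A, C, G, T}. The exact penalty and its gradient in the paper's categorical RBM may differ; the form and the coefficient lam below are assumptions for illustration.

```python
import numpy as np

NUC = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def one_hot(seq):
    """Encode a DNA string as a flat 1-hot vector of length 4 * len(seq)."""
    x = np.zeros((len(seq), 4))
    for i, ch in enumerate(seq):
        x[i, NUC[ch]] = 1.0          # exactly one of the four units is on per position
    return x.ravel()

def categorical_penalty_grad(v_recon, lam=0.1):
    """Assumed penalty: lam/2 * sum_pos (sum of the 4 reconstructed probs - 1)^2.
    Returns the penalty value and its gradient w.r.t. the reconstruction."""
    groups = v_recon.reshape(-1, 4)
    residual = groups.sum(axis=1) - 1.0              # deviation from "sums to one"
    penalty = 0.5 * lam * np.sum(residual ** 2)
    grad = lam * np.repeat(residual, 4)              # same gradient for each unit in a group
    return penalty, grad

v = one_hot("GTAAGT")                                # a donor-like 6-mer, 24-dimensional
penalty, grad = categorical_penalty_grad(np.clip(v + 0.1, 0.0, 1.0))
```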
Proposed Training Algorithm
[Algorithm: pseudocode combining the boosted CD reweighting with the categorical gradient]
17/25
• Motivation
• Preliminary
• Boosted contrastive divergence
• Categorical restricted Boltzmann machine
• Experiment results
• Conclusion
Outline
18/25
• Data preparation:
• Real human DNA sequences with known boundary information.
• GWH dataset: 2-class (boundary or not).
• UCSC dataset: 3-class (acceptor, donor, or non-boundary).
Results
Three sets of experiments: effects of the categorical gradient, effects of boosting, and effects on the splicing prediction.
[Example sequence: CGTAGCAGCGATACGTACCGATCGTCACTATCATCGAGGTACGAGAGATCGATCGGCAACG, annotated with true acceptor 1, true donor 1, true acceptor 2, a non-canonical true donor, false donor 1, and false acceptor 1]
19/25
• The proposed method shows the best performance in terms of reconstruction error for both training and testing.
• Compared to the softmax approach, the proposed regularized RBM achieves lower error by slightly sacrificing the probability-sum constraint.
Results: Effects of Categorical Gradient
Data: chromosome 19 in GWH-donor
Sequence length: 200 nt (800 dimensions)
# of iterations: 500
Learning rate: 0.1
L2 decay: 0.001
[Plot: training and testing reconstruction error curves, with one curve annotated "over-fitted" and the best curve marked]
20/25
• To simulate a class-imbalanced situation, we randomly dropped samples with different drop rates for different classes (a sketch follows below).
Results: Effects of Boosting

| Method | Description | Training cost | Noise handling | Class-imbalance handling |
| CD (Hinton, Neural Comp. 2002) | Standard and widely used | - | - | - |
| Persistent CD (Tieleman, ICML 2008) | Use of a single Markov chain | | | |
| Parallel tempering (Cho et al., IJCNN 2010) | Simultaneous generation of Markov chains | | | |
| Proposed boosted CD | Reweighting samples | | | |
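A small sketch of the described subsampling (the drop rates and class labels below are hypothetical, chosen only to illustrate the procedure): each sample of class c is kept with a class-specific probability, which makes one class artificially rare.

```python
import numpy as np

def subsample_by_class(X, y, keep_rate, rng=np.random.default_rng(0)):
    """Keep each sample of class c with probability keep_rate[c]."""
    keep = rng.random(len(y)) < np.array([keep_rate[c] for c in y])
    return X[keep], y[keep]

# hypothetical setting: keep all of class 0, keep only 5% of class 1 (the rare class)
X = np.random.default_rng(1).standard_normal((1000, 8))
y = np.array([0] * 900 + [1] * 100)
Xs, ys = subsample_by_class(X, y, keep_rate={0: 1.0, 1: 0.05})
print(np.bincount(ys))          # class 1 becomes much rarer than in the original data
```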
Results: Improved Performance and Robustness
[Figures: 2-class classification performance, 3-class classification performance, runtime, insensitivity to sequence lengths, and robustness to negative samples]
22/25
• (Important biological finding) Non-canonical splicing can arise if:
• Introns contain GCA or NAA sequences at their boundaries.
• Exons include contiguous A's around the boundaries.
Results: Identification of Non-Canonical Splice Sites
We used 162,951 examples excluding canonical splice sites.
[Figure: learned motifs around exon/intron boundaries]
23/25
• We proposed a new RBM training method, called boosted CD with categorical gradients, that improves conventional CD for class-imbalanced data.
• Significant boosts in splicing prediction in terms of accuracy and runtime.
• Increased robustness to high-dimensional class-imbalanced data.
• The proposed scheme can detect subtle non-canonical splicing signals that often cannot be identified by traditional methods.
• Future work: additional validation using various class-imbalanced datasets.
24/25
Conclusion
• Our lab members
• Financial support
• ICML 2015 travel scholarship
Acknowledgements
June 2, 2015
25/25
• The proposed DBN showed xx% higher performance in terms of the F1-score.
• RNNs are appropriate for sequence modeling. However, splicing signals often lie far from the boundaries, which makes it hard for an RNN to maintain the splicing information.
Backup: Comparison with Recurrent Neural Networks (RNNs)
To be placed
Backup/25