MicroRNA Detection
Khan Shing
CS374
May 8, 2008
Source: Science 2 September 2005: Vol. 30
Outline
• Biological background
◦ Gene regulation
◦ microRNAs
• microRNA detection
◦ Random forests
◦ Comparative genomics
• microRNA target recognition
◦ Site accessibility
Information Flow
Source: http://en.wikipedia.org/wiki/Central_dogma_of_molecular_biology
Gene Regulation
• Transcriptional regulation
◦ Enhancers, promoters, transcription factors,
epigenetic modifications
• Post-transcriptional regulation
◦ mRNA processing, small RNAs
• Post-translational regulation
◦ Protein activation, inhibition, degradation
Source: Stark A. et al. 2007. Systematic discovery and characterization of
fly microRNAs using 12 Drosophila genomes. Genome Res.
microRNA
• RNA can fold like proteins: it has primary, secondary and tertiary structure
• Secondary hairpin structure is crucial to the processing of small RNAs
Source: Zamore, P.D. and Haley, B. 2005. Ribo-gnome: The big world of
small RNAs. Science 309: 1519–1524.
miRNA Processing
Source: Zamore, P.D. and Haley, B. 2005. Ribo-gnome: The big world of
small RNAs. Science 309: 1519–1524.
miRNA Processing
Source: Zamore, P.D. and Haley, B. 2005. Ribo-gnome: The big world of
small RNAs. Science 309: 1519–1524.
miRNAs Suppress Gene
Expression
microRNA Detection
Stark A. et al. 2007. Systematic discovery
and characterization of fly microRNAs
using 12 Drosophila genomes. Genome
Res. doi:10.1101/gr.6593807.
Source: Leo Breiman, Random Forests, Machine Learning, v.45 n.1, p.5-
32, October 1 2001.
microRNA Detection
• Machine learning approach
◦ Find characteristics that distinguish miRNAs
◦ Use these features to train a model
• Random forests
◦ Collection of many independently constructed
classification trees
◦ Each tree “votes” and the tallied votes yield a
score
Source: http://www.gmupolicy.net/its/incidentduration/image351.gif
How to Classify Objects?
How to Classify?
Training
Node B Node C
Source: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
Random Forest
N cases in training set, M input variables
• Sample N cases at random, with replacement, from
the original data. This sample will be the training set
for growing the tree.
• At each node, m variables (m << M) are selected at
random out of the M and the best split on these m is
used to split the node. The value of m is held constant
during the forest growing.
• Each tree is grown to the largest extent possible.
There is no pruning.
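The three steps above can be sketched in a few dozen lines of plain Python. This is a minimal illustration, not the implementation used by Breiman or by Stark et al.: the Gini impurity criterion, the function names (`grow_forest`, `forest_score`, and so on) and the toy feature values are all assumptions made for the example.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (assumed split criterion)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, feature_ids):
    """Best (feature, threshold) among the candidate features, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:  # keep only splits that reduce impurity
                best, best_score = (f, t), score
    return best


def grow_tree(rows, labels, m, rng):
    """Grow an unpruned tree; m of the M variables are sampled at every node."""
    if len(set(labels)) == 1:
        return labels[0]  # pure leaf
    feature_ids = rng.sample(range(len(rows[0])), m)
    split = best_split(rows, labels, feature_ids)
    if split is None:  # no useful split among the m sampled variables
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            grow_tree([r for r, _ in left], [y for _, y in left], m, rng),
            grow_tree([r for r, _ in right], [y for _, y in right], m, rng))


def classify(tree, row):
    """Pass one case down the tree; internal nodes are tuples, leaves are labels."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


def grow_forest(rows, labels, n_trees, m, seed=0):
    """Each tree is trained on N cases drawn with replacement from the N originals."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], m, rng))
    return forest


def forest_score(forest, row, positive=1):
    """Fraction of trees that vote for the positive class."""
    return sum(classify(tree, row) == positive for tree in forest) / len(forest)


# Toy data (hypothetical): three numeric features per hairpin, label 1 = miRNA.
rows = [(0.9, 0.8, 0.1), (0.8, 0.9, 0.7), (0.85, 0.7, 0.4),
        (0.2, 0.1, 0.5), (0.1, 0.3, 0.9), (0.3, 0.2, 0.2)]
labels = [1, 1, 1, 0, 0, 0]
forest = grow_forest(rows, labels, n_trees=25, m=2)
print(forest_score(forest, (0.95, 0.85, 0.3)))
```

The value of `m` is passed once to `grow_forest` and reused at every node, which matches the "held constant" requirement. Trees stop only at pure nodes or when none of the sampled variables reduces impurity, so no pruning step is needed.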
Source: http://www.jfsowa.com/figs/bintree.gif
Random Forest
• Trained on RFAM data set of 60 cloned
miRNAs and random negative set (250 putative
miRNA hairpins) with a variety of features
• Independently construct 500 trees
Source: CS262 Lecture 17, Win07, Batzoglou
Comparative Genomics
Source: Stark A. et al. 2007. Systematic discovery and characterization of
fly microRNAs using 12 Drosophila genomes. Genome Res.
Structural Features
Compare the 60 cloned miRNAs in the RFAM database to
random “miRNA like” hairpins (~760,000)
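According to the editor's notes for this slide, the candidate hairpins come from scanning the fly genome in 120 nt windows with a 90 nt overlap, then discarding hairpins shorter than 63 nt, with an arm shorter than 20 nt, or with less than 70% arm base-pairing. A minimal sketch of that scan-and-filter step follows. The `Hairpin` record and its field names are hypothetical, and the folding itself (done by an RNA secondary-structure predictor) is left out.

```python
from dataclasses import dataclass

WINDOW = 120             # window length in nt (from the notes)
OVERLAP = 90             # overlap between consecutive windows (from the notes)
STEP = WINDOW - OVERLAP  # so each window starts 30 nt after the previous one


def windows(sequence, size=WINDOW, step=STEP):
    """Yield (start, subsequence) for every full-length window."""
    for start in range(0, len(sequence) - size + 1, step):
        yield start, sequence[start:start + size]


@dataclass
class Hairpin:
    """Hypothetical summary of a predicted fold for one window."""
    length: int         # total hairpin length in nt
    shorter_arm: int    # length of the shorter arm in nt
    arm_pairing: float  # fraction of arm bases that are paired


def passes_filters(h):
    """Keep hairpins of >= 63 nt, with both arms >= 20 nt and >= 70% arm pairing."""
    return h.length >= 63 and h.shorter_arm >= 20 and h.arm_pairing >= 0.70


# A 180 nt toy sequence gives windows starting at 0, 30 and 60.
print([start for start, _ in windows("A" * 180)])
print(passes_filters(Hairpin(length=80, shorter_arm=25, arm_pairing=0.8)))
```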
Source: Stark A. et al. 2007. Systematic discovery and characterization of
fly microRNAs using 12 Drosophila genomes. Genome Res.
Conservation Features
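The editor's notes describe measuring conservation separately in sub-regions of each aligned hairpin (arms, loop and flanking sequence). The sketch below shows one simple way such a per-region profile could be computed. The scoring rule (fraction of alignment columns identical in every species) and the region boundaries are assumptions for illustration, not the measure used by Stark et al.

```python
def column_conserved(column):
    """True if every species has the same base and none has a gap."""
    return "-" not in column and len(set(column)) == 1


def region_conservation(alignment, regions):
    """Fraction of perfectly conserved columns in each named region.

    alignment: list of equal-length aligned sequences, one per species.
    regions:   dict mapping a region name to a half-open (start, end) column range.
    """
    columns = list(zip(*alignment))
    profile = {}
    for name, (start, end) in regions.items():
        cols = columns[start:end]
        profile[name] = sum(column_conserved(c) for c in cols) / len(cols)
    return profile


# Toy alignment of three species (hypothetical sequences and boundaries).
alignment = ["ACGUACGUAC",
             "ACGUACGAAC",
             "UCGUACGCAC"]
regions = {"flank": (0, 2), "arm": (2, 7), "loop": (7, 10)}
print(region_conservation(alignment, regions))
```

In this toy case the arm columns are identical in all three species while the flank and loop each contain a variable column, which is the qualitative pattern the conservation features are meant to capture.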
Source: Stark A. et al. 2007. Systematic discovery and characterization of
fly microRNAs using 12 Drosophila genomes. Genome Res.
Discovery and validation of new
miRNAs
Alone, no single feature provides enough
discriminatory power, but combined in the trained model
they give ~4500-fold enrichment
Discovery and validation of new
miRNAs
• Rank all 760,355 putative miRNAs
according to this combined score
• Finds 41 novel miRNA candidates
• Validate by sequencing and other
methods
Source: Stark A. et al. 2007. Systematic dis
Source: Stark A. et al. 2007. Systematic dis
Results
• Antisense strand miRNAs
• miRNA* sequences
Source: Stark A. et al. 2007. Systematic dis
Accurate Prediction of Mature
miRNAs
microRNA Target Recognition
Kertesz, M., Iovino, N., Unnerstall, U., Gaul,
U. & Segal, E. The role of site accessibility
in microRNA target recognition. Nat.
Genet. 39, 1278–1284 (2007).
Source: Kertesz, M., Iovino, N., Unnerstall, U., Gaul, U. & Segal, E. The
role of site accessibility in microRNA target recognition. Nat. Genet. 39,
Motivation for looking at site
accessibility
• Existing methods for
finding miRNA targets rely
mostly on sequence
specificity
• But miRNAs act as part of
a protein complex. They
have size and can be
blocked by mRNA
secondary structure
Source: Kertesz, M., Iovino, N., Unnerstall, U., Gaul, U. & Segal, E. The
role of site accessibility in microRNA target recognition. Nat. Genet. 39,
Proof of Principle
Source: Kertesz, M., Iovino, N., Unnerstall, U., Gaul, U. & Segal, E. The
role of site accessibility in microRNA target recognition. Nat. Genet. 39,
How to use this fact?
• Develop an energy-based score to rate miRNA-target
interactions
• ∆G – the free energy of a molecular
interaction
• ∆∆G – the difference between the free energy gained
when an miRNA binds to its target and the free energy
lost by unpairing the secondary structure of the mRNA
target sequence.
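As a worked example of the ∆∆G idea, the sketch below combines the two energy terms for two hypothetical target sites. The sign convention is an assumption chosen for readability: duplex formation is a negative (favourable) energy and the cost of opening the site is a positive number, so the net score is their sum. The energy values are invented for illustration.

```python
def ddg(dg_duplex, dg_open_cost):
    """Net energy score for one miRNA-target site, in kcal/mol.

    dg_duplex:    free energy of the miRNA-target duplex (negative = favourable).
    dg_open_cost: energy needed to unpair the target site (positive = costly).
    A more negative result means a more favourable predicted interaction.
    """
    return dg_duplex + dg_open_cost


# Two hypothetical sites with the same duplex energy but different accessibility.
sites = {"open_site": (-20.0, 3.0), "closed_site": (-20.0, 15.0)}
for name, (duplex, cost) in sites.items():
    print(name, ddg(duplex, cost))
```

Both sites pair equally well with the miRNA, yet the site buried in secondary structure ends up with a much weaker net score, which is the effect a sequence-only predictor would miss.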
Source: Kertesz, M., Iovino, N., Unnerstall, U., Gaul, U. & Segal, E. The
role of site accessibility in microRNA target recognition. Nat. Genet. 39,
Test how good ∆∆G is
Correlates well with
repression in
luciferase assays:
Even better if
flanking regions
are included:
Source: Kertesz, M., Iovino, N., Unnerstall, U., Gaul, U. & Segal, E. The
role of site accessibility in microRNA target recognition. Nat. Genet. 39,
Comparison to other target
predictors
References
Ruby J.G. et al. 2007. Evolution, biogenesis, expression, and target predictions of a substantially expanded set of
Drosophila microRNAs. Genome Res. doi:10.1101/gr.6597907
E. Berezikov, F. Thuemmler, L.W. van Laake, I. Kondova, R. Bontrop, E. Cuppen and R.H. Plasterk, Diversity of
microRNAs in human and chimpanzee brain, Nat. Genet. 38 (2006), pp. 1375–1377.
Source: Stark A. et al. 2007. Systematic discovery and characterization of
fly microRNAs using 12 Drosophila genomes. Genome Res.
Other figures

Editor's Notes

  • #3: From a special issue of Science magazine devoted to RNA in September 2005. Shows some of the complexity involved in RNA silencing and some of the relationships between siRNA and miRNA. In 2006, Andrew Fire and Craig Mello won the Nobel Prize in Physiology or Medicine for their work on RNAi.
  • #5: Information is stored in the DNA in genes. Genes are transcribed into mRNAs which are then translated into proteins that perform most biological activities. We’re discovering more and more that different types of regulation can occur at each level in this diagram and that different regulatory mechanisms can act in concert to produce combinatorial regulatory effects.
  • #6: Transcriptional regulation occurs at the DNA level, controlling which genes get transcribed by the transcriptional machinery. Enhancers, promoters and TFs can help recruit the transcriptional machinery to genes. Epigenetic modifications, i.e. changes to the DNA and chromatin structure (such as the nucleosome positioning we learned about earlier), can also control transcription of genes. Once transcribed, mRNAs need to be modified into their mature forms. Another class of RNAs, small RNAs ~20 nt long, can act here as well to prevent translation into proteins. These include siRNAs (Andy Fire received a Nobel Prize for the discovery of RNAi) and miRNAs, which are closely related in terms of processing and function. Once translated, proteins can still be regulated by modifications like phosphorylation that activate or inhibit them. Proteins can also be sequestered or degraded as needed. Our talk today will focus on the post-transcriptional regulation of mRNA translation by miRNAs.
  • #7: As we saw last time, RNAs can fold much like proteins. They have primary, secondary and tertiary structure, and their 3D structure is crucial to the function of structured RNAs like ribozymes and riboswitches. For miRNAs, the most important aspect is their secondary structure. The mature miRNA is formed from one arm of the stem of the characteristic RNA hairpin structure (highlighted in red here). Recent research is showing that miRNAs are a very important source of gene regulation. Computational estimates show that >30% of our genes may be regulated by miRNAs and 1-5% of animal genes may be miRNAs.
  • #8: The first paper I’ll be talking about today deals with finding microRNAs in the genome. A little background on how miRNAs are formed will be helpful for later. What we’ll be searching for are miRNA genes.
  • #9: After processing, we’re left with a pre-miRNA, which is then exported to the cytoplasm. There it undergoes more processing by Dicer, which clips off the loop, leaving us with only the stem containing the mature miRNA. This duplex is unpaired and the miRNA is inserted into a protein complex (the RNA-induced silencing complex, aka RISC). RISC is the key to the regulatory activities of miRNA.
  • #10: The miRNA targets complementary sequences on mRNAs, typically in 3’ UTRs, and “guides” RISC to the target sequence. One mRNA can, and typically does, have multiple miRNA target sites. Once bound, the complex acts through several mechanisms to silence expression: site-directed cleavage of the mRNA that leads to its degradation, or transport of the mRNA to areas where it can be sequestered or degraded.
  • #11: The first paper is about finding microRNAs. Up till now, most miRNAs were discovered through experimental approaches in the lab: sequencing small RNA libraries and cloning them. This is a difficult task because of the small size and relatively low expression of many miRNAs. For example, the authors find that there are 760,355 putative miRNA hairpins in the genome. When this paper was written, there were about 60 confirmed and cloned miRNAs found in Drosophila. This paper tries to develop computational methods for de novo discovery of miRNAs.
  • #12: The authors take a machine learning approach to miRNA discovery. They try to find a set of characteristics (variables) that allow for differentiation of true miRNAs from putative hairpins and then use these features to train classification trees. The classifier they use in this paper is a random forest, a term coined by Leo Breiman, statistics professor at Berkeley. A random forest is basically a collection of many classification trees. Breiman showed that using this technique, as the number of trees grows, the generalization error converges. We independently construct a lot of these classification trees. Each one “votes” and the tallied votes give us a score for the putative miRNA. In the original paper on random forests, it was shown that the forest error rate depends on two things. First, the correlation between any two trees in the forest: increasing the correlation increases the forest error rate. Second, the strength of each individual tree in the forest: a tree with a low error rate is a strong classifier, and increasing the strength of the individual trees decreases the forest error rate.
  • #13: The problem we need to solve is that we have a bunch of objects, in this case a lot of short RNA sequences, and we want to classify them, i.e. determine whether they are true miRNAs or not. To do this, we can use classification trees. Here is a quick example of one variation of a classification tree. When we train one, we start with a set of labeled training examples and a set of variables that convey certain characteristics. At each node, we try to split the training set using the variable (or variables) to create child nodes. The leaf nodes are the final classification. Once trained, to classify a new object, we start it at the root and pass it down through the tree, applying the test at each node. The terminal node where it ends up is its classification. In training, the ultimate goal is to separate out all the classes so that every terminal node consists of objects of only one class. I’ll try to explain pictorially, and a little more clearly, the method that the authors of this paper use to build their decision trees.
  • #14: Let’s try to illustrate this graphically. Let’s say, abstractly, that these stars represent objects. We don’t know their labels and we want to classify them as either red or blue.
  • #15: What we do is take a training set, where we know the labels, and use this set to train the model. We start off with the full data set, which is like starting off at the root node. We try to find a “good” way to split this set into two groups that are each mostly one class or the other. Remember that the goal is to have each node at the end contain only one class of objects. The hard part is figuring out what a “good” split is. We can take this split here: it looks pretty good, but not perfect. There are still mixed classes in each set. So what we do is create a child node for each set.
  • #16: Node C is complex and will probably need a lot more child nodes to separate its objects, but node B can be finished right here if we add this line to split the set. We can then make the terminal nodes D and E.
  • #17: A more formal statement of random forests.
  • #18: We independently construct each tree, and when we have lots of trees, we’ve got a forest. So, as before, each tree votes and the tallied votes determine the score of the putative miRNA. So why a forest instead of just using one tree constructed with all the features? The problem is that no single defining feature applies to 100% of the training set. For example, there are 60 confirmed miRNAs but any given feature may only apply to 70 or 80% of these. There is no black-and-white feature we can use to distinguish miRNAs from random RNA hairpins. Because of this, using the collective score from a forest of trees reduces the error rate in classification.
  • #19: Another concept the authors make use of is comparative genomics, a common way to find genes. This is based around the idea that functional regions of the genome are more tightly conserved than surrounding pseudogenes or non-functional regions. So if we compare sequences across several species, patterns should show up in the conservation profiles of different sequences. Differential conservation patterns give us a signal we can use to search the genome. This is especially good for miRNAs because previous studies have shown that they are very tightly conserved. This may be because of their critical functions in regulating genes as well as their dependence on sequence specificity to function. The figure is taken from Serafim’s lecture on gene finding. When we talk about conservation, we need to think about how we are defining conservation. In this case, we are aligning protein-coding genes, so we’re basically looking at conservation of proteins. Protein function is dependent on its structure, which in turn is dependent on its sequence. But not completely, because there is some flexibility in the genetic code: the same amino acid can have multiple codons and some amino acids can substitute for others. We’re mostly focused on how the protein sequence is conserved.
  • #20: With microRNAs, we have to define the conservation we are looking for differently. Here, we are basically talking about structural features of the miRNA, which are a direct consequence of the sequence. The structure of the miRNA hairpin is heavily conserved because it’s crucial to correct processing of the miRNA. A lot of the processing machinery is dependent upon this stem structure, a certain amount of base pairing, and certain types of loops in various places. So the first thing they need to do is find the structural features they can train on. They start by looking at confirmed miRNAs in the RFAM database. For comparison, they identify ALL potential miRNAs by looking in the genome for miRNA-like hairpins. There are lots of programs that predict RNA secondary structure based on sequence, and they’re very accurate. So they basically search all 120 nt windows (90 nt overlap) in the fly genome for things that will look roughly like hairpins if transcribed. They filter these some (remove all hairpins <63 nt, with an arm <20 nt, or with <70% arm base-pairing). They end up with 760,355 putative miRNAs. This chart shows distributions for various structural features. Some things they found: length parameters for miRNAs are much more tightly regulated than for other hairpins. The overall length of the hairpin, the arm length and the loop length are all more tightly defined. miRNAs have more stable structure. They have more symmetric loops but fewer asymmetric or bulged ones.
  • #21: To look for patterns created by evolutionary constraints, they align each miRNA with its flanking regions across all 12 species with BLAST. They measured the conservation in 14 regions: 4 in each arm, 2 in the loop, 2 in the flanking regions. The conservation profile is shown here. Note the very strong conservation in the arms highlighted in red and blue as compared to the loop and flanking regions, both of which are excised in the processing of a mature miRNA. Loops and flanking regions have lots of mutations while the arms are perfectly conserved.
  • #23: The discovery rate of cloned miRNAs plateaus at a score of 0.95. Using this as a cutoff, they find 101 hairpins, including 51/60 of the cloned miRNAs and 41 completely novel miRNA candidates. To validate, they used ultra-high-throughput sequencing of RNA libraries, looking for reads that matched their candidates. They confirm a total of 84/101 predictions. They also confirm by looking at other properties of miRNAs, such as predictions that are family members of known miRNAs or that have orthologs in other species. They end up finding that 28/41 novel predictions have strong evidence for being functional miRNAs.
  • #24: This is a table showing the top scoring miRNAs they predicted. Notice how well most of these are conserved. Many are conserved perfectly across all 12 species.
  • #25: Antisense strand, the strand of DNA opposite the miRNA coding gene can also produce miRNAs, as well as the miRNA* sequence – the opposite arm of the stem structure on the pre-miRNA.
  • #28: If this is the target sequence, we could imagine that if the mRNA is folded into this tight secondary structure, the miRNA would have a hard time targeting it. It needs an open region to bind to.
  • #29: Luciferase assay to demonstrate that mutating known miRNA targets to make them more closed reduces the ability of miRNAs to repress expression. Grim is a naturally closed target, so we don’t see much of an increase in luciferase.
  • #32: These other methods apply filters while this model uses no parameters or thresholds. We can see that it performs better and is simpler mechanistically than the others.