Gene Prediction: Similarity-Based Approaches
Outline
1. Introduction
2. Exon Chaining Problem
3. Spliced Alignment
4. Gene Prediction Tools
Section 1:
Introduction
Similarity-Based Approach to Gene Prediction
• Some genomes may be well-studied, with many genes having
been experimentally verified.
• Closely-related organisms may have similar genes.
• In order to determine the functions of unknown genes,
researchers may compare them to known genes in a closely-
related species.
• This is the idea behind the similarity-based approach to gene
prediction.
Similarity-Based Approach to Gene Prediction
• There is one issue with comparison: genes are often “split.”
• The exons, or coding sections of a gene, are separated by introns, or non-coding sections.
• Example:
…AGGGTCTCATTGTAGACAGTGGTACTGATCAACGCAGGACTT…
(segments alternate along the sequence: Coding, Non-coding, Coding, Non-coding)
• Our Problem: Given a known gene and an unannotated genome sequence, find a set of substrings in the genomic sequence whose concatenation best matches the known gene.
• This concatenation will be our candidate gene.
• Small islands of similarity corresponding to similarities between
exons
Comparing Genes in Two Genomes
[Figure: a known frog gene aligned against the human genome]
Using Similarities to Find Exon Structure
• A (known) frog gene is aligned to different locations in the
human genome.
• Find the “best” path to reveal the exon structure of the corresponding human gene.
Finding Local Alignments
• Use local alignments to find all islands of similarity.
[Figure: local alignments between the known frog gene and the human genome]
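As an illustration of how one island of similarity could be found, here is a minimal Smith-Waterman-style sketch. The function name `local_alignment` and the scoring values (match +1, mismatch -1, indel -1) are assumptions chosen for the example, not values from the slides. It returns only the single best island; a real pipeline would collect all sufficiently high-scoring ones.

```python
def local_alignment(a, b, match=1, mismatch=-1, indel=1):
    """Best local alignment of a and b.

    Returns (score, (a_start, a_end), (b_start, b_end)) with
    0-based, half-open intervals. Scoring values are illustrative.
    """
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 option lets an alignment start anywhere (local, not global).
            S[i][j] = max(0,
                          S[i - 1][j] - indel,
                          S[i][j - 1] - indel,
                          S[i - 1][j - 1] + d)
            if S[i][j] > best:
                best, best_pos = S[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    while i > 0 and j > 0 and S[i][j] > 0:
        d = match if a[i - 1] == b[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + d:
            i, j = i - 1, j - 1
        elif S[i][j] == S[i - 1][j] - indel:
            i -= 1
        else:
            j -= 1
    return best, (i, best_pos[0]), (j, best_pos[1])


score, genome_span, gene_span = local_alignment("GGGGACGTCCCC", "TTACGTAA")
```

Here `genome_span` and `gene_span` mark where the shared island sits in each sequence; the triple (start, end, score) is the kind of candidate exon used later in exon chaining.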
[Figure: genome aligned against mRNA; Exon 1, Exon 2, and Exon 3 are separated by Intron 1 and Intron 2]
Reverse Translation
• Reverse Translation Problem: Given a known protein, find a
gene which codes for it.
• Inexact: most amino acids are encoded by more than one codon, so a protein does not determine a unique DNA sequence.
• This problem is essentially reduced to an alignment
problem.
• Example: Comparing Genomic DNA Against mRNA.
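To make the ambiguity of reverse translation concrete, the sketch below counts how many distinct DNA sequences encode a given protein under the standard genetic code. The function name `count_coding_sequences` is hypothetical; the table lists the number of codons per amino acid (61 sense codons in total), and stop codons are ignored.

```python
import math

# Number of codons per amino acid in the standard genetic code
# (one-letter amino acid codes; the 3 stop codons are not included).
CODONS_PER_AMINO_ACID = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2,
    "Q": 2, "E": 2, "G": 4, "H": 2, "I": 3,
    "L": 6, "K": 2, "M": 1, "F": 2, "P": 4,
    "S": 6, "T": 4, "W": 1, "Y": 2, "V": 4,
}


def count_coding_sequences(protein):
    """Number of DNA sequences that translate to the given protein."""
    return math.prod(CODONS_PER_AMINO_ACID[aa] for aa in protein)


print(count_coding_sequences("LSR"))  # three 6-fold degenerate residues
```

Even a three-residue peptide can have hundreds of encodings, which is why the problem is treated as an alignment problem rather than a lookup.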
[Figure: portion of genome aligned against mRNA (codon sequence); Exon 1, Exon 2, and Exon 3 are separated by Intron 1 and Intron 2]
Comparing Genomic DNA Against mRNA
Reverse Translation
• The reverse translation problem can be modeled as traveling in a Manhattan grid with “free” horizontal jumps.
• Each horizontal jump models insertion of an intron.
• Complexity of Manhattan grid is $O(n^3)$.
• Issue: The alignment would match nucleotides pointwise and use horizontal jumps at every opportunity.
Section 2:
Exon Chaining Problem
Chaining Local Alignments
• Aim: Find candidate exons, or substrings that match a given
gene sequence.
• Define a candidate exon as (l, r, w):
• l = starting position of exon
• r = ending position of exon
• w = weight of exon, defined as score of local alignment or
some other weighting score
• Idea: Look for a chain of substrings with maximal score.
• Chain: a set of non-overlapping nonadjacent intervals.
• Locate the beginning and end of each interval (2n points).
• Find the “best” concatenation of some of the intervals.
[Figure: seven weighted intervals (weights 3, 4, 11, 9, 15, 5, 5) whose endpoints lie at 0, 2, 3, 5, 6, 11, 13, 16, 20, 25, 27, 28, 30, 32]
Exon Chaining Problem: Illustration
Exon Chaining Problem: Formulation
• Goal: Given a set of putative exons, find a maximum-weight set of non-overlapping putative exons (a chain).
• Input: A set of weighted intervals (putative exons).
• Output: A maximum-weight chain of intervals from this set.
• Question: Would a greedy algorithm solve this problem?
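One way to explore the question is to try a specific greedy rule on a small input. The sketch below assumes a "heaviest exon first" rule (other greedy rules are possible) and compares it with brute force over all subsets. The function names and the three example intervals are made up for illustration.

```python
from itertools import combinations


def is_chain(intervals):
    """True if the (l, r, w) intervals are pairwise non-overlapping."""
    ordered = sorted(intervals)
    return all(a[1] < b[0] for a, b in zip(ordered, ordered[1:]))


def greedy_chain_weight(exons):
    """Repeatedly take the heaviest exon compatible with those already chosen."""
    chosen = []
    for exon in sorted(exons, key=lambda e: -e[2]):
        if is_chain(chosen + [exon]):
            chosen.append(exon)
    return sum(w for _, _, w in chosen)


def best_chain_weight(exons):
    """Exhaustive search over all subsets (only feasible for tiny inputs)."""
    return max(sum(e[2] for e in subset)
               for k in range(len(exons) + 1)
               for subset in combinations(exons, k)
               if is_chain(subset))


exons = [(1, 10, 5), (1, 4, 3), (6, 10, 3)]
print(greedy_chain_weight(exons), best_chain_weight(exons))
```

On this input the heaviest interval blocks two lighter intervals whose combined weight is larger, so this particular greedy rule is not optimal.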
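• This problem can be solved with dynamic programming in O(n) time once the 2n interval endpoints are sorted.
• Idea: Build a graph on the sorted endpoints. Connect adjacent endpoints with weight-zero edges, and connect the two ends of each exon of weight w with an edge of weight w. A maximum-weight chain is then a longest path in this graph.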
Exon Chaining Problem: Graph Representation
ExonChaining(G, n)   // G: graph, n: number of intervals
  for i ← 1 to 2n
    s_i ← 0
  for i ← 1 to 2n
    if vertex v_i in G corresponds to the right end of an interval I
      j ← index of the vertex for the left end of interval I
      w ← weight of interval I
      s_i ← max {s_j + w, s_(i-1)}
    else
      s_i ← s_(i-1)
  return s_2n
Exon Chaining Problem: Pseudocode
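A minimal Python sketch of the pseudocode above, working directly on a list of (l, r, w) triples instead of an explicit graph. The name `exon_chaining` and the tie-breaking rule (left ends sort before right ends at the same coordinate, so intervals that share an endpoint count as overlapping) are assumptions of this sketch.

```python
def exon_chaining(exons):
    """Maximum total weight of a chain of non-overlapping intervals.

    exons: list of (l, r, w) triples with l <= r.
    """
    # Build the 2n endpoints. kind 0 = left end, kind 1 = right end, so at
    # equal coordinates left ends come first after sorting.
    points = []
    for idx, (l, r, _) in enumerate(exons):
        points.append((l, 0, idx))
        points.append((r, 1, idx))
    points.sort()

    s = [0] * (len(points) + 1)  # s[i]: best weight using the first i endpoints
    left_pos = {}                # interval index -> position of its left end
    for i, (_, kind, idx) in enumerate(points, start=1):
        if kind == 0:
            left_pos[idx] = i
            s[i] = s[i - 1]
        else:
            j = left_pos[idx]
            w = exons[idx][2]
            # Either end a chain with this interval, or skip it.
            s[i] = max(s[j] + w, s[i - 1])
    return s[-1]


print(exon_chaining([(1, 10, 5), (1, 4, 3), (6, 10, 3)]))
```

The sort dominates the running time; the scan over the 2n endpoints is linear, matching the dynamic programming step in the pseudocode.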
Exon Chaining: Deficiencies
1. Poor definition of the putative exon endpoints.
2. Optimal chain of intervals may not correspond to a valid
alignment.
• Example: First interval may correspond to a suffix, whereas
second interval may correspond to a prefix.
Human Genome
FrogGenes(known)
Infeasible Chains: Illustration
• Red local similarities form two non-overlapping intervals but
do not form a valid global alignment.
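The infeasibility described above can be checked mechanically if each local similarity records its interval in both sequences. The sketch below assumes a hypothetical tuple layout (g_start, g_end, t_start, t_end) with inclusive coordinates in the genome and in the known gene; a chain is consistent only when the intervals appear in the same order in both.

```python
def is_consistent_chain(hits):
    """True if the hits are non-overlapping and co-linear in both sequences.

    hits: list of (g_start, g_end, t_start, t_end) tuples, inclusive
    coordinates in the genome (g) and in the known gene (t).
    """
    ordered = sorted(hits)  # order by genomic start
    return all(a[1] < b[0] and a[3] < b[2]
               for a, b in zip(ordered, ordered[1:]))


# First genomic interval matches a suffix of the gene, second matches a prefix.
print(is_consistent_chain([(0, 10, 50, 60), (20, 30, 0, 10)]))
# Same genomic intervals, but the gene intervals are in the same order.
print(is_consistent_chain([(0, 10, 0, 10), (20, 30, 11, 21)]))
```

Exon chaining as formulated only looks at the genomic coordinates, which is why it can return chains that fail this check.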
• The cell carries DNA as a blueprint for producing proteins,
like a manufacturer carries a blueprint for producing a car.
Gene Prediction Analogy
Gene Prediction Analogy: Using Blueprint
• Each protein has its own distinct blueprint for construction.
Gene Prediction Analogy: Assembling Exons
Gene Prediction Analogy: Still Assembling…
Section 3:
Spliced Alignment
Spliced Alignment
• Mikhail Gelfand and colleagues proposed a spliced alignment
approach of using a protein from one genome to reconstruct
the exon-intron structure of a (related) gene in another
genome.
• Spliced alignment begins by selecting either all putative exons
between potential acceptor and donor sites or by finding all
substrings similar to the target protein (as in the Exon
Chaining Problem).
• This set is further filtered in such a way that attempts to retain all true exons, along with some false ones.
Spliced Alignment Problem: Formulation
• Goal: Find a chain of blocks in a genomic sequence that best
fits a target sequence.
• Input: Genomic sequence G, target sequence T, and a set of candidate exons B.
• Output: A chain of exons C such that the global alignment
score between C* and T is maximum among all chains of
blocks from B.
• Here C* is the concatenation of all exons from chain C.
Example: Lewis Carroll’s “Jabberwocky”
• Genomic Sequence: “It was a brilliant thrilling morning and the slimy, hellish, lithe doves gyrated and gamboled nimbly in the waves.”
• Target Sequence: “Twas brillig, and the slithy toves did gyre and gimble in the wabe.”
• Alignment: Next Slide…
Spliced Alignment: Idea
• Compute the best alignment between i-prefix of genomic
sequence G and j-prefix of target T: S(i,j)
• But what is the “i-prefix” of G?
• There may be a few i-prefixes of G, depending on which block
B we are in.
• Compute the best alignment between i-prefix of genomic
sequence G and j-prefix of target T under the assumption that
the alignment uses the block B at position i: S(i, j, B).
Spliced Alignment Recurrence
• If i is not the starting vertex of block B:

$$S(i, j, B) = \max \begin{cases} S(i-1, j, B) - p \\ S(i, j-1, B) - p \\ S(i-1, j-1, B) + \delta(g_i, t_j) \end{cases}$$

• If i is the starting vertex of block B:

$$S(i, j, B) = \max_{B' \text{ preceding } B} \begin{cases} S(i, j-1, B) - p \\ S(\mathrm{end}(B'), j, B') - p \\ S(\mathrm{end}(B'), j-1, B') + \delta(g_i, t_j) \end{cases}$$

where p is some indel penalty and δ is a scoring matrix.
Spliced Alignment Recurrence
• After computing the three-dimensional table S(i, j, B), the
score of the optimal spliced alignment is:

$$\max_{B} \; S(\mathrm{end}(B), \mathrm{length}(T), B)$$
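The recurrence can be sketched in Python by filling one dynamic programming table per block. This is an illustrative reading of the slides, not the published implementation: the name `spliced_alignment`, the 0-based inclusive block coordinates, the match/mismatch scores standing in for δ, and the treatment of an empty chain as "all of the target prefix aligned to gaps" are assumptions. The "entry row" of a block plays the role of the maximum over preceding blocks B'.

```python
def spliced_alignment(G, T, blocks, match=1, mismatch=-1, indel=1):
    """Score of the best global alignment between T and a chain of blocks of G.

    blocks: list of (start, end) pairs, 0-based inclusive positions in G.
    Block B' precedes block B when end(B') < start(B).
    """
    m = len(T)
    exit_rows = {}  # (start, end) -> [S(end(B), j, B) for j in 0..m]
    # Sorting by start guarantees every preceding block is finished first.
    for start, end in sorted(blocks):
        # Entry row: best score of any chain ending before this block,
        # or the empty chain (T[:j] aligned entirely to gaps).
        row = [-indel * j for j in range(m + 1)]
        for (_, prev_end), prev_row in exit_rows.items():
            if prev_end < start:
                row = [max(a, b) for a, b in zip(row, prev_row)]
        # Ordinary global-alignment recurrence inside the block.
        for i in range(start, end + 1):
            new = [row[0] - indel]
            for j in range(1, m + 1):
                delta = match if G[i] == T[j - 1] else mismatch
                new.append(max(row[j] - indel,        # g_i against a gap
                               new[j - 1] - indel,    # t_j against a gap
                               row[j - 1] + delta))   # g_i against t_j
            row = new
        exit_rows[(start, end)] = row
    # Optimal score: best block to end the chain on, with all of T consumed.
    return max(r[m] for r in exit_rows.values())


G = "ATGCCCCTTA"
blocks = [(0, 2), (3, 6), (7, 9)]   # "ATG", "CCCC", "TTA"
print(spliced_alignment(G, "ATGTTA", blocks))
```

In this toy input the best chain skips the middle block, so the concatenation "ATG" + "TTA" matches the target exactly. The sketch returns only the score; recovering the chain itself would need backpointers.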
Spliced Alignment: Complications
• Considering multiple i-prefixes leads to a slowdown.
• Running time: $O(mn^2|B|)$, where m is the target length, n is the genomic sequence length, and |B| is the number of blocks.
• A mosaic effect: short exons are easily combined to fit any target protein.
Spliced Alignment: Speedup
$$P(i, j) = \max_{B \text{ preceding } i} S(\mathrm{end}(B), j, B)$$
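With P(i, j) precomputed, the case for the starting vertex of a block no longer needs its own maximization over preceding blocks. A sketch of the rewritten case, assuming the same indel penalty p and scoring matrix δ as in the recurrence, and reading "B preceding i" as "block B ends before position i":

```latex
S(i, j, B) = \max
\begin{cases}
S(i, j-1, B) - p \\
P(i, j) - p \\
P(i, j-1) + \delta(g_i, t_j)
\end{cases}
\qquad \text{if } i \text{ is the starting vertex of block } B
```

Each P(i, j) is shared by every block that starts at position i, so the maximum over preceding blocks is computed once per (i, j) rather than once per block.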
Exon Chaining vs Spliced Alignment
• In Spliced Alignment, every path spells out the string obtained
by concatenation of labels of its edges.
• The weight of the path is defined as optimal alignment score
between concatenated labels (blocks) and target sequence.
• Defines weight of entire path in graph, but not the weights
for individual edges.
• Exon Chaining assumes the positions and weights of exons are
pre-defined.
Section 4:
Gene Prediction Tools
Gene Prediction: Aligning Genome vs. Genome
• Goal: Align entire human and mouse genomes.
• Predict genes in both sequences simultaneously as chains of
aligned blocks (exons).
• This approach does not assume any annotation of either human
or mouse genes.
Gene Prediction Tools
1. GENSCAN/GenomeScan
2. TwinScan
3. Glimmer
4. GeneMark
The GENSCAN Algorithm
• Algorithm is based on a probabilistic model of gene structure.
• GENSCAN's model parameters are estimated from a “training set”; the algorithm then returns the most likely exon structure using the maximum likelihood approach standard to many HMM algorithms (the Viterbi algorithm).
• Biological input: Codon bias in coding regions, gene structure (start and stop codons, typical exon and intron length, presence of promoters, presence of genes on both strands, etc.).
• Benefit: Covers cases where input sequence contains no gene,
partial gene, complete gene, or multiple genes.
GENSCAN Limitations
• Does not use similarity search to predict genes.
• Does not address alternative splicing.
• Could combine two exons from consecutive genes together.
GenomeScan
• Incorporates similarity information into GENSCAN: predicts
gene structure which corresponds to maximum probability
conditional on similarity information.
• Algorithm is a combination of two sources of information:
1. Probabilistic models of exons-introns.
2. Sequence similarity information.
http://www.stanford.edu/class/cs262/Spring2003/Notes/ln10.pdf
TwinScan
• Aligns two sequences and marks each base as gap (-), mismatch (:), or match (|).
• Results in a new alphabet of 12 letters:
Σ = {A-, A:, A|, C-, C:, C|, G-, G:, G|, T-, T:, T|}.
• Then run the Viterbi algorithm using emissions e_k(b), where b ∊ {A-, A:, A|, …, T|}.
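The 12-letter alphabet can be illustrated with a short sketch that annotates each base of one aligned sequence with its alignment status. The function name `conservation_sequence` is hypothetical, and skipping columns where the annotated sequence itself has a gap is an assumption of this sketch, not something the slides specify.

```python
def conservation_sequence(target_aln, informant_aln):
    """Label each target base as gap (-), mismatch (:), or match (|).

    Both arguments are rows of a pairwise alignment of equal length,
    with '-' marking gaps.
    """
    symbols = []
    for t, q in zip(target_aln, informant_aln):
        if t == "-":
            continue          # assumption: no target base, nothing to label
        if q == "-":
            mark = "-"        # target base aligned to a gap
        elif q == t:
            mark = "|"        # match
        else:
            mark = ":"        # mismatch
        symbols.append(t + mark)
    return symbols


print(conservation_sequence("ACG-T", "AC-GA"))
```

Each element of the result is one letter of Σ, ready to be used as an HMM observation.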
TwinScan
• The emission probabilities are estimated from human/mouse
gene pairs.
• Example: e_I(x|) < e_E(x|) since matches are favored in exons, and e_I(x-) > e_E(x-) since gaps (as well as mismatches) are favored in introns.
• Benefit: Compensates for dominant occurrence of poly-A
region in introns.
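To show how such emissions drive the decoding, here is a toy two-state Viterbi sketch with an exon state E and an intron state I. All probabilities are invented for illustration (they only respect the inequalities in the example above), and for brevity the emission depends only on the alignment mark, not on the base. Real TwinScan and GENSCAN models have many more states and trained parameters.

```python
import math

STATES = ("E", "I")
START = {"E": 0.5, "I": 0.5}
TRANS = {"E": {"E": 0.9, "I": 0.1},
         "I": {"E": 0.1, "I": 0.9}}
# Hypothetical emissions: matches favored in exons, gaps and mismatches in introns.
EMIT = {"E": {"|": 0.80, ":": 0.15, "-": 0.05},
        "I": {"|": 0.30, ":": 0.40, "-": 0.30}}


def viterbi(symbols):
    """Most likely state path for a non-empty list of symbols such as 'A|'."""
    def emit(state, sym):
        return math.log(EMIT[state][sym[-1]])  # last character is the mark

    score = {k: math.log(START[k]) + emit(k, symbols[0]) for k in STATES}
    back = []
    for sym in symbols[1:]:
        new, ptr = {}, {}
        for k in STATES:
            prev = max(STATES, key=lambda q: score[q] + math.log(TRANS[q][k]))
            new[k] = score[prev] + math.log(TRANS[prev][k]) + emit(k, sym)
            ptr[k] = prev
        score = new
        back.append(ptr)
    # Follow the backpointers from the best final state.
    state = max(STATES, key=lambda k: score[k])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


print("".join(viterbi(["A|", "C|", "G|", "T-", "A:", "C-"])))
```

With these numbers the run of matches is labeled as exon and the gap/mismatch run as intron, which is the behavior the emission inequalities are meant to produce.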
Glimmer
• Glimmer: Stands for Gene Locator and Interpolated Markov ModelER
• Finds genes in bacterial DNA
• Uses interpolated Markov Models
Glimmer
• Made of 2 programs:
  1. BuildIMM:
    • Takes sequences as input and outputs the Interpolated Markov Models (IMMs).
  2. Glimmer:
    • Takes IMMs and outputs all candidate genes.
    • Automatically resolves overlapping genes by choosing one, hence limited.
    • Marks “suspected to truly overlap” genes for closer inspection by user.
GeneMark
• Based on non-stationary Markov chain models.
• Results are displayed graphically, with the coding vs. noncoding probability plotted as a function of position in the nucleotide sequence.
