Translation initiation start prediction
in human cDNAs with high accuracy
A. G. Hatzigeorgiou

Paper Presentation
Introduction to Bioinformatics
Anaxagoras Fotopoulos | Marina Adamou - Tzani

21/01/2014
Introduction
• The primary objective of the present research is to contribute to the definition of the coding part of a gene.
• The search is performed in cDNA sequences.
• Coding regions are surrounded by Untranslated Regions (UTRs).
• The interest is focused on finding the Translation Initiation Start (TIS), which defines the start of the coding region.

cDNA
Complementary DNA (cDNA) is DNA synthesized from a messenger RNA (mRNA) in a reaction catalyzed by the enzymes reverse transcriptase and DNA polymerase.
Previous Research
Salzberg, 1997: Positional Conditional Probability matrix.
Agarwal and Bafna, 1998a: Generalized Second Order Profiles.
• Implementation of the Ribosome Scanning Model (Kozak, 1996): the ribosome first attaches to a specific region at the 5’ end of the mRNA and then scans the sequence for the first ATG.
• No significant differences were observed between the above methods and a weight matrix.
• The above methods are studied together because of their high rate of false positives.
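The scanning step described above can be pictured with a minimal Python sketch. It only illustrates "scan from the 5’ end for the first ATG"; the function name `first_atg` is hypothetical and the sketch is not any of the cited programs.

```python
def first_atg(cdna: str) -> int:
    """Return the 0-based index of the first ATG when scanning from the 5' end.

    Returns -1 when the sequence contains no ATG. Illustration of the
    scanning idea only; the name and interface are assumptions.
    """
    return cdna.upper().find("ATG")


if __name__ == "__main__":
    # The first ATG starts at index 3; the second ATG (index 9) is ignored.
    print(first_atg("ccgATGgcaATGtaa"))
```

Taking the first ATG unconditionally is the simplest reading of the scanning model. The methods in this presentation score each candidate ATG instead of stopping at the first one.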
Pedersen and Nielsen, 1997: Use of ANNs to recognize the local context and statistical properties around the TIS. A large region is analyzed: 100 bases before and 100 bases after the start codon.
Salamov et al., 1998: Six characteristics are applied to the analysis of the region around the TIS, including a weight matrix and hexanucleotide difference.
Zien et al., 2000: Use of Support Vector Machines (SVMs) for TIS prediction.

All of the above methods give up to 85% correct predictions.
Methods – Suggested Model
[Flow diagram: 475 cDNAs from Swissprot (verified + checked) feed two parallel branches, a consensus NN (conserved motif) and a coding NN (coding/non-coding potential). In each branch a training gene pool (training set + evaluation set) is used for parameter estimation and a test gene pool (test set) for TIS prediction. The two outputs are combined by score multiplication.]
Consensus Neural Network
• 325 positive + 325 negative examples
• 12-nucleotide-long window
• Binarization of the input
• Selection of the appropriate feed-forward NN: feed-forward with shortcut connections and two hidden units, trained with the cascade correlation algorithm
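As a rough illustration of binarizing a 12-nucleotide window, the sketch below uses a one-hot code with four bits per nucleotide, which gives 48 binary inputs. The slides do not state the exact encoding, so the one-hot scheme, the all-zero code for unknown bases and the name `binarize_window` are assumptions.

```python
from typing import List

# Assumed one-hot code: one bit per possible nucleotide.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}
WINDOW_LENGTH = 12  # window size stated on the slide


def binarize_window(window: str) -> List[int]:
    """Encode a 12-nucleotide window as a flat list of 48 bits."""
    window = window.upper()
    if len(window) != WINDOW_LENGTH:
        raise ValueError("expected a window of exactly 12 nucleotides")
    bits: List[int] = []
    for base in window:
        # Unknown symbols (for example N) map to all zeros (an assumption).
        bits.extend(ONE_HOT.get(base, (0, 0, 0, 0)))
    return bits


if __name__ == "__main__":
    encoded = binarize_window("GCCACCATGGCG")
    print(len(encoded), encoded[:4])
```

A one-hot code avoids imposing an artificial ordering on the four nucleotides, which a single integer per base would do.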
Coding Neural Network
• 54-nucleotide-long window
• 12-nucleotide-long window
• Apply codon usage statistic (count all non-overlapping codons in every window)
• The sequence window is rescaled to 64 units; every unit gives the normalized frequency of the codon in the window
• Use the Smith–Waterman algorithm to eliminate homologies between training and test data
• 282 genes with less than 70% homology were used for training
• Sequence regions extracted for training: 700 positive + 700 negative
• Sequence regions extracted for testing: 250 positive + 250 negative
• The resilient backpropagation algorithm is applied to a feed-forward NN.
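The rescaling of a window to 64 units can be sketched as a codon frequency vector: count the non-overlapping codons in the window and divide by the number of codons counted. The alphabetical ordering of the 64 units, the division by the counted codons and the name `codon_frequencies` are assumptions made for this sketch.

```python
from itertools import product
from typing import List

# All 64 codons in a fixed (assumed) alphabetical order.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]
CODON_INDEX = {codon: i for i, codon in enumerate(CODONS)}


def codon_frequencies(window: str) -> List[float]:
    """Map a nucleotide window to 64 normalized codon frequencies.

    Codons are read without overlap starting at the first nucleotide.
    Codons with symbols other than A, C, G, T are skipped.
    """
    window = window.upper()
    counts = [0] * 64
    counted = 0
    for i in range(0, len(window) - 2, 3):
        index = CODON_INDEX.get(window[i:i + 3])
        if index is not None:
            counts[index] += 1
            counted += 1
    if counted == 0:
        return [0.0] * 64
    return [c / counted for c in counts]


if __name__ == "__main__":
    window = "ATG" * 9 + "GCC" * 9  # 54 nucleotides, 18 codons
    freqs = codon_frequencies(window)
    print(freqs[CODON_INDEX["ATG"]], freqs[CODON_INDEX["GCC"]])
```

With a 54-nucleotide window this reads 18 codons, so each unit is a multiple of 1/18.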
Integrated method
Analysis of full length mRNA sequences

1st stage
• Calculation of the coding score for every nucleotide of the mRNA sequence

2nd stage
• Calculation of the coding evidence of the coding region included in the longest ORF of the sequence

3rd stage
• For every in-frame ATG, a consensus score is calculated

4th stage
• For the same in-frame ATG, a coding difference score is calculated

The final score is obtained by combining the output of the consensus ANN and the coding difference.
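The stages above can be sketched in Python as follows. This is a simplified sketch, not the presented implementation. Three things are assumptions: the longest ORF is taken to be the longest stop-free in-frame stretch on the forward strand, the two scoring functions are passed in as hypothetical callables, and the scores are combined by multiplication (as the model diagram suggests).

```python
from typing import Callable, List, Optional, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
Scorer = Callable[[str, int], float]  # (sequence, ATG position) -> score


def longest_orf(seq: str) -> Tuple[int, int]:
    """Return (start, end) of the longest stop-free in-frame stretch.

    All three forward frames are examined; end is exclusive.
    """
    seq = seq.upper()
    best = (0, 0)
    for frame in range(3):
        start = frame
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                if i - start > best[1] - best[0]:
                    best = (start, i)
                start = i + 3
        # Stretch that runs to the end of the sequence without a stop.
        end = frame + ((len(seq) - frame) // 3) * 3
        if end - start > best[1] - best[0]:
            best = (start, end)
    return best


def in_frame_atgs(seq: str, orf: Tuple[int, int]) -> List[int]:
    """List the positions of all ATG codons in frame within the ORF."""
    start, end = orf
    return [i for i in range(start, end, 3) if seq[i:i + 3].upper() == "ATG"]


def predict_tis(seq: str, consensus_score: Scorer,
                coding_difference: Scorer) -> Optional[int]:
    """Return the in-frame ATG with the highest combined score, if any."""
    candidates = in_frame_atgs(seq, longest_orf(seq))
    if not candidates:
        return None
    return max(candidates,
               key=lambda p: consensus_score(seq, p) * coding_difference(seq, p))


if __name__ == "__main__":
    seq = "ATGAAAATGCTAGCTAACACCTAA"
    # Toy scorers standing in for the two trained networks.
    consensus = lambda s, p: {0: 0.2, 6: 0.8}[p]
    coding = lambda s, p: 0.9
    print(longest_orf(seq), in_frame_atgs(seq, longest_orf(seq)))
    print(predict_tis(seq, consensus, coding))
```

Returning a single position per sequence matches the idea of one prediction per ORF. The toy scorers only stand in for the trained networks.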
• This method provides only one prediction for every ORF
• According to the results of the test group:
  • 94% of the TIS were correctly predicted
  • 6% of the predictions were false positives

The use of the Las Vegas algorithm gives a confident decision. The incorporation of this algorithm leads to a highly accurate recognition of the TIS in human cDNAs for 60% of the cases!

Las Vegas
A Las Vegas algorithm provides a correct prediction in some cases and has a “no answer” option in the remaining cases. That is, it always produces the correct result or it reports the failure.
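One way to picture the “no answer” option is a decision rule that abstains when the best candidate is not convincing. The slides do not say how confidence is determined, so the threshold rule, its default value and the name `confident_tis` below are purely illustrative assumptions. Unlike a Las Vegas algorithm in the strict sense, a threshold alone does not guarantee that every returned answer is correct.

```python
from typing import Dict, Optional


def confident_tis(scores: Dict[int, float],
                  threshold: float = 0.5) -> Optional[int]:
    """Return the best-scoring ATG position, or None for "no answer".

    scores maps candidate ATG positions to final scores. The threshold
    is a hypothetical parameter, not a value taken from the source.
    """
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None


if __name__ == "__main__":
    print(confident_tis({148: 0.76, 255: 0.196}))  # answers with a position
    print(confident_tis({255: 0.196, 270: 0.176}))  # abstains
```

Abstaining on low-scoring sequences trades coverage for reliability: answers are given for fewer sequences, and those answers are more often correct.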
Results – Score Combination 1/3
Nucleotide 255: cod 0.98 – local 0.2
A score combination of coding ANN and consensus ANN gives a low final score.

Cod line: Score of coding ANN
Local line: Score and position of consensus ANN for all ATGs in coding frame

Results – Score Combination 2/3
Nucleotide 270: cod 0.44 – local 0.4
A score combination of coding ANN and consensus ANN gives a low final score.

Results – Score Combination 3/3
Correct TIS
Nucleotide 148: cod 0.95 – local 0.8
A score combination of coding ANN and consensus ANN gives a high final score.
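The three examples can be checked with a few lines of arithmetic. The sketch assumes that the combination is a plain product of the "cod" and "local" scores and that the three nucleotides are candidates in the same sequence. Neither assumption is stated explicitly on these slides.

```python
# (cod, local) scores read off the three slides, keyed by nucleotide position.
examples = {
    255: (0.98, 0.2),
    270: (0.44, 0.4),
    148: (0.95, 0.8),
}


def combined(cod: float, local: float) -> float:
    """Assumed combination rule: multiply the two scores."""
    return cod * local


scores = {pos: round(combined(cod, local), 3)
          for pos, (cod, local) in examples.items()}
best = max(scores, key=scores.get)

if __name__ == "__main__":
    for pos, score in scores.items():
        print(pos, score)
    print("highest combined score at nucleotide", best)
```

Under this assumption the products are 0.196, 0.176 and 0.76, so only nucleotide 148, the correct TIS, receives a high final score. A high coding score with a weak consensus score (nucleotide 255) is not enough.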
Results – Methods Comparison
[Figure: Correct TIS positions]
[Figure: Prediction for the 3 TIS positions with the highest scores]
[Figure: Consensus motif scores (only for DIANA-TIS)]
[Figure: Final scores]
[Figure: Correct predictions]
[Figure: Prediction Analysis. Annotations: high prediction score difference; TIS correct position: 471; did not find TIS; found TIS but other higher score exists]

Performance of the three programs for TIS prediction along the mRNA with signal peptide sequences
[Figure: Correct TIS positions]
[Figure: Length of signal peptide]
[Figure: Prediction for the 2 TIS positions with the highest scores]
[Figure: Consensus motif scores (only for DIANA-TIS)]
[Figure: Final scores]
Prediction example #1:
DIANA-TIS is able to distinguish between the TIS and other ATGs better than other ANN-based programs such as NetStart:
• 2 suitable ATGs are 12 nucleotides apart
• Coding/non-coding information is similar
• Consensus motif is completely different
Prediction example #2:
A favorable prediction is not obtained for all examples:
• Consensus motif is completely different
• Combined score is much lower
• In some signal peptide sequences the coding potential score is relatively low and can thus affect the combined score.
TIS prediction program                 TIS prediction rate
DIANA-TIS (2001)                       94%
Agarwal & Bafna (1998)                 85%
ATGPred (Salamov et al., 1998)         79%
NetStart (Pedersen & Nielsen, 1997)    78%

These methods allow more than one prediction per gene.

Notice: The results come from different datasets, and thus these numbers should not be directly compared.
Thank you!

Introduction to Bioinformatics
Information Technologies in Medicine and Biology
• National & Kapodistrian University of Athens, Department of Informatics
• Biomedical Research Foundation, Academy of Athens
• Technological Education Institute of Athens, Department of Biomedical Engineering
• Demokritos National Center for Scientific Research