Genome assembly and gene prediction
Robin Ohm
Microbiology Department
1 November 2017
Master Course Introduction to Bioinformatics
Genome sequencing: pipeline
Organism → DNA extraction → DNA → Sequencing → Sequence reads (100-300 bp) → Assembly → Gene prediction
Sequencing approaches
Making contigs
Gap
Single stranded
• Assemble the short sequences into ‘contigs’:
contiguous stretches of sequence
• Overlapping short sequences are merged into a
consensus sequence
Finding overlap between fragments
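The merging of overlapping reads into a contig can be sketched in a few lines of Python. This is a minimal illustration that assumes error-free reads on the same strand and exact suffix/prefix matches; the function names `overlap_length` and `merge_reads` and the `min_overlap` threshold are hypothetical and not taken from any assembler.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for length in range(max_len, min_overlap - 1, -1):
        if left[-length:] == right[:length]:
            return length
    return 0  # no overlap of at least min_overlap bases


def merge_reads(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Merge two reads into one contig, or return None if they do not overlap."""
    length = overlap_length(left, right, min_overlap)
    if length == 0:
        return None
    # Keep the shared region once and append the non-overlapping tail of `right`.
    return left + right[length:]


read_a = "TTGATAACAAGTCAGG"
read_b = "AGTCAGGAATCAGTGT"
contig = merge_reads(read_a, read_b)
print(overlap_length(read_a, read_b), contig)
```

Real assemblers also have to tolerate sequencing errors and reads from the opposite strand, so they use alignment scores rather than exact string matches.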
Contig assembly: CAP3 Algorithm
• Uses quality values to define 5’ and 3’ clipping regions
• Reads are assembled into a contig if both
– have high similarity
– are high-quality sequences
• Quality values determine which base is used in the consensus
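The two roles of quality values described above (clipping read ends and choosing the consensus base) can be illustrated as follows. This is a simplified sketch and not the actual CAP3 implementation; the threshold `min_q=20`, the function names, and the rule of summing qualities per base are assumptions made for illustration.

```python
def clip_low_quality(seq, quals, min_q=20):
    """Trim bases from both ends until a base with quality >= min_q is reached."""
    start = 0
    while start < len(seq) and quals[start] < min_q:
        start += 1
    end = len(seq)
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]


def consensus_base(column):
    """Pick the base with the highest summed quality in one alignment column.

    `column` is a list of (base, quality) pairs, one per read covering the position.
    """
    totals = {}
    for base, quality in column:
        totals[base] = totals.get(base, 0) + quality
    return max(totals, key=totals.get)


read = "GACGTTGCAT"
qualities = [5, 12, 30, 35, 40, 38, 33, 25, 10, 4]
clipped_seq, clipped_quals = clip_low_quality(read, qualities)
print(clipped_seq)  # low-quality 5' and 3' ends removed

# One confident 'A' outweighs two low-quality 'G' calls.
chosen = consensus_base([("A", 30), ("G", 12), ("G", 15)])
print(chosen)
```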
Repeats complicate assembly
Misassembly caused by pseudo-overlap
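A toy example of such a pseudo-overlap: two reads that end and start in different copies of the same repeat appear to overlap, and merging them drops the unique sequence between the copies. The sequences and names below are invented for illustration only.

```python
def overlap_length(left, right, min_overlap=5):
    """Longest suffix of `left` that equals a prefix of `right` (exact match)."""
    for length in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


# Hypothetical genome: unique_a - repeat - unique_b - repeat - unique_c
unique_a, unique_b, unique_c = "CCGTA", "AGCTC", "GTCAA"
repeat = "TTGACCATGG"
genome = unique_a + repeat + unique_b + repeat + unique_c

read_1 = unique_a + repeat   # ends in the FIRST repeat copy
read_2 = repeat + unique_c   # starts in the SECOND repeat copy

shared = overlap_length(read_1, read_2)
misassembly = read_1 + read_2[shared:]

print(shared)                 # the whole repeat looks like a genuine overlap
print(misassembly in genome)  # the merged contig does not exist in the genome
print(unique_b in misassembly)
```

The merged contig joins unique_a directly to unique_c, so the region between the two repeat copies is lost.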
Usefulness of paired end-sequences
Paired-end reads are very useful
• Improving the assembly (linking contigs into scaffolds)
• Detecting genome changes relative to reference
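How paired-end reads link contigs into scaffolds can be sketched with a little arithmetic: if the two mates of a pair land on different contigs and the fragment (insert) size is known, the unsequenced gap between the contigs can be estimated. The sketch below assumes a fixed insert size of 500 bp and made-up mapping coordinates; the function name and data layout are hypothetical.

```python
from statistics import median


def estimate_gap(insert_size, contig_a_length, read1_start, mate_end_on_b):
    """Estimate the gap between contig A and contig B from one read pair.

    read1_start:   0-based start of the forward read on contig A
    mate_end_on_b: 0-based exclusive end of the reverse mate on contig B
    """
    bases_on_a = contig_a_length - read1_start  # fragment part lying on contig A
    bases_on_b = mate_end_on_b                  # fragment part lying on contig B
    return insert_size - bases_on_a - bases_on_b


INSERT_SIZE = 500        # assumed library fragment size
CONTIG_A_LENGTH = 1000   # assumed length of the left contig

# (read1_start on contig A, mate_end on contig B) for three linking pairs
pairs = [(800, 150), (850, 210), (780, 160)]
gaps = [estimate_gap(INSERT_SIZE, CONTIG_A_LENGTH, s, e) for s, e in pairs]
gap_estimate = median(gaps)
print(gaps, gap_estimate)
```

In practice insert sizes vary around a mean, so many linking pairs are combined, here with a median.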
Assembly
TTGATAACAAGTCAGGAATCAGTGTAGCCAAGAATCCAGAGCATCACGGACGCATGAAACACCTAGACCTGCGCTTCTACTGGCTCCGGGAGGCTGTTGAAGAAGGACTCAT
AGATCCTTTGTATGTTTCTACACACGAGCAAGTGGCCGACATCTTAACCAAGGCTGTCGCAAAGAAGGTAGTTGAGTTTGCTGTTCCACTATTGGGACTTGAGTAGACTGGCTA
GATCAAGGGGGTGTGTTAGAGTGAGCAAGCAGTCTACAAGTGCTATTGGTGGACTTAGAGATATGCGGACGAGTTTAGTCACATGATAGCTTAGTCATGTATGGTTGAACTTC
CCTGTGGACGCTTCAGAGGCCTTAGCTTCTTACAGTCTGTCCACAGGTGAGTACAAGTTGTTAGAACTCAATATAGCTTCTTACAGTCTGTCCACAGGATTGCTCAGTCAGTTGC
CATCTGTTTGCCTCGTTTCTAGGACCGTTAGGATCGCCACTGCAACCGCTATCGTTCTACGTTTAACAACAATATGAGGAGGACTCTGCGTGTAGCAAGACGCCAGTATTTTGA
TAAGCAAATTCATAATATGGCAAGCGATAGGAAGAGACCATGGGACTTGATGCCGTGGACTAGAGAAAGGAAGATGCCAGCGGTGGAGGCTATTCTGGACAGCAAGGGGA
ACTCGTGCAACACTGAAGAAAAGTTGTTCAAGACCCTTCACAAGACGTACAATGCGGCGGATAACAGAAAGGTGGATGTCAGCAGTATGTATAGGGAGATAGAAGAGTTCGA
GGAGAGGGAATGGGTGAAGTTCTCTGTTCAAGAATTTCACGACGCGCTCAAGAATTGTGCCAAGAACACGGCACCTGGGCCAGACCACGTCTCGTGGAGATTGTGGAAGCG
GTTTGCGACAGACGACACGGTTCTGCCAATTCGTAACAAAAATAGCCAACGCCTGTTTTGACACAGGATACTGGCCTCAACACTTCAAACAGTCCATTTCGGTGATCATTCCAA
ACCGTTGATAACAAGTCAGGAATCAGTGTAGCCAAGAATCCAGAGCATCACGGACGCATGAAACACCTAGACCTGCGCTTCTACTGGCTCCGGGAGGCTGTTGAAGAAGGAC
TCATAGATCCTTTGTATGTTTCTACACACGAGCAAGTGGCCGACATCTTAACCAAGGCTGTCGCAAAGAAGGTAGTTGAGTTTGCTGTTCCACTATTGGGACTTGAGTAGACTG
GCTAGATCAAGGGGGTGTGTTAGAGTGAGCAAGCAGTCTACAAGTGCTATTGGTGGACTTAGAGATATGCGGACGAGTTTAGTCACATGATAGCTTAGTCATGTATGGTTGA
ACTTCCCTGTGGACGCTTCAGAGGCCTTAGCTTCTTACAGTCTGTCCACAGGTGAGTACAAGTTGTTAGAACTCAATATAGCTTCTTACAGTCTGTCCACAGGATTGCTCAGTCA
GTTGCCATCTGTTTGCCTCGTTTCTAGGACCGTTAGGATCGCCACTGCAACCGCTATCGTTCTACGTTTAACAACAATATGAGGAGGACTCTGCGTGTAGCAAGACGCCAGTAT
TTTGATAAGCAAATTCATAATATGGCAAGCGATAGGAAGAGACCATGGGACTTGATGCCGTGGACTAGAGAAAGGAAGATGCCAGCGGTGGAGGCTATTCTGGACAGCAAG
GGGAACTCGTGCAACACTGAAGAAAAGTTGTTCAAGACCCTTCACAAGACGTACAATGCGGCGGATAACAGAAAGGTGGATGTCAGCAGTATGTATAGGGAGATAGAAGAG
TTCGAGGAGAGGGAATGGGTGAAGTTCTCTGTTCAAGAATTTCACGACGCGCTCAAGAATTGTGCCAAGAACACGGCACCTGGGCCAGACCACGTCTCGTGGAGATTGTGGA
AGCGGTTTGCGACAGACGACACGGTTCTGCCAATTCGTAACAAAAATAGCCAACGCCTGTTTTGACACAGGATACTGGCCTCAACACTTCAAACAGTCCATTTCGGTGATCATT
CCAAACCGTTGATAACAAGTCAGGAATCAGTGTAGCCAAGAATCCAGAGCATCACGGACGCATGAAACACCTAGACCTGCGCTTCTACTGGCTCCGGGAGGCTGTTGAAGAA
GGACTCATAGATCCTTTGTATGTTTCTACACACGAGCAAGTGGCCGACATCTTAACCAAGGCTGTCGCAAAGAAGGTAGTTGAGTTTGCTGTTCCACTATTGGGACTTGAGTAG
ACTGGCTAGATCAAGGGGGTGTGTTAGAGTGAGCAAGCAGTCTACAAGTGCTATTGGTGGACTTAGAGATATGCGGACGAGTTTAGTCACATGATAGCTTAGTCATGTATGG
TTGAACTTCCCTGTGGACGCTTCAGAGGCCTTAGCTTCTTACAGTCTGTCCACAGGTGAGTACAAGTTGTTAGAACTCAATATAGCTTCTTACAGTCTGTCCACAGGATTGCTCA
GTCAGTTGCCATCTGTTTGCCTCGTTTCTAGGACCGTTAGGATCGCCACTGCAACCGCTATCGTTCTACGTTTAACAACAATATGAGGAGGACTCTGCGTGTAGCAAGACGCCA
GTATTTTGATAAGCAAATTCATAATATGGCAAGCGATAGGAAGAGACCATGGGACTTGATGCCGTGGACTAGAGAAAGGAAGATGCCAGCGGTGGAGGCTATTCTGGACAG
CAAGGGGAACTCGTGCAACACTGAAGAAAAGTTGTTCAAGACCCTTCACAAGACGTACAATGCGGCGGATAACAGAAAGGTGGATGTCAGCAGTATGTATAGGGAGATAGA
AGAGTTCGAGGAGAGGGAATGGGTGAAGTTCTCTGTTCAAGAATTTCACGACGCGCTCAAGAATTGTGCCAAGAACACGGCACCTGGGCCAGACCACGTCTCGTGGAGATTG
TGGAAGCGGTTTGCGACAGACGACACGGTTCTGCCAATTCGTAACAAAAATAGCCAACGCCTGTTTTGACACAGGATACTGGCCTCAACACTTCAAACAGTCCATTTCGGTGAT
CATTCCAAACCGTTGATAACAAGTCAGGAATCAGTGTAGCCAAGAATCCAGAGCATCACGGACGCATGAAACACCTAGACCTGCGCTTCTACTGGCTCCGGGAGGCTGTTGAA
GAAGGACTCATAGATCCTTTGTATGTTTCTACACACGAGCAAGTGGCCGACATCTTAACCAAGGCTGTCGCAAAGAAGGTAGTTGAGTTTGCTGTTCCACTATTGGGACTTGAG
TAGACTGGCTAGATCAAGGGGGTGTGTTAGAGTGAGCAAGCAGTCTACAAGTGCTATTGGTGGACTTAGAGATATGCGGACGAGTTTAGTCACATGATAGCTTAGTCATGTA
TGGTTGAACTTCCCTGTGGACGCTTCAGAGGCCTTAGCTTCTTACAGTCTGTCCACAGGTGAGTACAAGTTGTTAGAACTCAATATAGCTTCTTACAGTCTGTCCACAGGATTGC
TCAGTCAGTTGCCATCTGTTTGCCTCGTTTCTAGGACCGTTAGGATCGCCACTGCAACCGCTATCGTTCTACGTTTAACAACAATATGAGGAGGACTCTGCGTGTAGCAAGACG
CCAGTATTTTGATAAGCAAATTCATAATATGGCAAGCGATAGGAAGAGACCATGGGACTTGATGCCGTGGACTAGAGAAAGGAAGATGCCAGCGGTGGAGGCTATTCTGGAC
AGCAAGGGGAACTCGTGCAACACTGAAGAAAAGTTGTTCAAGACCCTTCACAAGACGTACAATGCGGCGGATAACAGAAAGGTGGATGTCAGCAGTATGTATAGGGAGATA
GAAGAGTTCGAGGAGAGGGAATGGGTGAAGTTCTCTGTTCAAGAATTTCACGACGCGCTCAAGAATTGTGCCAAGAACACGGCACCTGGGCCAGACCACGTCTCGTGGAGAT
TGTGGAAGCGGTTTGCGACAGACGACACGGTTCTGCCAATTCGTAACAAAAATAGCCAACGCCTGTTTTGACACAGGATACTGGCCTCAACACTTCAAACAGTCCATTTCGGT
GATCATTCCAAACCGCGACAGACGACACGGTTCTGCCAATTCGTAACAAAAATAGCCAACGCCTGTTTTGACACAGGATACTGGCCTCAACACTTCAAACAGTCCATTTCGGT
Where are the genes?
Step 1: masking repetitive regions
Repetitive regions: DNA sequences that occur
many times in the genome
• Transposons / Transposable Elements
– Self-replicating sequences
– Often of viral origin
– Contain genes (or at least Open Reading Frames)
• Low-complexity regions. For example:
– (G)n : GGGGGGGGGGGGGGGGGGGGGGGGGGGG
– (GAT)n : GATGATGATGATGATGATGATGATGATGATG
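Simple repeats such as (G)n and (GAT)n can be found with a regular expression and soft-masked (written in lower case) so that a gene predictor can skip them. This is only a sketch for short tandem repeats with hypothetical parameters `max_unit` and `min_copies`; transposons need dedicated repeat-masking software and repeat libraries.

```python
import re


def mask_simple_repeats(seq, max_unit=3, min_copies=6):
    """Soft-mask tandem repeats of a 1..max_unit bp unit occurring >= min_copies times."""
    # Group 1 is the repeat unit; \1{n,} requires at least n further copies of it.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    return pattern.sub(lambda match: match.group(0).lower(), seq)


g_run = "TTAC" + "G" * 28 + "ATC"
gat_run = "ACGTTGCA" + "GAT" * 8 + "CCTAGGTCA"

masked_g = mask_simple_repeats(g_run)
masked_gat = mask_simple_repeats(gat_run)
print(masked_g)
print(masked_gat)
```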
Repeat masking
Step 2. Predict genes
• Prokaryotes
– Scan for Open Reading Frames (ORFs)
• Eukaryotes
– Simple ORF scanning not possible due to introns
Prokaryotes: scan for open reading frames
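A basic ORF scan looks in all six reading frames (three per strand) for a start codon followed in frame by a stop codon. The sketch below assumes ATG as the only start codon and a hypothetical minimum length `min_codons`; real prokaryotic gene finders also consider alternative start codons and ribosome binding sites.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Return (strand, start, end, orf) tuples for ATG...stop ORFs.

    Coordinates are 0-based, end-exclusive, and relative to the scanned strand.
    The length threshold counts codons including the stop codon.
    """
    orfs = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for pos in range(frame, len(strand_seq) - 2, 3):
                codon = strand_seq[pos:pos + 3]
                if start is None and codon == "ATG":
                    start = pos  # first start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (pos + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, pos + 3, strand_seq[start:pos + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


example = "CC" + "ATG" + "GCT" * 4 + "TAA" + "GG"
found = find_orfs(example, min_codons=5)
print(found)
```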
Eukaryotes: introns complicate ORF scanning
Recognition of intron splice sites is difficult
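Why introns break simple ORF scanning can be shown with a toy gene. Reading the genomic sequence codon by codon runs into the intron and hits a stop codon there, while the spliced mRNA gives the full ORF. The exon and intron sequences below are invented; the intron starts with GT and ends with AG, as in the common splice-site consensus.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def read_orf(seq):
    """Return the codons from the first ATG up to and including the first in-frame stop."""
    start = seq.find("ATG")
    if start == -1:
        return []
    codons = []
    for pos in range(start, len(seq) - 2, 3):
        codon = seq[pos:pos + 3]
        codons.append(codon)
        if codon in STOP_CODONS:
            break
    return codons


exon_1 = "ATGGCTGAA"
intron = "GTAAGTTTTTAG"   # hypothetical intron: GT...AG
exon_2 = "CCGTGGTAA"

genomic = exon_1 + intron + exon_2
spliced = exon_1 + exon_2  # mature mRNA after splicing

genomic_orf = read_orf(genomic)
spliced_orf = read_orf(spliced)
print(genomic_orf)  # runs into the intron and stops at a stop codon inside it
print(spliced_orf)  # the real coding sequence
```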
Gene prediction in eukaryotes
3 strategies
• Ab initio prediction: uses statistical
parameters (‘rules’) to find exons, introns,
terminators, promoters, etc. Not evidence-based.
• Homology-based: uses known genes from
related organisms to find genes
• Transcript-based: uses expression data
(cDNA) to find genes
Ab initio gene prediction
• Not evidence-based
• Parameters for:
– Start and Stop codons
– Length of open reading frame (ORF)
– Ribosome binding site or Kozak sequence
– Codon usage
– Bias towards G or C as third base
– Average GC content (CpG islands)
– TATA box
– Polyadenylation sites (consensus AATAAA)
– 5’ and 3’ Splice signals
– Hexamer base composition (discrimination between coding and non-coding)
• These parameters are organism-specific. Many differences between
e.g. animals, plants, fungi
• Train the predictor using a subset of known genes
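Two of the parameters listed above, codon usage and the bias towards G or C at the third codon position, can be estimated from a set of known coding sequences. The sketch below uses one made-up coding sequence as "training data"; the function names are hypothetical, and a real predictor would be trained on many genes from the organism of interest.

```python
from collections import Counter


def split_codons(cds):
    """Split a coding sequence into codons, ignoring an incomplete trailing codon."""
    return [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]


def codon_usage(cds):
    """Count how often each codon occurs."""
    return Counter(split_codons(cds))


def gc_content(seq):
    """Fraction of G or C bases in the whole sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


def gc3_content(cds):
    """Fraction of codons with G or C at the third position."""
    codons = split_codons(cds)
    return sum(codon[2] in "GC" for codon in codons) / len(codons)


training_cds = "ATGGCCGAGCTGAAGTAA"  # invented example gene
usage = codon_usage(training_cds)
overall_gc = gc_content(training_cds)
third_position_gc = gc3_content(training_cds)
print(usage)
print(overall_gc, third_position_gc)
```

In this example the third codon position is clearly richer in G and C than the sequence as a whole, which is the kind of signal a predictor can use to separate coding from non-coding sequence.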
Homology-based gene prediction
• Uses genes (proteins) from other organisms to
find genes
• Works for well-conserved genes
Source: http://bioinformatica.upf.edu
Transcript-based gene prediction
• Uses cDNA to identify genomic regions that
are expressed
• ESTs: expressed sequence tags
Considerations for mRNA analysis
• Open reading frames
– Mature mRNA contains ORF
– All internal exons contain open “read-through”
– Pre-start and post-stop sequences are UTRs
• Posttranscriptional modification
– 5’-CAP, polyA tail, splicing
• Multiple translation products
– One gene – many proteins via alternative splicing
Alternative splicing
RNA-seq coverage curve
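A coverage curve is simply the number of aligned reads covering each position of the genome. Expressed exons show up as blocks of coverage and introns as gaps. The sketch below assumes reads are given as ungapped (start, end) intervals with 0-based, end-exclusive coordinates; the alignments are invented, and spliced reads that span an intron are not handled.

```python
def coverage(length, alignments):
    """Per-base read depth for a region of `length` bases."""
    depth = [0] * length
    for start, end in alignments:
        for pos in range(max(start, 0), min(end, length)):
            depth[pos] += 1
    return depth


def covered_blocks(depth, min_depth=1):
    """Return (start, end) intervals where depth >= min_depth (candidate exons)."""
    blocks = []
    block_start = None
    for pos, value in enumerate(depth):
        if value >= min_depth and block_start is None:
            block_start = pos
        elif value < min_depth and block_start is not None:
            blocks.append((block_start, pos))
            block_start = None
    if block_start is not None:  # block runs to the end of the region
        blocks.append((block_start, len(depth)))
    return blocks


reads = [(0, 5), (2, 7), (12, 18), (14, 20)]  # hypothetical read alignments
depth = coverage(20, reads)
blocks = covered_blocks(depth)
print(depth)
print(blocks)  # two covered blocks separated by an uncovered gap
```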
Overview gene predictors
• Fungal genome: Schizophyllum commune
Various gene predictors
ESTs (mRNA)
Homology with other organisms (BLAST)
State of the art: BRAKER1
• Ab initio gene predictor, based on Augustus
• Can be trained using RNA-seq data
• First it learns the rules from experimental data
• Then applies those rules to the entire genome
Introns
Exons
Which gene prediction is correct?
• Gene predictors routinely make different
predictions
• Manual curation is essential
• Based on evidence
– RNA-seq expression data
– Homology
Correctly or incorrectly predicted?
Common problem in fungal genomes
• Fungal genomes have very high gene density
• Neighboring genes are sometimes fused into one gene
• Chimera
Functional annotation
• Next step: what do these predicted genes do?
• Homology-based
– Known gene from another (related?) organism
• Presence of conserved protein domains
– PFAM domains: DNA binding domain, kinase, etc.
• Transcript analysis
– Differentially regulated?
– RNA-Seq
• Proteomics
Functional annotation
• PFAM domains (protein family)
