Sebastian Schmeier
s.schmeier@massey.ac.nz
http://sschmeier.com/bioinf-workshop/
Next-generation sequencing and quality control:
An introduction
2016
Overview
• Typical workflow of a genomics experiment
• Genome versus transcriptome
• DNA sequencing
• The FASTQ-file format
• Sources of sequencing errors
• Quality assessment of a sequencing run
• Pre-processing sequencing data
Learning outcomes
• Be able to describe what next-generation sequencing is.
• Be able to describe the FASTQ sequence format and understand what information it contains and how it can be used.
• Be able to describe what the Phred-based quality score is.
• Be able to describe the sources of sequencing errors.
• Be able to compute, investigate and evaluate the quality of sequence data from a sequencing experiment.
• Be able to distinguish between a good and a bad sequencing run.
• Be able to describe the steps involved in cleaning sequencing data.
Typical workflow of a genomics experiment
Scientific question
Experimental design
Run the experiment
Assess data quality
Analyse the data
Meaningful results
Annotations on the workflow figure:
• This is where the biological experiment happens
• This is the bottleneck of the whole experiment
• The essential part to make all downstream analysis work
Genome versus transcriptome
Genome
The entirety of an organism's hereditary information. It is encoded either in DNA or, for many types of viruses, in RNA.
Transcriptome
The set of all RNA molecules, including messenger RNA, ribosomal RNA, transfer RNA, and other non-coding RNA, produced in one cell or a population of cells.
Name	Base pairs	Size
HIV	9,749	9.7 kb
E. coli	4,600,000	4.6 Mb
Yeast	12,100,000	12.1 Mb
Drosophila	130,000,000	130 Mb
Homo sapiens	3,200,000,000	3.2 Gb
Marbled lungfish	130,000,000,000	130 Gb
"Amoeba" dubia	670,000,000,000	670 Gb
DNA sequencing
DNA sequencing is the process of determining the nucleotide order of a given DNA fragment.
First-generation sequencing:
• 1977: Sanger sequencing method developed (chain-termination method)
• 2001: The Sanger method produced a draft sequence of the human genome
Next-generation sequencing (NGS):
• Demand for low-cost sequencing has driven the development of high-throughput sequencing (or NGS) technologies that parallelize the sequencing process, producing thousands or millions of sequences concurrently
• 2004: 454 Life Sciences marketed a parallelized version of pyrosequencing
Result of a sequencing run
Short read sequences
• The result of an NGS run is a collection of short nucleotide sequences (reads) of varying length (~40-400 nt), depending on the technology used to generate the reads.
• Usually a read's quality is good at the beginning of the read, and errors accumulate the longer the read gets => IMPORTANT
Illumina sequencing
MiSeq:
• Bench-top sequencer
• Produces around 30 million reads/run
• Reads are up to 250 nt
HiSeq:
• Large-scale sequencer
• 4 billion reads/run
• Reads are up to 150 nt
The Illumina systems accumulate errors towards the end of the read sequence.
Source: http://www.illumina.com/
Illumina sequencing (2)
• An Illumina flowcell is a surface to which sequencing adaptors are covalently attached.
• DNA with complementary adaptors is attached, clonally amplified, and then sequenced by synthesis.
• Each flowcell is subdivided into hundreds of tiles.
Illumina sequencing (3)
[Figure. Source: Addressing challenges in the production and analysis of Illumina sequencing data. Kircher et al. BMC Genomics, 2011, 12:382]
The FASTQ-file format
The file format that you will encounter soon is called FASTQ.
@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTC
+
‘'*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>
The FASTQ-file format (2)
Each record in a FASTQ file consists of four lines:
• Line 1: Sequence ID (starts with @)
• Line 2: Sequence
• Line 3: Separator (+)
• Line 4: Phred quality of the corresponding nucleotide (ASCII code)
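The four-line layout can be read with a few lines of code. The following is a minimal sketch in Python, assuming an uncompressed file in which every record occupies exactly four lines (no wrapped sequences). The function name `read_fastq` and the toy record are illustrative choices, not part of the workshop material or of any library.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (sequence_id, sequence, quality) for each four-line record."""
    while True:
        header = handle.readline().rstrip("\r\n")
        if not header:  # end of file (or a blank line): stop reading
            return
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch: {header!r}")
        yield header[1:], sequence, quality


# Toy record invented for illustration (not taken from a real run).
example = io.StringIO("@read1\nGATTACA\n+\nIIIIHH#\n")
records = list(read_fastq(example))
print(records)
```

The length check reflects the rule that there is one quality character per nucleotide.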
The FASTQ-file format (5)
The sequence ID of the example record follows the format introduced with Casava 1.8.
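To make the Casava 1.8 sequence ID more concrete, the sketch below splits the example ID into named fields. It assumes the commonly documented layout `instrument:run:flowcell:lane:tile:x:y read:filtered:control:index`; the class name, the field names and the function name are labels chosen for this sketch, not an official API.

```python
from typing import NamedTuple


class CasavaHeader(NamedTuple):
    instrument: str
    run_number: int
    flowcell_id: str
    lane: int
    tile: int
    x_pos: int
    y_pos: int
    read: int
    is_filtered: bool
    control_number: int
    index_sequence: str


def parse_casava_header(header: str) -> CasavaHeader:
    """Split a Casava 1.8 style sequence ID into its named fields."""
    # The two halves are separated by whitespace (space or tab).
    location, description = header.lstrip("@").split()
    instrument, run, flowcell, lane, tile, x_pos, y_pos = location.split(":")
    read, filtered, control, index = description.split(":")
    return CasavaHeader(
        instrument, int(run), flowcell, int(lane), int(tile),
        int(x_pos), int(y_pos), int(read), filtered == "Y",
        int(control), index,
    )


parsed = parse_casava_header(
    "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"
)
print(parsed.lane, parsed.tile, parsed.index_sequence)  # 2 2104 ATCACG
```

The `tile` field is what makes a per-tile quality assessment possible later on.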
Sequencing errors
• Sequencing errors, or mis-called bases, occur when a sequencing method calls one or more bases incorrectly, leading to an incorrect read.
• The chance of a sequencing error is generally known and quantifiable, thanks to extensive testing and calibration of the sequencing machines.
• Each base in a read is assigned a quality score, indicating confidence that the base has been called correctly.
Source: https://www.broadinstitute.org/crd/wiki/index.php/Sequencing_error
The FASTQ-file format (6)
• FASTQ: Phred base quality
• One ASCII character per nucleotide.
• Each character encodes a quality Q = -10*log10(P), where P is the error probability.
• Example: P = 0.1 gives Q = -10*log10(0.1) = 10
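The relationship between error probability, Phred score and ASCII character can be worked through in code. This is a minimal sketch; the function names are illustrative, and the ASCII offset of 33 is an assumption (it is the common choice for recent Illumina output, while some older pipelines used 64).

```python
import math
from typing import List


def phred_from_error_probability(p: float) -> float:
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def error_probability_from_phred(q: float) -> float:
    """Invert the formula: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Turn one ASCII character per nucleotide into integer Phred scores.

    The offset of 33 is an assumption; check which encoding your data uses.
    """
    return [ord(char) - offset for char in quality]


print(phred_from_error_probability(0.1))   # 10.0
print(error_probability_from_phred(20))    # 0.01
print(decode_quality("!5CF>"))             # [0, 20, 34, 37, 29]
```

Read the other way round, a score of 20 means one expected miscall in 100 bases and a score of 30 one in 1,000.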
Sources of sequencing errors
• The importance and the relative effect of each error source on downstream applications depend on many factors, such as:
sample acquisition
reagents
tissue type
protocol
instrumentation
experimental conditions
analytical application
the ultimate goal of the study
Source: The role of replicates for error mitigation in next-generation sequencing. Robasky et al. Nature Reviews Genetics, 2014, 15, 56-62
Sources of sequencing errors (2)
Sequencing errors can stem from any time point throughout the experimental workflow, including initial sample preparation, library preparation and sequencing.
Source: Robasky et al. Nature Reviews Genetics, 2014, 15, 56-62
Sources of sequencing errors (3)
Sample preparation
• User errors; for example, mislabelling
• Degradation of DNA and/or RNA from preservation methods; for example, tissue autolysis, nucleic acid degradation and crosslinking during the preparation of formalin-fixed, paraffin-embedded (FFPE) tissues
• Alien sequence contamination; for example, those of mycoplasma and xenograft hosts
• Low DNA input
Source: Robasky et al. Nature Reviews Genetics, 2014, 15, 56-62
Sources of sequencing errors (4)
Library preparation
• User errors; for example, carry-over of DNA from one sample to the next and contamination from previous reactions
• PCR amplification errors
• Primer biases; for example, binding bias, methylation bias, biases that result from mispriming, nonspecific binding and the formation of primer dimers, hairpins and interfering pairs, and biases that are introduced by having a melting temperature that is too high or too low
• 3ʹ-end capture bias that is introduced during poly(A) enrichment in high-throughput RNA sequencing
• Private mutations; for example, those introduced by repeat regions and mispriming over private variation
• Machine failure; for example, incorrect PCR cycling temperatures
• Chimeric reads
• Barcode and/or adaptor errors; for example, adaptor contamination, lack of barcode diversity and incompatible barcodes
Source: Robasky et al. Nature Reviews Genetics, 2014, 15, 56-62
Sources of sequencing errors (5)
Sequencing and imaging
• User errors; for example, cluster crosstalk caused by overloading the flow cell
• Dephasing; for example, incomplete extension and addition of multiple nucleotides instead of a single nucleotide
• 'Dead' fluorophores, damaged nucleotides and overlapping signals
• Sequence context; for example, GC richness, homologous and low-complexity regions, and homopolymers
• Machine failure; for example, failure of laser, hard drive, software and fluidics
• Strand biases
Source: Robasky et al. Nature Reviews Genetics, 2014, 15, 56-62
Sources of sequencing errors (6)
It is important to remember that one can find somatic (acquired) mutations as well. These are not sequencing errors but can be mistaken for them.
Source: Robasky et al. Nature Reviews Genetics, 2014, 15, 56-62
Assessing quality: Reads
• If the quality of the reads is bad, we can trim the low-quality nucleotides off the end of the reads.
• Not trimming the end has a huge influence on downstream processes, e.g. assemblies.
[Figure: quality plotted against position in read for a good run and a bad run (trimming needed). Source: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/]
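The kind of per-position summary shown in such quality plots can be approximated by averaging the Phred scores at each read position. The sketch below is a simplified illustration, not how any particular tool computes its plots; the function name is hypothetical, the quality strings are invented, and an ASCII offset of 33 is assumed.

```python
from typing import Iterable, List


def mean_quality_per_position(qualities: Iterable[str], offset: int = 33) -> List[float]:
    """Average Phred score at each read position across all reads.

    Reads may differ in length; each position is averaged over the
    reads that are long enough to cover it.
    """
    totals: List[int] = []
    counts: List[int] = []
    for quality in qualities:
        for position, char in enumerate(quality):
            if position == len(totals):  # first read reaching this position
                totals.append(0)
                counts.append(0)
            totals[position] += ord(char) - offset
            counts[position] += 1
    return [total / count for total, count in zip(totals, counts)]


# Invented quality strings: good at the start, worse towards the end.
toy_qualities = ["IIII5", "IIH5#", "II5"]
means = mean_quality_per_position(toy_qualities)
print(means)  # [40.0, 40.0, 33.0, 30.0, 11.0]
```

A steady drop towards the end of the list, as in this toy data, is the pattern that suggests trimming.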
Assessing quality: Reads (2)
[Figure. Source: http://solexaqa.sourceforge.net/]
Assessing quality: Tiles
• One can also assess the quality of a run based on the tiles of a lane of a flowcell, to spot problems with a particular tile on a lane, e.g. bubbles in the reagents.
• The homogeneity of the Illumina process ensures that the relative frequencies are similar from tile to tile and distributed uniformly across each tile when the machine is functioning properly.
• Major discrepancies in these conditions can be discerned by sight.
• Many such discrepancies are small and their effects are limited to one, or a few, tiles.
Assessing quality: Tiles (2)
• Encoded in the FASTQ file is the flowcell tile from which each read came.
• The graph allows you to look at the quality scores from each tile across all of your bases to see if there was a loss in quality associated with only one part of the flowcell.
[Figure: per-tile quality for a good run and a bad run. Source: http://solexaqa.sourceforge.net/]
Assessing quality: Tiles (3)
• The plot shows the deviation from the average quality for each tile.
• The colours are on a cold-to-hot scale.
• Cold colours mark positions where the quality was at or above the average for that base in the run.
• Hotter colours indicate that a tile had worse qualities than other tiles for that base.
[Figure: per-tile quality deviation for a good run and a bad run. Source: http://solexaqa.sourceforge.net/]
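The deviation that such a per-tile plot displays can be sketched as the mean quality per position within a tile minus the mean over the whole run. This is a simplified illustration under stated assumptions: the tile number has already been taken from the sequence ID, all reads have the same length, the ASCII offset is 33, and the function name and data are invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def tile_quality_deviation(
    reads: Iterable[Tuple[int, str]], offset: int = 33
) -> Dict[int, List[float]]:
    """For each tile, return mean quality per position minus the run-wide mean.

    `reads` holds (tile, quality string) pairs of equal length. Negative
    values mark positions where a tile is worse than the run average.
    """
    per_tile: Dict[int, List[List[int]]] = defaultdict(list)
    for tile, quality in reads:
        per_tile[tile].append([ord(char) - offset for char in quality])

    # Run-wide mean per position, taken over all reads of all tiles.
    all_scores = [scores for rows in per_tile.values() for scores in rows]
    run_mean = [sum(column) / len(column) for column in zip(*all_scores)]

    deviation: Dict[int, List[float]] = {}
    for tile, rows in per_tile.items():
        tile_mean = [sum(column) / len(column) for column in zip(*rows)]
        deviation[tile] = [t - r for t, r in zip(tile_mean, run_mean)]
    return deviation


# Invented data: tile 2104 is fine, tile 2105 loses quality at the last base.
toy_reads = [(2104, "III"), (2104, "III"), (2105, "II5"), (2105, "II5")]
deviations = tile_quality_deviation(toy_reads)
print(deviations)  # {2104: [0.0, 0.0, 10.0], 2105: [0.0, 0.0, -10.0]}
```

In this toy run, tile 2105 would show up as a "hot" cell at the last position.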
Assessing quality: Tiles (4)
[Figure: good run versus bad run. Source: http://solexaqa.sourceforge.net/]
Assessing quality: Data processing
• Adapter trimming: If not already done, we can remove the adapters used for sequencing.
[Figure: example for Illumina TruSeq, showing the DNA sequence of interest between the universal adapter and the indexed adapter with its 6-base index region]
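Adapter trimming can be illustrated with a deliberately simple sketch: cut the read where the adapter starts, including the case where only the beginning of the adapter fits at the end of the read. It uses exact string matching only, whereas dedicated trimming tools also tolerate mismatches. The function name, the `min_overlap` parameter and the adapter string in the example are assumptions for illustration; use the sequence documented for your library kit.

```python
from typing import Tuple


def trim_adapter(
    sequence: str, quality: str, adapter: str, min_overlap: int = 3
) -> Tuple[str, str]:
    """Cut the read at the first exact adapter match, or at a partial
    adapter match (at least `min_overlap` bases) at the end of the read."""
    position = sequence.find(adapter)
    if position == -1:
        # Look for an adapter prefix that runs off the 3' end of the read.
        longest = min(len(adapter) - 1, len(sequence))
        for overlap in range(longest, min_overlap - 1, -1):
            if sequence.endswith(adapter[:overlap]):
                position = len(sequence) - overlap
                break
    if position == -1:
        return sequence, quality  # no adapter found, read unchanged
    # Sequence and quality are cut at the same place to stay in sync.
    return sequence[:position], quality[:position]


ADAPTER = "AGATCGGAAGAGC"  # placeholder adapter start, an assumption

full = trim_adapter("ACGTACGTAGATCGGAAGAGCTT", "I" * 23, ADAPTER)
partial = trim_adapter("ACGTACGTAGATC", "I" * 13, ADAPTER)
clean = trim_adapter("ACGTACGT", "I" * 8, ADAPTER)
print(full, partial, clean)
```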
Assessing quality: Data processing (2)
• Filtering: We can remove all reads that do not have a particular quality over the read length, e.g. at least Q20 for 80% of the read.
[Figure: reads with good-quality and bad-quality bases. Too many errors? Discard.]
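The filtering rule "at least Q20 for 80% of the read" translates directly into code. A minimal sketch, assuming an ASCII offset of 33; the function and parameter names are illustrative.

```python
def passes_quality_filter(
    quality: str,
    min_quality: int = 20,
    min_fraction: float = 0.8,
    offset: int = 33,
) -> bool:
    """Return True if at least `min_fraction` of the bases reach `min_quality`."""
    if not quality:
        return False  # an empty read carries no information
    good = sum(1 for char in quality if ord(char) - offset >= min_quality)
    return good / len(quality) >= min_fraction


# "I" encodes Q40 and "#" encodes Q2 under the assumed offset of 33.
print(passes_quality_filter("IIIIIIII##"))  # True: 8 of 10 bases reach Q20
print(passes_quality_filter("IIIII#####"))  # False: only 5 of 10 bases reach Q20
```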
Assessing quality: Data processing (3)
• Cropping: We can try to remove all nt from both ends that do not fulfil a certain quality.
[Figure: read with bad-quality bases at both ends around a good-quality core]
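Cropping can be sketched as moving in from both ends of the read until a base meets the quality threshold. This is one simple interpretation of the step; trimming tools often use sliding windows or other criteria instead. The threshold of 20, the ASCII offset of 33 and the function name are assumptions for illustration.

```python
from typing import Tuple


def crop_low_quality_ends(
    sequence: str, quality: str, min_quality: int = 20, offset: int = 33
) -> Tuple[str, str]:
    """Remove bases below `min_quality` from both ends of a read."""
    scores = [ord(char) - offset for char in quality]
    start, end = 0, len(scores)
    while start < end and scores[start] < min_quality:
        start += 1  # drop low-quality bases at the 5' end
    while end > start and scores[end - 1] < min_quality:
        end -= 1    # drop low-quality bases at the 3' end
    return sequence[start:end], quality[start:end]


cropped = crop_low_quality_ends("NACGTACGTT", "#IIIIIII##")
print(cropped)  # ('ACGTACG', 'IIIIIII')
```

Low-quality bases in the middle of the read are left untouched; only the ends are cropped.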
Assessing quality: Data processing (4)
• Removal: We can remove reads that are too short after cropping.
[Figure: cropped read with only a short good-quality fragment left. Fragment too short? Discard.]
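The removal step is a plain length filter applied after cropping. In the sketch below, the minimum length of 30 nt is an arbitrary placeholder and the function name is hypothetical; a sensible cut-off depends on the downstream analysis.

```python
from typing import List, Tuple

Read = Tuple[str, str, str]  # (sequence ID, sequence, quality)


def remove_short_reads(reads: List[Read], min_length: int = 30) -> List[Read]:
    """Keep only reads with at least `min_length` nucleotides."""
    return [read for read in reads if len(read[1]) >= min_length]


# Invented reads: one long enough, one too short after cropping.
toy_reads = [
    ("read1", "ACGT" * 10, "I" * 40),
    ("read2", "ACGTACGT", "I" * 8),
]
kept = remove_short_reads(toy_reads)
print(f"kept {len(kept)} of {len(toy_reads)} reads")  # kept 1 of 2 reads
```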
Assessing quality: Data processing (5)
• Adapter trimming: If not already done, we can remove the adapters used for sequencing.
• Filtering: We can remove all reads that do not have a particular quality over the read length, e.g. at least Q20 for 80% of the read.
• Cropping: We can try to remove all nt from both ends that do not fulfil a certain quality.
• Removal: We can remove reads that are too short after cropping.
In the end, we work with an adjusted set of sequencing reads for which we are more certain that they represent correct nt sequences from the genome.
=> However, filtering/trimming does not always improve things, as we lose information.
References
The role of replicates for error mitigation in next-generation sequencing. Robasky et al. Nature Reviews Genetics, 2014, 15, 56-62
Addressing challenges in the production and analysis of Illumina sequencing data. Kircher et al. BMC Genomics, 2011, 12:382

