Next generation sequencing for snp discovery(final)

GENOME RE SEQUENCING FOR
SNP DISCOVERY / GENOTYPING
MARKER
Presented by
Monoj Sutradhar
PALB 3243
Jr. M.sc(Pl. Biotech)
UAS,GKVK,Bangalore
9/8/2014 1

What are SNPs ?
ACGTTTGGATAC
TGCAAACCTATG
ACGTTTGTATAC
TGCAAACATATG
Single nucleotide polymorphisms consist of a single
change in the DNA code
SNPs occur with various allele frequencies. Those in
the 20-40% range are useful for genetic mapping.
Those at frequencies between 1% and 20% may be
used with candidate gene approaches. Usually bi-allelic.
Changes at <1% are called variants
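The two example duplexes above differ at a single position. A minimal Python sketch, assuming two already aligned sequences of equal length, that locates the mismatch and applies the frequency bands quoted on this slide. The function names are illustrative only, and how a boundary value such as exactly 20% is assigned is an assumption, since the slide does not say.

```python
def find_snps(ref, alt):
    """Return (position, ref_base, alt_base) for each mismatch; positions are 1-based."""
    if len(ref) != len(alt):
        raise ValueError("sequences must be aligned and of equal length")
    return [(i + 1, r, a) for i, (r, a) in enumerate(zip(ref, alt)) if r != a]


def classify_by_frequency(minor_allele_freq):
    """Apply the slide's frequency bands (boundary handling is an assumption)."""
    if minor_allele_freq < 0.01:
        return "variant"
    if minor_allele_freq < 0.20:
        return "candidate gene approaches"
    if minor_allele_freq <= 0.40:
        return "genetic mapping"
    return "outside the bands on the slide"


# Top strands of the two duplexes shown on the slide.
snps = find_snps("ACGTTTGGATAC", "ACGTTTGTATAC")
print(snps)
```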

What are the effects of SNPs ?
Where | Result | Type | Effect
In coding region | May be silent, e.g., UUG→CUG, leu in both cases | sSNP | Usually no change in phenotype
In coding region | May change amino acid sequence, e.g., UUC→UUA, phe to leu. Some characterize these as the least common and most valuable SNPs, many being patented | cSNP | Phenotype change (may be subtle depending on amino acid replacement and position)
In coding region | May create a "Stop" codon, e.g., UCA→UGA, ser to stop | - | Phenotype change
In coding region | May affect the rate of transcription (up- or down-regulate) | cSNP | Possible phenotype change
Other regions | No effect on gene products (7). May act as genetic markers for multi-component diseases. These are sometimes called anonymous SNPs and are the most common | rSNP | -

How many SNPs are there ?
It is estimated that the human genome contains between
3 million and 6 million SNPs spaced irregularly at
intervals of 500 to 1,000 bases.
The SNP Consortium estimates that as many as 300,000
SNPs may be needed to fuel studies.
100,000 or more SNPs may be required for complex
disease gene discovery
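The spacing and count figures on this slide agree with each other if the haploid human genome is taken as roughly 3 billion bases. That genome size is an assumption and is not stated on the slide. A short check:

```python
GENOME_SIZE_BP = 3_000_000_000  # assumed haploid human genome size, not from the slide


def expected_snp_count(genome_size_bp, mean_spacing_bp):
    """Number of SNPs if one occurs, on average, every mean_spacing_bp bases."""
    return genome_size_bp // mean_spacing_bp


low = expected_snp_count(GENOME_SIZE_BP, 1000)   # one SNP per 1,000 bases
high = expected_snp_count(GENOME_SIZE_BP, 500)   # one SNP per 500 bases
print(low, high)
```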

SNP Discovery
SNP Discovery refers to the initial identification of new
SNPs.
The established method is DNA sequencing
with subsequent data analysis. Some indirect Discovery
techniques (e.g., dHPLC, SSCP) only indicate that a SNP
(or other mutation) exists.
DNA sequencing of multiple individuals is used to determine
the point and type of polymorphism.
Low throughput, based on established DNA sequencing
analyses or collected data (also based on electrophoretic data)

SNP Validation
SNP Validation refers to genetic validation, the process
of ensuring that the SNP is not due to sequencing error
and that it is not extremely rare. This should not be
confused with assay, target or regulatory validation.
Confirmation of SNPs found in Discovery
Larger numbers of individual samples to get statistical
data on occurrence in the population
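One way to judge whether an apparent SNP could be a sequencing error is to ask how likely the observed number of non-reference reads would be if errors alone produced them. A minimal sketch, assuming independent errors at a fixed per-base rate. This simple binomial tail is an illustration, not the validation procedure used in the presentation, and the depth, read count, and 0.1% error rate in the usage line are example values only.

```python
from math import comb


def prob_at_least_k_errors(depth, k, error_rate):
    """P(X >= k) for X ~ Binomial(depth, error_rate), summed over the upper tail."""
    return sum(
        comb(depth, i) * error_rate**i * (1 - error_rate) ** (depth - i)
        for i in range(k, depth + 1)
    )


# 10 non-reference reads out of 30 at an assumed 0.1% error rate.
p = prob_at_least_k_errors(depth=30, k=10, error_rate=0.001)
print(p)
```

A very small probability suggests the site is unlikely to be explained by sequencing error alone. Whether the allele is rare in the population still has to be checked in more individuals.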

SNP Screening
SNP Screening refers to researchers running thousands of
genotypes (many SNPs or many individuals or both)
Thousands to hundreds of thousands of samples per day
Two different screening strategies
- Many SNPs in a few individuals
- A few SNPs in many individuals
Different strategies will require different tools
Important in determining markers for complex genetic states

Steps of SNP discovery
Sequence clustering
Cluster refinement
Multiple alignment
SNP detection
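The final step, SNP detection, can be illustrated on an already aligned cluster. A minimal sketch, assuming the clustering, refinement, and multiple alignment steps have produced equal-length rows with '-' for gaps. The `min_minor_count` filter is an assumed safeguard against single-read errors and is not specified on the slide.

```python
from collections import Counter


def detect_snps(aligned_rows, min_minor_count=2):
    """Return (1-based column, allele counts) for columns with two or more supported bases."""
    if len({len(row) for row in aligned_rows}) > 1:
        raise ValueError("rows must have equal length")
    snps = []
    for col, bases in enumerate(zip(*aligned_rows), start=1):
        counts = Counter(b for b in bases if b in "ACGT")  # gaps are ignored
        supported = {b: n for b, n in counts.items() if n >= min_minor_count}
        if len(supported) >= 2:
            snps.append((col, supported))
    return snps


alignment = [
    "ACGTTTGGATAC",
    "ACGTTTGGATAC",
    "ACGTTTGTATAC",
    "ACGTTTGTATAC",
    "ACGTATGGATAC",  # lone difference at column 5, removed by the filter
]
print(detect_snps(alignment))
```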

Initial SNP Discovery and Mapping
SNP discovery using Sanger re-sequencing
- Mostly genic
- BAC-end and BAC subclones
SNP genotyping and mapping
- Sequenom mass spectrometer
- Luminex Flow cytometer
- Illumina Inc. GoldenGate™ assay

Roche (454) Sequencing
Pyrosequencing was the first of the new highly parallel sequencing
technologies to reach the market [24]. It is commonly referred to as
454 sequencing after the name of the company that first
commercialized it.
It is a sequencing-by-synthesis (SBS) method where single fragments
of DNA are hybridized to a capture bead array and the beads are
emulsified with the reagents necessary to PCR-amplify the
individually bound template.
Each bead in the emulsion acts as an independent PCR where millions
of copies of the original template are produced and bound to the
capture beads, which then serve as the templates for the subsequent
sequencing reaction.

The individual beads are deposited into a picotiter plate along with
DNA polymerase, primers, and the enzymes necessary to create light
through the consumption of the inorganic pyrophosphate produced
during sequencing.
The instrument washes the picotiter plate with each of the DNA bases
in turn. As template-specific incorporation of a base by DNA
polymerase occurs, a pyrophosphate (PPi) is produced.
This pyrophosphate is detected by an enzymatic luminometric inorganic
pyrophosphate detection assay (ELIDA) through the generation of a
light signal following the conversion of PPi into ATP.

Shotgun sequencing by PGM/454
[Figure: library fragment and bead. Labels: genomic fragment, adapters, barcode, bead/ISP, adapter-complement sequences]
The idea is that each bead should be amplified
all over with a SINGLE library fragment.

Problem: How do I do PCR to amplify the fragments
without having to use 1 tube for each reaction?

~3.5 μm for Ion Torrent, ~30 μm for 454

Only give polymerase one nucleotide at a time.
If that nucleotide is incorporated, enzymes turn by-products into light.
[Figure: animated pyrosequencing example. A primer (5’ TGCGCGGCCCA 3’) is annealed to the template (3’ ACGCGCCGGGTCAGAACCCGATCGCG 5’) and nucleotides are flowed in the repeating order T, C, A, G. The light recorded at each flow scales with the number of bases incorporated, so the new strand grows as G, T, C, TT, GGG.]
The real power of this method is that it can take place in millions
of tiny wells in a single plate at once.
[Figure: raw 454 data]
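The one-nucleotide-at-a-time idea in the example above can be sketched as a small simulation. This assumes an idealised, noise-free signal in which the light at each flow equals the number of bases incorporated. The flow order T, C, A, G and the strand GTCTTGGGC are read off the slide's example, while the function names are illustrative.

```python
def flowgram(new_strand, flow_order="TCAG"):
    """Return (nucleotide, signal) per flow until the new strand is fully synthesized."""
    unknown = set(new_strand) - set(flow_order)
    if unknown:
        raise ValueError(f"bases not covered by the flow order: {sorted(unknown)}")
    signals = []
    pos = 0
    flow = 0
    while pos < len(new_strand):
        nuc = flow_order[flow % len(flow_order)]
        run = 0
        # A homopolymer run is incorporated in a single flow.
        while pos < len(new_strand) and new_strand[pos] == nuc:
            run += 1
            pos += 1
        signals.append((nuc, run))
        flow += 1
    return signals


def call_bases(signals):
    """Rebuild the sequence from the flow signals."""
    return "".join(nuc * run for nuc, run in signals)


strand = "GTCTTGGGC"
signals = flowgram(strand)
print(signals)
print(call_bases(signals))
```

In real data the signal for a long homopolymer is not an exact integer, which is why runs such as GGG are the typical source of insertion and deletion errors on this platform.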

The instrument repeats the sequential nucleotide wash cycle hundreds
of times to lengthen the sequences.
The 454 GS FLX Titanium XL+ platform currently generates up to 700 Mb
of raw 750 bp reads in a 23 hour run.
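Taking the quoted figures at face value, and reading the yield as 700 megabases, the implied number of reads and the hourly output follow from simple division. This is a back-of-the-envelope sketch with illustrative function names, not a vendor specification.

```python
def reads_per_run(total_bases, read_length):
    """Whole reads that fit in the quoted yield."""
    return total_bases // read_length


def bases_per_hour(total_bases, run_hours):
    """Average output per hour of run time."""
    return total_bases / run_hours


n_reads = reads_per_run(700_000_000, 750)
rate = bases_per_hour(700_000_000, 23)
print(n_reads, round(rate))
```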

Illumina Sequencing
• Illumina technology, acquired by Illumina from Solexa, followed the
release of 454 sequencing.
• With this sequencing approach, fragments of DNA are hybridized to a
solid substrate called a flow cell.
• In a process called bridge amplification, the bound DNA template
fragments are amplified in an isothermal reaction where copies of
the template are created in close proximity to the original.

• This results in clusters of DNA fragments on the flow cell
creating a “lawn” of bound single strand DNA molecules.
• The molecules are sequenced by flooding the flow cell with
a new class of cleavable fluorescent nucleotides and the
reagents necessary for DNA polymerization.
• A complementary strand of each template is synthesized one
base at a time using fluorescently labeled nucleotides.
• The fluorescent molecule is excited by a laser and emits
light, the colour of which is different for each of the four
bases. The fluorescent label is then cleaved off and a new
round of polymerization occurs.
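The colour-per-base readout can be pictured as picking the brightest of four channels in each cycle. This is a deliberately simplified sketch. The intensity values are invented for illustration, and real base callers also correct for effects such as channel cross-talk and phasing, which are ignored here.

```python
CHANNELS = ("A", "C", "G", "T")


def call_cluster(intensities_per_cycle):
    """Call one base per cycle as the channel with the highest intensity."""
    read = []
    for cycle in intensities_per_cycle:
        if len(cycle) != len(CHANNELS):
            raise ValueError("expected one intensity per channel")
        brightest = max(range(len(CHANNELS)), key=lambda i: cycle[i])
        read.append(CHANNELS[brightest])
    return "".join(read)


# Hypothetical intensities for one cluster over three cycles (A, C, G, T channels).
example_cycles = [
    (0.9, 0.1, 0.0, 0.1),
    (0.1, 0.1, 0.8, 0.2),
    (0.0, 0.1, 0.1, 0.7),
]
print(call_cluster(example_cycles))
```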

• Unlike 454 sequencing, all four bases are present for the
polymerization step and only a single nucleotide is incorporated
per cycle.
• The flagship HiSeq2500 sequencing instrument from Illumina
can generate up to 600 Gb per run with a read length of 100 nt
and 0.1% error rate.
• The Illumina technique can generate sequence from opposite
ends of a DNA fragment, so-called paired-end (PE) reads.
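Paired-end reads can be illustrated with a toy fragment. A minimal sketch, assuming read 1 is taken from the 5' end of the forward strand and read 2 from the 5' end of the reverse strand, which is the usual convention. The fragment and the 5 nt read length are arbitrary example values.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def paired_end_reads(fragment, read_length):
    """Return (read1, read2) sequenced inward from the two ends of the fragment."""
    if read_length > len(fragment):
        raise ValueError("read length exceeds fragment length")
    read1 = fragment[:read_length]
    read2 = reverse_complement(fragment)[:read_length]
    return read1, read2


fragment = "ACGTTTGGATACCCGATCGCG"
r1, r2 = paired_end_reads(fragment, 5)
print(r1, r2)
```

Because the two reads come from a fragment of roughly known length, their expected distance and orientation help place reads in repetitive regions during mapping.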

Applied Biosystems (SOLiD) Sequencing
• The SOLiD system was jointly developed by the Harvard
Medical School and the Howard Hughes Medical Institute.
• The library preparation in SOLiD is very similar to Roche/454
in which clonal bead populations are prepared in
microreactors containing DNA template, beads, primers, and
PCR components.
• Beads that contain PCR products amplified by emulsion PCR
are enriched by a proprietary process. The DNA templates
on the beads are modified at their 3′ end to allow
attachment to glass slides.
• A primer is annealed to an adapter on the DNA template and
a mixture of fluorescently tagged oligonucleotides is
pumped into the flow cell.

• When the oligonucleotide matches the template sequence, it is
ligated onto the primer and the unincorporated oligonucleotides are
washed away.
• A charge-coupled device (CCD) camera captures the different colours
attached to the primer. Each fluorescence wavelength corresponds
to a particular dinucleotide combination.
• After image capture, the fluorescent tag is removed and a new set of
oligonucleotides is injected into the flow cell to begin the next
round of DNA ligation.
• This sequencing-by-ligation method in the SOLiD 5500xl platform
generates up to 1,410 million reads of nt each with an error rate of
0.01%
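The link between colours and dinucleotides can be sketched with the widely described two-base encoding, in which four colours cover all 16 dinucleotides. Assigning A=0, C=1, G=2, T=3 and taking the XOR of adjacent bases reproduces the commonly published colour table. The colour numbers below are for illustration and are not tied to specific dyes.

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def encode_colors(seq):
    """One colour (0-3) per overlapping dinucleotide."""
    return [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(seq, seq[1:])]


def decode_colors(first_base, colors):
    """Recover the bases from a known first base plus the colour calls."""
    bases = [first_base]
    for color in colors:
        bases.append(BITS_TO_BASE[BASE_TO_BITS[bases[-1]] ^ color])
    return "".join(bases)


colors = encode_colors("ACGTTTGG")
print(colors)
print(decode_colors("A", colors))
```

Since every base is covered by two overlapping colour calls, a true SNP changes two adjacent colours while a single measurement error changes only one, which is one reason for the low error rate quoted for this platform.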

Software for Sequence Analysis
• Both commercial and noncommercial sequence analysis
software are available for Windows, Macintosh, and Linux
operating systems.
• NGS companies offer proprietary software such as
Consensus Assessment of Sequence and Variation (CASAVA)
for Illumina data and Newbler for 454 data.
• Such software tends to be optimized for its respective
platform but has limited cross applicability to the others.

• Commercially available software such as CLC-Bio
(http://www.clcbio.com/) and SeqMan NGen
(http://www.dnastar.com/t-sub-products-genomics-seqman-ngen.aspx)
provide a friendly user interface, are compatible with different
operating systems, require minimal computing knowledge, and are
capable of performing multiple downstream analyses.
• However, they tend to be relatively expensive, have
limited customizability, and require locally available
high computing power.

• Linux-based software such as Bowtie [59], BWA [60], and
SOAP2/3 [61] has been used widely for the analysis of NGS
data.

Software and Pipelines for SNP Discovery
• Broadly used SNP calling software include Samtools [103],
SNVer [104], and SOAPsnp [74]. Samtools is popular because
of its various modules for file conversion (SAM to BAM and
vice-versa), mapping statistics, variant calling, and assembly
visualization.
• Recently, SOAPsnp has gained popularity because of its tight
integration with SOAP aligner and other SOAP modules
which are constantly upgraded and provide a one stop shop
for the sequencing analysis continuum.

• Variant calling algorithms such as Samtools and
SNVer can be used as stand-alone programs or
incorporated into pipelines for SNP calling.
• A wide array of commonly used file formats such as
SAM, BAM, SOAP, ACE, FASTQ, and FASTA generated
by different read assemblers such as Bowtie, BWA,
SOAP, MAQ, and SeqMan NGen.
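Of the formats listed, FASTQ is the one that carries per-base quality alongside the sequence. A minimal parsing sketch, assuming unwrapped four-line records and Phred+33 quality encoding. Both are common conventions, but they are assumptions here, and the record shown is made up.

```python
import io


def parse_fastq(handle):
    """Yield (read_id, sequence, phred_scores) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # Phred+33: quality score = ASCII code minus 33.
        yield header[1:], seq, [ord(c) - 33 for c in qual]


example = io.StringIO("@read1\nACGT\n+\nII5!\n")
records = list(parse_fastq(example))
print(records)
```

A Phred score Q corresponds to an error probability of 10^(-Q/10), so the score of 20 in the example means a 1% chance that the base call is wrong.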

SNP Discovery
• NGS-derived SNPs have been reported in humans, Drosophila,
wheat, eggplant, rice, Arabidopsis, barley, sorghum, cotton,
common beans, soybean, potato, flax, Aegilops tauschii, alfalfa, oat,
and maize, to name a few.
• SNP discovery using NGS is readily accomplished in small plant
genomes for which good reference genomes are available, such as
rice and Arabidopsis. Although SNP discovery in complex genomes
without a reference genome, such as wheat, barley, oat, and beans,
can be achieved through NGS, several challenges remain in other
nonmodel but economically important crops.

SNP Validation
• The two major factors affecting the SNP validation rate are
sequencing errors and read mapping errors.
• NGS platforms have different levels of sequencing accuracy,
and this may be the most important factor determining the
variation in the validation rate: 88.2% for SOLiD, followed by
Illumina at 85.4% and Roche 454 at 71%.
• The SNP validation rates can be improved by using reduced
representation libraries (RRL) for SNP discovery and by choosing
SNPs within nonrepetitive sequences, including predicted single
copy genes and single copy repeat junctions, which have been shown
to have high validation rates.

Genome-Wide Association Mapping
• Association mapping (AM) panels provide a better resolution, consider
numerous alleles, and may provide faster marker-trait association than
biparental populations.
• AM, often referred to as linkage disequilibrium (LD) mapping, relies on
the nonrandom association between markers and traits.
• In the past few years, NGS technologies have led to the discovery of
thousands, even millions of SNPs, and novel application platforms have
made it possible to produce genome-wide haplotypes of large numbers
of genotypes, making SNPs the ideal marker for GWASs.
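Nonrandom association between two loci is commonly summarised with the statistics D and r². A minimal sketch, assuming phased haplotype counts for two biallelic loci with alleles A/a and B/b. The counts in the usage lines are invented, and the function name is illustrative.

```python
def linkage_disequilibrium(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, r_squared) from counts of the four two-locus haplotypes."""
    total = n_AB + n_Ab + n_aB + n_ab
    if total == 0:
        raise ValueError("no haplotypes supplied")
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total
    p_B = (n_AB + n_aB) / total
    d = p_AB - p_A * p_B  # deviation from independent assortment
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("a locus is monomorphic; r^2 is undefined")
    return d, d * d / denom


d_full, r2_full = linkage_disequilibrium(50, 0, 0, 50)    # alleles always travel together
d_none, r2_none = linkage_disequilibrium(25, 25, 25, 25)  # alleles assort independently
print(r2_full, r2_none)
```

An r² near 1 means a genotyped marker is a good proxy for a nearby causal variant, which is what association mapping depends on.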

• A GWAS performed in rice using ~3.6 million SNPs identified genomic
regions associated with 14 agronomic traits.
• The genetic structure of northern leaf blight, southern leaf blight,
and leaf architecture was studied using ~1.6 million SNPs in maize.
• SNP-based GWAS was also performed on species such as barley for
which a reference genome sequence is not available.
• So far, 951 GWASs have been reported in humans.

Future Perspectives
• SNP discovery incontestably made a quantum leap forward with the advent
of NGS technologies and large numbers of SNPs are now available from
several genomes including large and complex ones.
• Unlike model systems such as humans and Arabidopsis, SNPs from crop
plants remain limited for the time being, but broad access to reasonable
cost NGS promises to rapidly increase the production of reference genome
sequences as well as SNP discovery.
• The NGS technologies have made SNP discovery affordable even in complex
genomes and the technologies themselves have improved tremendously in
the past decade.


Thanks for Attention