New Strategy to detect SNPs
Miguel Galves
José Augusto Quitzau
Zanoni Dias
Scylla Bioinformatics – Brazil
{miguel,jquitzau,zanoni}@scylla.com.br
Agenda
• Introduction
• HIV Dataset
• Detection Strategy
• Trimming Procedure
• Base-Calling Strategies
• Filter Algorithm
• Consensus Algorithm
• Tests Protocol
• Results
• Discussion
Introduction
• Polymorphism: a locus (one or more base pairs) at which different alleles exist among the individuals of a population
– The second most frequent allele must appear in at least 1% of the individuals
• SNP (single nucleotide polymorphism): a polymorphism at a single base-pair position
• SNP discovery is important for understanding complex diseases
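The 1% rule in the definition above can be illustrated with a minimal sketch. The function name `is_polymorphic` and its input format (one observed allele per individual) are assumptions made for this example, not part of the original slides.

```python
from collections import Counter

def is_polymorphic(alleles, min_freq=0.01):
    """Return True if the second most frequent allele reaches min_freq.

    `alleles` holds one observed allele per individual at a single locus
    (hypothetical input format).
    """
    counts = Counter(alleles).most_common()
    if len(counts) < 2:
        return False  # only one allele observed: not polymorphic
    second_count = counts[1][1]
    return second_count / len(alleles) >= min_freq

# 1000 individuals: 985 carry A, 15 carry G (1.5% >= 1%)
print(is_polymorphic(["A"] * 985 + ["G"] * 15))  # True
# 1000 individuals: 995 carry A, 5 carry G (0.5% < 1%)
print(is_polymorphic(["A"] * 995 + ["G"] * 5))   # False
```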
HIV Dataset
• HIV genetic sequences:
– 1302 bp
– Well-conserved region
• 35 batches from 35 individuals, each with:
– 6 PCR reads, with an average size of 690 bp
– 1 validated sequence, with manually annotated SNPs
• HIV reference sequence
Detection Strategy: Overview
• Trimming Procedure
• Base-Calling Correction
• SNP Filter
• Batch Consensus Algorithm
Trimming Procedure
• Filters out low-quality read ends
• Converts phred’s quality sequence to an error probability sequence:
⇒ Q = -10 × log10(p), i.e. p = 10^(-Q/10)
• Scores each base as 0.05 - p (p = 0.05 corresponds to Q = 13)
• Keeps the segment found by the Maximum Score Subsequence Algorithm
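The trimming steps can be sketched as follows. This is a minimal illustration, assuming each base is scored as 0.05 - p and that the maximum-score subsequence is found with a single linear scan (Kadane's algorithm). The function names and the sample quality values are hypothetical.

```python
def error_probabilities(quals):
    """Convert phred quality values to error probabilities: p = 10^(-Q/10)."""
    return [10 ** (-q / 10) for q in quals]

def trim(quals, cutoff=0.05):
    """Return (start, end) of the maximum-score segment, end exclusive.

    Each base scores cutoff - p, so bases better than about Q13 score
    positive. Returns (0, 0) when no base scores positive.
    """
    best_sum, best_start, best_end = 0.0, 0, 0
    run_sum, run_start = 0.0, 0
    for i, p in enumerate(error_probabilities(quals)):
        run_sum += cutoff - p
        if run_sum <= 0:
            run_sum, run_start = 0.0, i + 1  # restart after a bad stretch
        elif run_sum > best_sum:
            best_sum, best_start, best_end = run_sum, run_start, i + 1
    return best_start, best_end

# Made-up read: poor ends, good middle with one weak base (Q12)
quals = [4, 6, 8, 30, 35, 40, 12, 38, 33, 9, 5, 4]
start, end = trim(quals)
print(start, end)        # 3 9
print(quals[start:end])  # [30, 35, 40, 12, 38, 33]
```

Note that the isolated Q12 base stays inside the kept segment: its small negative score is outweighed by the good bases on both sides.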
Base Calling: Area Ratio
• Base calling is done in 5 steps:
1. Chromatogram area delimitation
2. Peak search
3. Choice of the nearest peaks
4. Calculation of the areas of the nearest peaks
5. Calculation of the polymorphic/reference peak area ratio
• If the calculated ratio is above a given threshold, the position is considered a polymorphism.
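The five steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the trace format (one list of intensities per base channel), the window half-width, valley-to-valley peak bounds, and area taken as the sum of samples are all choices made for this example.

```python
def find_peaks(signal, lo, hi):
    """Indices of local maxima of `signal` within [lo, hi)."""
    return [i for i in range(max(lo, 1), min(hi, len(signal) - 1))
            if signal[i - 1] < signal[i] >= signal[i + 1]]

def peak_area(signal, peak, lo, hi):
    """Sum the samples from valley to valley around `peak`, inside [lo, hi)."""
    left = peak
    while left > lo and signal[left - 1] < signal[left]:
        left -= 1
    right = peak
    while right < hi - 1 and signal[right + 1] < signal[right]:
        right += 1
    return sum(signal[left:right + 1])

def area_ratio(traces, called_base, center, half_window=5):
    """Return (other_base, ratio) for the largest competing peak near `center`."""
    # Step 1: delimit the chromatogram area around the called position
    lo = max(center - half_window, 0)
    hi = min(center + half_window + 1, len(traces[called_base]))
    areas = {}
    for base, signal in traces.items():
        peaks = find_peaks(signal, lo, hi)                        # step 2
        if peaks:
            nearest = min(peaks, key=lambda i: abs(i - center))   # step 3
            areas[base] = peak_area(signal, nearest, lo, hi)      # step 4
    ref_area = areas.get(called_base, 0)
    others = {b: a for b, a in areas.items() if b != called_base}
    if ref_area == 0 or not others:
        return None, 0.0
    alt = max(others, key=others.get)
    return alt, others[alt] / ref_area                            # step 5

# Made-up traces: a strong C peak with a weaker A peak underneath
traces = {
    "A": [0, 0, 0, 2, 5, 6, 5, 2, 0, 0, 0],
    "C": [0, 0, 1, 4, 9, 12, 9, 4, 1, 0, 0],
    "G": [0] * 11,
    "T": [0] * 11,
}
alt, ratio = area_ratio(traces, called_base="C", center=5)
print(alt, ratio)  # A 0.5
```

With a hypothetical threshold of, say, 0.3, this position would be reported as a C/A polymorphism. The slides do not state the threshold actually used.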
Base Calling: Area Delimitation
Base Calling: Peak Identification
Base Calling: Average Height Ratio
• Almost the same steps:
1. Chromatogram area delimitation
2. Peak search
3. Choice of the nearest peaks
4. Calculation of the average heights of the nearest peaks
5. Calculation of the polymorphic/reference peak average height ratio
• Again, if the calculated ratio is above a given threshold, the position is considered a polymorphism.
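A minimal sketch of the last two steps, assuming that "average height" means the mean trace intensity over the samples of an already delimited peak. The function names and sample values are hypothetical.

```python
def average_height(peak_samples):
    """Mean intensity over the samples that make up one peak."""
    return sum(peak_samples) / len(peak_samples)

def height_ratio(ref_peak, alt_peak):
    """Average height of the candidate peak relative to the reference peak."""
    return average_height(alt_peak) / average_height(ref_peak)

# Made-up peaks: a wide reference peak and a narrower candidate peak
ref_peak = [1, 4, 9, 12, 9, 4, 1]   # area 40, average height about 5.71
alt_peak = [2, 5, 6, 5, 2]          # area 20, average height 4.0

print(round(height_ratio(ref_peak, alt_peak), 2))  # 0.7
print(sum(alt_peak) / sum(ref_peak))               # 0.5 (area ratio, for comparison)
```

For the same pair of peaks the two criteria give different ratios, because averaging removes the effect of peak width. This is one reason the two strategies can disagree at a fixed threshold.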
Base Calling: Peak Identification
Filter Algorithm
• Analyzes each sequence
• Uses a window-based algorithm to eliminate adjacent SNPs
– Window size: 11 bases
– An empirical scoring system is applied to the polymorphisms in the window
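The slides do not give the scoring rules, so the following is only one plausible reading of a window-based filter: when several candidate SNPs fit inside the same 11-base window, keep the best-scoring one and drop the rest. The `(position, score)` input and the use of a single numeric score are assumptions for illustration.

```python
def filter_adjacent(candidates, window=11):
    """Keep the best-scoring candidate among those sharing a window.

    `candidates` is a list of (position, score) pairs; two candidates are
    treated as adjacent when both fit inside one window of `window` bases,
    i.e. when their positions differ by less than `window`.
    """
    kept = []
    # Visit candidates from best to worst score (greedy selection)
    for pos, score in sorted(candidates, key=lambda c: -c[1]):
        if all(abs(pos - k) >= window for k, _ in kept):
            kept.append((pos, score))
    return sorted(kept)

# Made-up candidates: three clustered near position 100, two further away
candidates = [(100, 0.9), (104, 0.5), (109, 0.7), (150, 0.6), (161, 0.4)]
print(filter_adjacent(candidates))
# [(100, 0.9), (150, 0.6), (161, 0.4)]
```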
Consensus Algorithm
• Rule-based algorithm
– Empirical rules
• Analyzes the whole cross section (alignment column) to define the consensus
– Takes nucleotide frequencies and qualities into account
• Does not create N symbols or tri-allelic polymorphisms.
Consensus Algorithm: Example
Sequence 1  A25  C30  C18  C30  A21
Sequence 2  A30  C25  C15  C25  A16
Sequence 3  -    M18  A9   C30  -
Sequence 4  -    -    S12  G17  T18
Consensus   A    M    S    S    W
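The example can be reproduced with a small sketch. The authors' empirical rules are not given in the slides. The rules below (read each token as a base symbol followed by its phred quality, ignore calls below quality 10, weight each base by summed quality, keep at most the two heaviest bases) are assumptions chosen only because they yield the consensus shown above.

```python
# IUPAC codes for single bases and two-base ambiguities
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
}
CODE = {frozenset(bases): code for code, bases in IUPAC.items()}

def consensus_column(calls, min_quality=10):
    """Consensus symbol for one alignment column.

    `calls` holds (symbol, quality) pairs, with None for a gap.
    `min_quality` is a hypothetical cutoff, not taken from the slides.
    """
    weight = {}
    for call in calls:
        if call is None:
            continue
        symbol, quality = call
        if quality < min_quality:
            continue  # ignore low-quality calls
        for base in IUPAC[symbol]:
            weight[base] = weight.get(base, 0) + quality
    if not weight:
        return "-"
    # At most two alleles: never N, never tri-allelic
    top = sorted(weight, key=weight.get, reverse=True)[:2]
    return CODE[frozenset(top)]

def parse(token):
    """Turn 'C30' into ('C', 30) and '-' into None."""
    return None if token == "-" else (token[0], int(token[1:]))

rows = [
    "A25 C30 C18 C30 A21",
    "A30 C25 C15 C25 A16",
    "-   M18 A9  C30 -",
    "-   -   S12 G17 T18",
]
columns = zip(*[[parse(t) for t in row.split()] for row in rows])
print("".join(consensus_column(col) for col in columns))  # AMSSW
```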
Tests Protocol: Third Party Packages
• Two external packages were used for comparison with our results:
– Polybayes: SNP detection tool based on Bayesian methods
– Polyphred: SNP detection tool based on chromatogram analysis
• An ACE file (contig and consensus) was created for each batch using phrap
• The ACE files were analyzed by Polyphred and Polybayes
• Results were viewed with consed
Tests Protocol: Our Strategy
• Reads trimmed using the Maximum Score Subsequence Algorithm
• Base-calling analysis and correction using the algorithms described previously
• SNP filtering
• Multiple alignment
– Reference sequence as anchor
• Consensus creation
Third Party Results: Polybayes
• Polybayes detected SNPs in only 2 of the 35 batches
Batch     Existing SNPs  Detected SNPs  Correct SNPs  False Positives  False Negatives
Batch 13  12             1              1             0                11
Batch 15  5              1              0             1                5
Third Party Results: Polyphred
• Polyphred detected SNPs in only 4 of the 35 batches
Batch     Existing SNPs  Detected SNPs  Correct SNPs  False Positives  False Negatives
Batch 07  10             1              0             1                10
Batch 14  4              3              0             3                4
Batch 32  26             1              0             1                26
Batch 35  15             8              1             7                14
Trimming Results
• Average read size:
– Before trimming: 690.15 bp
– After trimming: 374.74 bp
– Reduction of 45%
• Average base coverage of the reference sequence:
– Before trimming: 2.69
– After trimming: 1.77
Results: True Positive (%) x batch
Results: False Negative (%) x batch
Results: False Positive (%) x batch
Results: Summary
(%)   Polybayes       Polyphred       Area            Avg. Height
      Avg     SD      Avg     SD      Avg     SD      Avg     SD
TP    0.3     1.4     0.2     1.1     75.4    19.2    52.6    21.5
FN    99.7    1.4     99.8    1.1     23.2    18.4    45.6    21.7
DP    0.0     0.0     0.0     0.0     1.4     4.3     1.8     4.0
FP    2.9     16.9    11.1    31.3    393.9   312.3   554.4   511.3
TP + FN + DP = 100%
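As a quick consistency check, the identity TP + FN + DP = 100% can be verified against the average values in the summary table. The dictionary layout is just a convenience for this sketch; the numbers are copied from the table.

```python
# Average percentages per method, copied from the summary table
summary = {
    "Polybayes":   {"TP": 0.3,  "FN": 99.7, "DP": 0.0},
    "Polyphred":   {"TP": 0.2,  "FN": 99.8, "DP": 0.0},
    "Area":        {"TP": 75.4, "FN": 23.2, "DP": 1.4},
    "Avg. Height": {"TP": 52.6, "FN": 45.6, "DP": 1.8},
}

totals = {method: round(row["TP"] + row["FN"] + row["DP"], 1)
          for method, row in summary.items()}
for method, total in totals.items():
    print(f"{method}: {total}")  # each method sums to 100.0
```

FP is not part of this identity, which is consistent with its averages exceeding 100% for the two chromatogram-based strategies.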
Discussion
• Polybayes and Polyphred need large data sets to produce good results
• Our strategy produces satisfactory results (75.4% average true positives with Area Ratio) given the characteristics of the data:
– Low average coverage
– High proportion of low-quality bases
– High number of polymorphisms (viral DNA)
• The Area Ratio strategy produces better results than the Average Height strategy (75.4% vs. 52.6% average true positives)
Future Work
• Test the algorithms with larger batches and higher average coverage, to improve the consensus algorithm
• Reproduce the experiments using genetic sequences from organisms with more conserved genomes, such as mammals
Acknowledgments
