Base Calling Error Toleration in Reference Based Assembly
Hadi Gharibi
Email: h_gharibi@ee.sharif.edu
Sharif University of Technology
Max Planck Institute for Molecular Genetics
May 2015
How Base Calling Error Can Be Tolerated in Next Generation Sequencing (NGS)
Importance
Challenges
Our Hypothesis
Our Approach
• Deal with Large Amount of Data
• Impact on Sequencing Data Analysis Time and Accuracy
Researchers have developed many base calling algorithms; however, none has resolved the tradeoff between accuracy and time complexity.
• Required Accuracy
• Sequencing Data Analysis Execution Time
Base Calling Error Is Compensated in Downstream Sequencing Steps
• Massive Data
• Diverse Algorithms
Importance: Base Calling Translates Noisy Intensity Data Into Reads
© EMBO Conference, 2014 [1]
© Illumina Inc., 2011 [2]
Figure labels: Intensity, Image Processing, Base Calling, Read, Assembling, Genome
Challenge: Base Calling Errors Are Always Compared
© C. Ye, 2014 [3]
Figure: Per-cycle error rate of several base callers on the PhiX174 test data. The more accurate callers are slower than the others [3].
Fundamental Question:
Our Approach: Analytical Assumptions and Method
Assumptions
• Random Genome
• Single Variations
• Mismatches << Read Length
• Uniform Substitution Error
• Equally Likely Base Errors
Method
• Variant Calling for Re-sequencing
• Derive Variant Calling Errors
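One possible reading of the "Random Genome", "Uniform Substitution Error" and "Equally Likely Base Errors" assumptions can be sketched as follows. The function names, the per-base independence, and the choice of a uniform draw among the three other bases are assumptions made for illustration; the slides do not state the exact error model.

```python
import random

BASES = "ACGT"


def random_genome(length, rng):
    """Draw each base independently and uniformly from A, C, G, T."""
    return "".join(rng.choice(BASES) for _ in range(length))


def add_base_calling_errors(read, error_rate, rng):
    """Substitute each base independently with probability error_rate.

    Assumption: a miscalled base is replaced by one of the three other
    bases, each equally likely (substitutions only, no indels).
    """
    out = []
    for base in read:
        if rng.random() < error_rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)
```

Passing an explicit `random.Random(seed)` instance as `rng` keeps a run reproducible, which helps when comparing error rates across settings.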
Analytical Results: Base Calling Error Is Tolerated by Mapping Mismatch
Figure: Variant Calling Error vs. Base Calling Error
Random Genome
Mismatches={2, 5, 7, 9}
Genome Size ~ 4Mbp
Read Length= 30bp
Variation Rate= 0.01
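To give a feel for how the mismatch budget interacts with base calling error, the sketch below computes the probability that a 30 bp read carries no more base calling errors than the mapper tolerates. It assumes independent per-base errors (a binomial model) and ignores mismatches caused by true variants; it is not the derivation behind the figure, and the 0.05 error rate is a placeholder value.

```python
from math import comb


def prob_at_most_k_errors(read_length, error_rate, max_mismatches):
    """P(number of base calling errors in a read <= max_mismatches).

    Assumption: errors are independent across positions, so the error
    count follows Binomial(read_length, error_rate).
    """
    return sum(
        comb(read_length, k)
        * error_rate ** k
        * (1 - error_rate) ** (read_length - k)
        for k in range(max_mismatches + 1)
    )


# Mismatch budgets and read length from the slide; 0.05 is a placeholder.
for m in (2, 5, 7, 9):
    print(m, prob_at_most_k_errors(30, 0.05, m))
```

Under this model a larger mismatch budget lets more erroneous reads map, at the cost of more candidate locations per read.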
Simulation Method and Setup
• Generate Target Genome
• Simulate Reads [5]
• Add Base Calling Error
• Call Variants
• Calculate Variant Calling Error
Method Setup
© GemSIM, 2013 [5]
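The steps above can be sketched in a few functions. This is a simplified stand-in, not GemSIM: the target genome is derived from a reference by single-base substitutions, reads are uniformly placed substrings, and the error metric (missed plus spurious calls over true variants) is a hypothetical definition, since the slides do not define "variant calling error". All names are illustrative.

```python
import random

BASES = "ACGT"


def make_target_genome(reference, variation_rate, rng):
    """Return (target, variants); variants maps position -> alternate base."""
    target = list(reference)
    variants = {}
    for pos, ref_base in enumerate(reference):
        if rng.random() < variation_rate:
            alt = rng.choice([b for b in BASES if b != ref_base])
            target[pos] = alt
            variants[pos] = alt
    return "".join(target), variants


def sample_reads(genome, read_length, n_reads, rng):
    """Single-end shotgun reads as (true_start, sequence) pairs."""
    last_start = len(genome) - read_length
    starts = [rng.randint(0, last_start) for _ in range(n_reads)]
    return [(s, genome[s:s + read_length]) for s in starts]


def variant_calling_error(true_variants, called_variants):
    """Hypothetical metric: (missed + spurious calls) / true variant count.

    A call at the right position with the wrong base counts as one
    missed and one spurious call.
    """
    true_set = set(true_variants.items())
    called_set = set(called_variants.items())
    wrong = len(true_set ^ called_set)
    return wrong / max(len(true_set), 1)
```

Base calling errors would be added to the sampled reads before mapping, and the called variants compared against the `variants` dictionary returned by `make_target_genome`.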
Simulation Results: Simulation Verifies Analysis Predictions
• E. coli Genome [6]
• Mismatches= {3, 4, 5}
• Genome Size ~ 4Mbp
• Read Length= 30bp
• Variation Rate~ 0.01
• Single-end Shotgun Run
• Map with SOAP [7]
Figure: Variant Calling Error vs. Base Calling Error
© NCBI, 2014 [6]
© G. BGI, 2008 [7]
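The role of the mismatch setting can be illustrated with a brute-force mapper that reports every reference position where a read aligns within a given mismatch budget. This is only a sketch for small inputs; it is not how SOAP works internally, and the function names are hypothetical.

```python
def hamming_within(a, b, max_mismatches):
    """Mismatch count of two equal-length strings, or None if over budget."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > max_mismatches:
                return None  # stop early once the budget is exceeded
    return mismatches


def map_read(reference, read, max_mismatches):
    """Return all (position, mismatches) hits within the mismatch budget."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        d = hamming_within(window, read, max_mismatches)
        if d is not None:
            hits.append((pos, d))
    return hits
```

A read with more than one hit is ambiguous, which is where repeated regions in the reference start to affect variant calling.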
Simulation Results: Random Genome Obviates Repeat Region Effect
• Genome Sizes ~ 4Mbp
• Mismatches= 3
• Read Length= 30bp
• Variation Rate~ 0.01
• Single-end Shotgun Run
• Map with SOAP [7]
Figure: Random Genome vs. E. coli Genome
© G. BGI, 2008 [7]
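One rough way to compare the repeat content of a random genome and a real one is to measure how many read-length substrings occur more than once. The sketch below does this on the forward strand only and with exact matches; the function name and the metric are assumptions for illustration, not something defined in the slides.

```python
from collections import Counter


def repeated_kmer_fraction(genome, k):
    """Fraction of k-mer positions whose k-mer occurs more than once.

    Assumption: forward strand only, exact matches only.
    """
    n_positions = len(genome) - k + 1
    if n_positions <= 0:
        return 0.0
    counts = Counter(genome[i:i + k] for i in range(n_positions))
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / n_positions
```

With k set to the read length (30 here), a uniformly random genome of a few Mbp would be expected to give a value near zero, while a genome with repeat regions gives a larger one. Holding every k-mer in memory is workable at this scale but would need a more compact index for much larger genomes.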
Conclusion
Simulation Results
• Confirm the Hypothesis
• Genome Repeat Regions Impair Accuracy
Analytical Results
• Confirm the Hypothesis
• Higher Mismatches May Not Obey
Next Steps
Simulation Steps
• Genome Having More Repeat Regions
• Develop Mapper with Higher Mismatches
Analytical Steps
• Genome Structure
• Paired-end Shotgun Sequencing
• Erasure Base Calling Error
• Other Variant Types
References
[1] EMBO Conference, “Human Evolution in the Genomic Era: Origins, Populations, and Phenotypes,” 2014. [Online]. Available: events.embo.org/14-human-evo
[2] Illumina Inc., “Theory of Operation, HCS 1.4/RTA 1.12,” 2011.
[3] C. Ye, C. Hsiao, and H. Corrada Bravo, “BlindCall: ultra-fast base-calling of high-throughput sequencing data by blind deconvolution,” Bioinformatics, 30(9), 1214–1219, 2014.
[4] C. Ledergerber and C. Dessimoz, “Base-calling for next-generation sequencing platforms,” Briefings in Bioinformatics, 2011.
[5] GemSIM, “Gemsim,” 2013. [Online]. Available: http://sourceforge.net/projects/gemsim
[6] NCBI, “Escherichia coli O157:H7 str. Sakai DNA, complete genome - Nucleotide - NCBI,” 2014. [Online]. Available: http://www.ncbi.nlm.nih.gov/nuccore/47118301?report=fasta
[7] G. BGI, “SOAP: Short Oligonucleotide Analysis Package,” 2008. [Online]. Available: http://soap.genomics.org.cn
Acknowledgement
Thank You for Your Patience, Time and Attention.
