• Xin-zhuan Su
• Sittiporn Pattaradilokrat
• Sethu Nair
• Yanwei Qi
• Gordon Bullen
NIH/NIAID – Malaria Functional Genomics Section
• Sebastian Gurevich
McGill University
• Philip Awadalla
University of Montreal
Funding:
National Institutes of Health
Canadian Institutes of Health Research
https://github.com/parasite-genomics/Pipelines - 2.0 Coming in July 2014
zmartine@gmail.com
ComPar: Genome Assembly, Variant Mapping, and Validation Pipelines
Martine Zilversmit
http://www.slideshare.net/zmartine1/i-evobio-zilversmit25jun14
ComPar: Genome Assembly, Variant Mapping, and Validation Pipelines
https://github.com/parasite-genomics/Pipelines
• BASH-scripted pipelines
• Accurate variant prediction
– SNPs
– Small indels
– Large indels (>17 bp)
– Focused regions of extreme divergence (35-70% amino acid identity)
• In silico variant validation
Finding Highly Divergent Regions – HDR Program
Parameters:
- Quality Metric and Cutoff
- Number of variants per cluster
- Maximum distance between variants within a cluster
- Maximum distance between smaller clusters to merge into an HDR
VCF File
False Positive Variants
True Positive Variants
HDR File:
- Size of HDR
- Position of HDR
- Variants Contained
Python - Stand-alone interactive or pipelined
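The four parameters listed for the HDR program suggest a two-stage clustering: group nearby variants that pass the quality cutoff, then merge nearby clusters into one region. The block below is a minimal Python sketch of that idea, not the ComPar implementation. The names `find_hdrs` and `Region`, the default cutoff values, and the `(position, quality)` input tuples are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Region:
    """A candidate highly divergent region (hypothetical structure)."""
    start: int
    end: int
    positions: List[int]

    @property
    def size(self) -> int:
        # Inclusive span in base pairs.
        return self.end - self.start + 1


def find_hdrs(
    variants: Iterable[Tuple[int, float]],
    min_qual: float = 30.0,      # assumed quality cutoff
    min_variants: int = 3,       # assumed number of variants per cluster
    max_gap: int = 20,           # assumed max distance between variants in a cluster
    max_merge_gap: int = 100,    # assumed max distance between clusters to merge
) -> List[Region]:
    # Keep only variants that pass the quality cutoff, ordered by position.
    positions = sorted(pos for pos, qual in variants if qual >= min_qual)

    # Stage 1: chain variants whose neighbours are at most max_gap apart.
    groups: List[List[int]] = []
    for pos in positions:
        if groups and pos - groups[-1][-1] <= max_gap:
            groups[-1].append(pos)
        else:
            groups.append([pos])
    clusters = [g for g in groups if len(g) >= min_variants]

    # Stage 2: merge clusters separated by at most max_merge_gap.
    hdrs: List[Region] = []
    for cluster in clusters:
        if hdrs and cluster[0] - hdrs[-1].end <= max_merge_gap:
            hdrs[-1].positions.extend(cluster)
            hdrs[-1].end = cluster[-1]
        else:
            hdrs.append(Region(cluster[0], cluster[-1], list(cluster)))
    return hdrs
```

Each returned `Region` carries the three items the slide lists for the HDR file: size, position, and the variants contained.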
Comparing 2 Plasmodium Genomes
Dye-Terminator Sequenced Variation – 50 base-pair sliding window
[Figure: number of variants (y-axis, 0 to 12) against position on the “chromosome” (x-axis, 200 to 9600).]
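The plots in this deck count variants per 50 base-pair window along the sequence. A minimal Python sketch of such a count follows. The function name `window_counts`, the 1-based coordinates, and the default step equal to the window size (non-overlapping windows) are assumptions; the slides do not state the step used.

```python
from bisect import bisect_left
from typing import Iterable, List, Tuple


def window_counts(
    positions: Iterable[int],
    seq_length: int,
    window: int = 50,  # window size in base pairs, as on the slide
    step: int = 50,    # assumed step; use a smaller value for overlapping windows
) -> List[Tuple[int, int]]:
    """Return (window_start, variant_count) pairs along a 1-based sequence."""
    ordered = sorted(positions)
    counts = []
    for start in range(1, seq_length + 1, step):
        end = start + window  # exclusive upper bound of the window
        # Number of sorted positions p with start <= p < end.
        n = bisect_left(ordered, end) - bisect_left(ordered, start)
        counts.append((start, n))
    return counts
```

Plotting the second element of each pair against the first gives a profile of the kind shown in the figures.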
Comparing 2 Plasmodium Genomes
Predicted Variants – No Filtering Based on Quality Metrics
[Figure: number of variants (0 to 12) against position on the “chromosome” (200 to 9600). Series: True Variants; Unfiltered Results.]
Comparing 2 Plasmodium Genomes
Predicted Variants - Filtering Based on Quality Score ≥ 30 Cutoff
[Figure: number of variants (0 to 12) against position on the “chromosome” (200 to 9600). Series: True Variants; Unfiltered Results; Quality 30 Cutoff.]
Comparing 2 Plasmodium Genomes
Filtering Based on Consensus Quality (FQ) ≤ -100 Cutoff
[Figure: number of variants (0 to 12) against position on the “chromosome” (200 to 9600). Series: True Variants; Unfiltered Results; Quality 30 Cutoff; FQ −100 Cutoff.]
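The two cutoffs compared in these slides (quality score ≥ 30 and consensus quality FQ ≤ -100) can be applied to VCF records with a few lines of Python. This is a sketch only, assuming tab-separated VCF text with QUAL in column 6 and an `FQ` key in the INFO column, as written by older samtools/bcftools releases. The function names are hypothetical and this is not the ComPar code.

```python
from typing import Dict, Iterable, List, Optional


def parse_info(info: str) -> Dict[str, str]:
    """Split a VCF INFO string into key/value pairs (flags map to '')."""
    fields = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def filter_vcf_lines(
    lines: Iterable[str],
    min_qual: Optional[float] = 30.0,  # keep records with QUAL >= min_qual
    max_fq: Optional[float] = -100.0,  # keep records with FQ <= max_fq
) -> List[str]:
    """Return the data lines that pass the enabled cutoffs (None disables one)."""
    kept = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8:
            continue  # not a complete VCF record
        if min_qual is not None:
            if cols[5] == "." or float(cols[5]) < min_qual:
                continue
        if max_fq is not None:
            fq = parse_info(cols[7]).get("FQ", "")
            if fq == "" or float(fq) > max_fq:
                continue
        kept.append(line)
    return kept
```

Passing `max_fq=None` reproduces a quality-only filter, and `min_qual=None` an FQ-only filter, so the two settings can be compared separately.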
Comparing 2 Plasmodium Genomes
Highly-Divergent Regions (HDRs)
[Figure: number of variants (0 to 12) against position on the “chromosome” (200 to 9600). Series: True Variants; Unfiltered Results; Quality 30 Cutoff; FQ −100 Cutoff.]
Comparing 2 Plasmodium Genomes
Quality ≥ 30 Variants without Consensus Quality ≥ -100
Highly-Divergent Regions (HDRs)
[Figure: number of variants (0 to 12) against position on the “chromosome” (200 to 9600). Series: True Variants; Unfiltered Results; Quality 30 Cutoff; FQ −100 Cutoff; Quality 30, No HDRs.]
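The "Quality 30, No HDRs" series appears to show quality-filtered variants with those inside highly divergent regions excluded. Under that reading, which is an assumption, the exclusion step could be sketched in Python as follows. The function name and the inclusive, non-overlapping `(start, end)` intervals are hypothetical.

```python
from bisect import bisect_right
from typing import Iterable, List, Tuple


def drop_variants_in_hdrs(
    positions: Iterable[int],
    hdrs: Iterable[Tuple[int, int]],
) -> List[int]:
    """Keep only variant positions that fall outside every HDR interval.

    Assumes HDR intervals are inclusive and do not overlap one another.
    """
    intervals = sorted(hdrs)
    starts = [start for start, _ in intervals]
    kept = []
    for pos in positions:
        # Index of the last interval starting at or before pos, or -1 if none.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos <= intervals[i][1]:
            continue  # pos lies inside an HDR
        kept.append(pos)
    return kept
```

The remaining positions can then be counted per window to compare against the true variants.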
Characteristics of Highly Divergent Regions
33X 44.4%
By265 55.6%
N67 66.7%
histone acetyltransferase GCN5, putative (GCN5)
RNA-binding protein NOB1, putative
Percent Identity
DNA repair protein, putative
33X 41.4%
By265 79.3%
N67 51.7%


Editor's Notes

  • #3: MAPPING, DEFINE! DE NOVO, DEFINE! SAY WHAT VARIANTS ARE
  • #5: Define highly divergent regions
  • #6: Define highly divergent regions
  • #7: Define highly divergent regions
  • #8: Define highly divergent regions
  • #9: Define highly divergent regions
  • #10: Define highly divergent regions