Using long and linked reads to generate
a new Genome in a Bottle small variant
benchmark
Justin Wagner, Andrew Carroll, Ian T. Fiddes, Aaron M. Wenger, William J.
Rowell, Nathan Olson, Lindsey Harris, Jenny McDaniel, Xin Zhou, Sergey
Aganezov, Melanie Kirsche, Bohan Ni, Samantha Zarate, Byunggil Yoo, Neil
Miller, C. Xiao, Marc Salit, Justin Zook, Genome in a Bottle Consortium
GRC/GIAB Workshop ASHG 2019
Overview
• v3.3.2 benchmark variants and regions cover 87.84% of assembled
bases in chromosomes 1-22 in GRCh37 for the sample HG002
• Short read variant callers perform poorly in genomic locations with
high homology such as segmental duplications and low-complexity
repeat-rich regions
• Now utilizing PacBio CCS and 10x Genomics data to expand the GIAB
benchmark regions and reduce errors in current regions
• Long and linked reads add variants to the benchmark, mostly in
regions difficult to map with short reads
• GRCh37: 276,840 SNPs and 53,482 INDELs
• GRCh38: 286,483 SNPs and 42,980 INDELs
How the benchmark is generated
When do we trust variants and regions from
each method
[Diagram: for one variant calling method X, tracks labeled Variants (PASS; filtered outliers), Low/high coverage or low MQ (or low GQ for gVCF), Difficult regions/SVs (TR), and Callable regions, with example genotypes 1/1 and 0/1 and sites marked (1), (2), (3)]
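The checks named in the diagram can be sketched as a simple per-window filter. This is only an illustration: the `Window` type, the `is_callable` function, and all three threshold values are hypothetical and are not taken from the GIAB pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Summary of one genomic window for one method (hypothetical fields)."""
    depth: float        # mean read depth in the window
    mean_mapq: float    # mean mapping quality (MQ) of reads in the window
    in_difficult: bool  # overlaps a difficult region or SV excluded for this method


def is_callable(w: Window, median_depth: float,
                low_frac: float = 0.5, high_frac: float = 2.0,
                min_mapq: float = 20.0) -> bool:
    """Return True if the window passes the three checks from the diagram.

    low_frac, high_frac, and min_mapq are placeholder thresholds chosen
    for this example only.
    """
    if w.in_difficult:
        return False  # difficult regions/SVs are excluded outright
    if w.depth < low_frac * median_depth or w.depth > high_frac * median_depth:
        return False  # low or high coverage
    return w.mean_mapq >= min_mapq  # low MQ fails


print(is_callable(Window(depth=30, mean_mapq=60, in_difficult=False), median_depth=30))
```

A window is kept only if it passes every check, so a single failing criterion is enough to drop it from that method's callable regions.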
Arbitrating between variant calls in different
methods
[Diagram: PASS variants and callable regions from input methods #1 and #2 are compared to produce benchmark calls and benchmark regions. Example genotypes (0/1, 1/1) are shown for four cases:]
(1) Concordant
(2) Discordant unresolved
(3) Discordant arbitrated
(4) Concordant not callable
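The four cases can be sketched as a small classifier for two input methods. This is a toy illustration rather than the GIAB integration code: the `Call` type, the `classify_site` function, and the rule that a call is trusted only when it is PASS and lies inside that method's callable regions are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Call:
    """One method's view of a site (hypothetical structure for this sketch)."""
    genotype: str   # e.g. "0/1" or "1/1"
    passed: bool    # True if the call is PASS (not a filtered outlier)
    callable: bool  # True if the site lies in this method's callable regions


def classify_site(a: Call, b: Call) -> Tuple[str, Optional[str]]:
    """Return (case label, benchmark genotype or None) for two input methods."""
    a_trusted = a.passed and a.callable
    b_trusted = b.passed and b.callable
    if a.genotype == b.genotype:
        if a_trusted or b_trusted:
            return "concordant", a.genotype
        return "concordant, not callable", None
    # Genotypes disagree: keep a call only if exactly one method is trusted.
    if a_trusted and not b_trusted:
        return "discordant, arbitrated", a.genotype
    if b_trusted and not a_trusted:
        return "discordant, arbitrated", b.genotype
    return "discordant, unresolved", None


print(classify_site(Call("0/1", True, True), Call("1/1", True, False)))
```

Sites that return `None` for the genotype would be left out of the benchmark regions in this sketch, which mirrors the unresolved and not-callable cases in the diagram.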
Sequencing data used in integration for
HG002
Platform | Characteristics | Alignment; Variant Calling
Illumina | 150x150bp; ~300x coverage | Novoalign; GATK v3.5
CG | 26x26bp; ~100x coverage | Complete Genomics Pipeline
Illumina | 150x150bp; ~300x coverage | Novoalign; Freebayes
Illumina | 250x250bp; ~45x coverage | Novoalign; GATK v3.5
Illumina | 250x250bp; ~45x coverage | Novoalign; Freebayes
Illumina | 6Kbp mate pair; ~13x coverage | bwa_mem; GATK v3.5
Illumina | 6Kbp mate pair; ~13x coverage | bwa_mem; Freebayes
Ion | Exome; 1000x coverage | Torrent Suite v4.2; Torrent Variant Caller v4.4
Solid | 75bp; ~60x coverage | LifeScope v2.5.1; GATK v3.5
PacBio CCS | Sequel II ~11kb reads; ~32x coverage | minimap2; GATK4
PacBio CCS | Sequel II ~11kb reads; ~32x coverage | minimap2; DeepVariant v0.8
10x Genomics | Linked reads; ~84x coverage | LongRanger Pipeline
Long and linked reads cover more variants
and regions
[Diagram: for one variant calling method X, tracks labeled Variants (PASS; filtered outliers), Low/high coverage or low MQ (or low GQ for gVCF), Difficult regions/SVs (TR), and Callable regions, with example genotypes 1/1 and 0/1 and sites marked (1), (2), (3)]
10x Genomics and PacBio CCS data add new variants (1), regions with good
coverage of high MQ reads (2), and access to difficult regions (3)
How the benchmark is generated
Difficult Regions Excluded from all Methods
Difficult Region Description | Bases Covered in GRCh37 | Bases Covered in GRCh38
v0.6 SV GIAB Benchmark | 32,596,754 | 32,872,907
Potential copy number variation | 51,713,344 | 62,666,746
Tandem Repeats > 10kb | 5,731,885 | 71,942,255
Highly similar and high depth segmental duplications | 1,232,701 | 2,094,143
Regions that are collapsed and expanded from GRCh37/38 Primary Assembly Alignments | 17,979,597 | N/A
Modeled centromere and heterochromatin | N/A | 62,304,573
Difficult Regions Excluded by Method
• Tandem Repeats < 51bp except GATK from Illumina PCR-free, Complete
Genomics, and CCS DeepVariant
• Tandem Repeats > 51bp and < 200bp except GATK from Illumina PCR-
Free and CCS DeepVariant
• Tandem Repeats > 200bp except CCS DeepVariant
• Homopolymers > 6bp except GATK from Illumina PCR-free, Complete
Genomics, Ion Exome, PacBio CCS
• Imperfect homopolymer > 10bp except GATK from Illumina PCR-Free
• Difficult to map regions for short reads except 10x and CCS
• LINE:L1Hs > 500bp except Illumina MatePair, 10x, and CCS
• Segmental duplications except 10x and CCS
v4 draft benchmark includes variants found
with haplotype-resolved assembly of MHC
• Worked with a team from the March 2019 NCBI Pangenome
Hackathon to generate haplotype-resolved assembly of MHC region
(chr6:28,477,797-33,448,354 in GRCh37)
• Use assembly to call small variants
• Small variants from assembly are integrated with mapping-based calls
in the MHC region for v4 draft benchmark
• v4 draft benchmark includes 23,229 variants in the MHC region
• Covers most HLA genes and CYP21A2/TNXA/TNXB
v4 draft benchmark includes more bases,
variants, and segmental duplications
Metric | v4 draft GRCh37 | v4 draft GRCh38
Base pairs | 2,504,027,936 | 2,509,269,277
Reference covered | 93.2% | 91.03%
SNPs | 3,323,773 | 3,314,941
Indels | 519,152 | 519,494
Base pairs in Segmental Duplications | 64,300,499 | 73,819,342
[Chart: percent of reference covered, axis from 80.00% to 95.00%]
Some variants and segmental duplications
only covered in v3.3.2 or v4 draft
[Charts: SNPs, INDELs, and segmental duplication base pairs covered only in v3.3.2 or only in v4 draft, for GRCh37 and GRCh38. Bar labels in extraction order, SNP/INDEL charts: 343,358; 69,495; 77,324; 23,828; 376,653; 91,837; 91,719; 48,753. Bar labels in extraction order, segmental duplication charts: 25,445; 63,949,151; 1,928,353; 70,187,985]
v4 draft enables benchmarking in regions
difficult for short reads
Comparison of Illumina RTG VCF against benchmark sets
• SNP FNs increase by a factor of more than 3, mostly due to new
benchmark variants in difficult to map regions and segmental
duplications
• False negatives: variants present in the truth set, but missed in the query
Subset | v3.3.2 FNs | v4 draft FNs
All SNPs | 8,594 | 30,229
Low mappability | 6,708 | 25,295
Segmental duplications | 1,429 | 14,008
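The "factor of more than 3" can be checked directly from the table. The short sketch below recomputes the fold increase per subset; the variable and function names are made up for the example, and the counts are copied from the table above.

```python
# False-negative SNP counts from the table above: (v3.3.2 FNs, v4 draft FNs).
fn_counts = {
    "All SNPs": (8_594, 30_229),
    "Low mappability": (6_708, 25_295),
    "Segmental duplications": (1_429, 14_008),
}


def fold_increase(old: int, new: int) -> float:
    """Ratio of v4 draft FNs to v3.3.2 FNs."""
    return new / old


fold = {subset: fold_increase(old, new) for subset, (old, new) in fn_counts.items()}
for subset, value in fold.items():
    print(f"{subset}: {value:.2f}x")
# All SNPs: 3.52x
# Low mappability: 3.77x
# Segmental duplications: 9.80x
```

The two stratified subsets are not disjoint (their v4 draft counts sum to more than the All SNPs count), so their increases should not be added together.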
v4 draft benchmark contains more medically-
relevant variants
• v4 draft covers more of the MHC region
• Outside of the MHC updates, the 5 genes with the largest increase in variants from v3.3.2
to the v4 draft benchmark are TSPEAR (31), LAMA5 (28), FCGBP (18), TPSAB1
(15), and HSPG2 (13)
• Among ACMG59 genes, PMS2 has 2 more variants, and RET, SCN5A, and TNNI3 each have 1
more variant, covered in the v4 draft benchmark than in v3.3.2
Variants in Medical Exome (genes from OMIM, HGMD, ClinVar, UniProt)
Benchmark Regions v3.3.2: 8,209
Benchmark Regions v4 draft: 9,527
Sanger sequencing confirms medically-
relevant variants
• Performed long range PCR
before sequencing
• Confirmed 12 variants in
CYP21A2, which is a medically-
relevant gene in the MHC region
• Confirmed 6 variants in PMS2
• Confirmed 15 variants in 5 other
genes
Evaluation by GIAB collaborators
Compared benchmark to callsets from a variety of technologies and
variant calling methods including:
• Illumina PCR-Free and Dragen
• PacBio CCS and GATK4
• PacBio CCS and DeepVariant
• PacBio CCS and Clair (Next generation of Clairvoyante)
• ONT Promethion and Clair
Preliminary results suggest that, at a majority of FP and FN sites, the
benchmark is correct and the error is in the tested callset
More volunteers welcomed
Manual curation by callset developers
Process
• Compare callset to benchmark using
hap.py and/or vcfeval
• Randomly select 5 FP SNPs, 5 FN SNPs, 5
FP indels and 5 FN indels, each from
inside and outside the v3.3.2 benchmark
bed, in GRCh37 and GRCh38
(5*4*2*2=80 total)
• Use IGV with PCR-free Illumina, PacBio
CCS, 10x, and ONT + difficult bed files
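The site selection in the Process list amounts to stratified random sampling. The sketch below shows one way to do it; the function name, the stratum labels, the fixed seed, and the toy input are all assumptions for illustration, and in practice the candidate sites would come from the hap.py or vcfeval comparison output.

```python
import random
from itertools import product

ERROR_TYPES = ("FP", "FN")
VARIANT_TYPES = ("SNP", "indel")
REGIONS = ("inside v3.3.2 bed", "outside v3.3.2 bed")
REFERENCES = ("GRCh37", "GRCh38")
PER_STRATUM = 5


def sample_for_curation(sites_by_stratum, seed=0):
    """Pick up to PER_STRATUM sites at random from every stratum.

    sites_by_stratum maps (error type, variant type, region, reference)
    to a list of candidate sites, e.g. "chrom:pos" strings.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    picked = {}
    for stratum in product(ERROR_TYPES, VARIANT_TYPES, REGIONS, REFERENCES):
        candidates = sites_by_stratum.get(stratum, [])
        k = min(PER_STRATUM, len(candidates))
        picked[stratum] = rng.sample(candidates, k)
    return picked


# Toy input: 50 made-up site labels per stratum.
toy = {
    s: [f"site{i}" for i in range(50)]
    for s in product(ERROR_TYPES, VARIANT_TYPES, REGIONS, REFERENCES)
}
selection = sample_for_curation(toy)
total = sum(len(v) for v in selection.values())
print(total)  # 80, i.e. 5 sites * 4 error/variant combinations * 2 regions * 2 references
```

A stratum with fewer than 5 candidates simply contributes all of its sites, so the total can fall below 80 for a callset with very few errors of one kind.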
Questions to ask
• Are both alleles correct in the
benchmark?
• Yes/No/Unsure
• Are both alleles correct in the callset
being tested?
• Yes/No/Unsure
• If the benchmark is wrong or
questionable, how did you make this
determination?
• Instructions: Be critical of the benchmark,
and select unsure if the evidence does
not strongly support the benchmark
being correct
Process for independent evaluations
• Callset developer curates putative errors
  • Benchmark is correct: no further curation
  • Benchmark is wrong or questionable:
    • NIST curator agrees: classify source of potential error in benchmark
    • NIST curator disagrees: discuss with callset developer
Initial evaluation suggests a majority of FPs and FNs
are correct in the benchmark and errors in the
tested callsets
Platform and Caller | Number Benchmark Correct | Number Benchmark Unsure | Benchmark is not correct | Comparison callset is not correct | Total sites
CCS with GATK GRCh37 FP | 19 | 1 | 0 | 19 | 20
CCS with GATK GRCh37 FN | 15 | 3 | 2 | 18 | 20
ONT with Clair GRCh37 FP | 33 | 1 | 0 | 34 | 34
ONT with Clair GRCh37 FN | 27 | 3 | 0 | 30 | 30
CCS with Clair GRCh37 FP | 7 | 13 | 0 | 6 | 20
CCS with Clair GRCh37 FN | 19 | 1 | 0 | 19 | 20
Illumina with Dragen GRCh37 FP | 14 | 6 | 0 | 11 | 20
Illumina with Dragen GRCh37 FN | 17 | 3 | 0 | 17 | 20
Evaluation FPs – Inversions
LINEs
Evaluation FPs – Complex SVs
Evaluation FPs – Near SVs
Evaluation FPs – Near low coverage
Potential refinements identified for v4.1
• Exclude VDJ
• Exclude Inversions
• Improve CNV coverage
  • Use ONT for excessive coverage
  • Explore smoothing on excessive coverage beds
  • Use new diploid assemblies to identify CNVs
• MHC
  • Exclude CNVs in the MHC, partial repeats in MHC, small regions that are questionable in the DRB genes
• Benchmark regions density
  • Regions with dense variation and many gaps in bed
  • Dense variants near SVs
• Segmental duplications
  • Small region of duplication covered by benchmark
  • Containing an SV
Conclusions
• Long and linked reads add variants to the benchmark, mostly in
regions difficult to map with short reads
• GRCh37: 276,840 SNPs and 53,482 INDELs
• GRCh38: 286,483 SNPs and 42,980 INDELs
• v4 draft benchmark is available for GRCh37 and GRCh38
• GRCh37 Percent Chromosomes 1-22 Covered: 93.2%
• GRCh38 Percent Chromosomes 1-22 Covered: 91.03%
• Initial evaluation suggests a majority of FPs and FNs are correct in the
benchmark and errors in the tested callsets
• More volunteers welcomed
• Identified refinements for v4.1
On-going and Future Work
• Refine use of genome stratifications
• Adding variant calls from raw PacBio and Oxford Nanopore
• Improve benchmark for larger indels, homopolymers, and tandem
repeats
• Improve normalization of complex variants
• Generating benchmark variants from diploid assemblies
• Machine learning
• Outlier detection, active learning
• Generate v4 draft for other GIAB genomes
Acknowledgements
• Andrew Carroll
• Ian T. Fiddes
• Aaron M. Wenger
• William J. Rowell
• Nathan Olson
• Lindsey Harris
• Jenny McDaniel
• Chunlin Xiao
• Marc Salit
• Justin Zook
• Genome in a Bottle Consortium
Draft Benchmark Evaluators
• Xin Zhou
• Sergey Aganezov
• Melanie Kirsche
• Bohan Ni
• Samantha Zarate
• Byunggil Yoo
• Neil Miller
Backup
Initial evaluation suggests a majority of FPs and FNs
are correct in the benchmark and errors in the
tested callsets
Platform and Caller | Number Benchmark Correct | Number Benchmark Unsure | Benchmark is not correct | Comparison callset is not correct | Total sites
CCS with GATK GRCh38 FP | 16 | 4 | 0 | 16 | 20
CCS with GATK GRCh38 FN | 17 | 3 | 0 | 16 | 20
ONT with Clair GRCh38 FP | 19 | 1 | 0 | 19 | 20
ONT with Clair GRCh38 FN | 14 | 6 | 0 | 19 | 20
CCS with Clair GRCh38 FP | 15 | 5 | 0 | 16 | 20
CCS with Clair GRCh38 FN | 18 | 2 | 0 | 20 | 20
Illumina with Dragen GRCh38 FP | 16 | 3 | 1 | 16 | 20
Illumina with Dragen GRCh38 FN | 18 | 2 | 0 | 18 | 20
Integration Pipeline Process
(1) Find sensitive variant calls and callable regions for each dataset, excluding difficult regions/SVs that are problematic for each type of data and variant caller
(2) Find “consensus” calls with support from 2+ technologies (and no other technologies disagree) using callable regions
(3) Use “consensus” calls to train simple one-class model for each dataset and find “outliers” that are less trustworthy for each dataset
(4) Find benchmark calls by using callable regions and “outliers” to arbitrate between datasets when they disagree
(5) Find benchmark regions by taking union of callable regions and subtracting uncertain variants
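The last step, taking the union of callable regions and subtracting uncertain variants, is plain interval arithmetic. The sketch below works on half-open `(start, end)` intervals for a single chromosome; the function names, the coordinate convention, and the toy coordinates are assumptions for illustration, and a real pipeline would more likely operate on BED files with a dedicated interval tool.

```python
def merge(intervals):
    """Union of half-open [start, end) intervals on one chromosome."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)  # overlaps or touches: extend
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def subtract(intervals, holes):
    """Remove every hole from the (already merged) intervals."""
    result = []
    holes = merge(holes)
    for start, end in intervals:
        cursor = start
        for h_start, h_end in holes:
            if h_end <= cursor or h_start >= end:
                continue  # hole does not overlap the remaining piece
            if h_start > cursor:
                result.append((cursor, h_start))
            cursor = max(cursor, h_end)
        if cursor < end:
            result.append((cursor, end))
    return result


# Toy example: callable regions from two methods, one uncertain variant.
callable_1 = [(100, 300), (400, 500)]
callable_2 = [(250, 450)]
uncertain = [(200, 210)]

benchmark_regions = subtract(merge(callable_1 + callable_2), uncertain)
print(benchmark_regions)  # [(100, 200), (210, 500)]
```

How much sequence is removed around each uncertain variant is a design choice that the slide does not specify, so the 10 bp hole here is arbitrary.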
Sanger sequencing results
Initial evaluation shows a majority of FPs and FNs
are correct in the benchmark and errors in the
tested callsets
Platform and Caller | Number Benchmark Correct | Number Benchmark Unsure | Benchmark is not correct | Comparison callset is not correct | Total sites
CCS with DeepVariant GRCh37 FP | 3 | 9 | 8 | (not reported) | 20
CCS with DeepVariant GRCh37 FN | 17 | 3 | 0 | (not reported) | 20
CCS with GATK GRCh37 FP | 19 | 1 | 0 | 19 | 20
CCS with GATK GRCh37 FN | 15 | 3 | 2 | 18 | 20
ONT with Clair GRCh37 FP | 33 | 1 | 0 | 34 | 34
ONT with Clair GRCh37 FN | 27 | 3 | 0 | 30 | 30
CCS with Clair GRCh37 FP | 7 | 13 | 0 | 6 | 20
CCS with Clair GRCh37 FN | 19 | 1 | 0 | 19 | 20
Illumina with Dragen GRCh37 FP | 14 | 6 | 0 | 11 | 20
Illumina with Dragen GRCh37 FN | 17 | 3 | 0 | 17 | 20
Initial evaluation shows a majority of FPs and FNs
are correct in the benchmark and errors in the
tested callsets
Platform and Caller | Number Benchmark Correct | Number Benchmark Unsure | Benchmark is not correct | Comparison callset is not correct | Total sites
CCS with DeepVariant GRCh38 FP | 6 | 7 | 7 | (not reported) | 20
CCS with DeepVariant GRCh38 FN | 20 | 0 | 0 | (not reported) | 20
CCS with GATK GRCh38 FP | 16 | 4 | 0 | 16 | 20
CCS with GATK GRCh38 FN | 17 | 3 | 0 | 16 | 20
ONT with Clair GRCh38 FP | 19 | 1 | 0 | 19 | 20
ONT with Clair GRCh38 FN | 14 | 6 | 0 | 19 | 20
CCS with Clair GRCh38 FP | 15 | 5 | 0 | 16 | 20
CCS with Clair GRCh38 FN | 18 | 2 | 0 | 20 | 20
Illumina with Dragen GRCh38 FP | 16 | 3 | 1 | 16 | 20
Illumina with Dragen GRCh38 FN | 18 | 2 | 0 | 18 | 20
Initial evaluation shows a majority of FPs and FNs
are correct in the benchmark and errors in the
tested callsets
Platform and Caller | Number Benchmark Correct | Number Benchmark Unsure/No | Number Callset Incorrect
CCS with GATK GRCh37 | 32 | 8 | 32
CCS with GATK GRCh38 | 33 | 7 | 32
ONT with Clair GRCh37 | 60 | 4 | 60
CCS with Clair GRCh37 | 26 | 14 | 24
CCS with Clair GRCh38 | 33 | 7 | 36
Illumina with Dragen GRCh37 | 31 | 9 | 28
Illumina with Dragen GRCh38 | 34 | 6 | 34
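The "majority" claim can be quantified by pooling the rows of the combined table. The sketch below does that under two assumptions: every curated site is weighted equally, and each site falls in exactly one of the two benchmark columns (correct, or unsure/no). The dictionary name is made up for the example and the counts are copied from the table above.

```python
# (benchmark correct, benchmark unsure/no) per callset, from the table above.
curation = {
    "CCS with GATK GRCh37": (32, 8),
    "CCS with GATK GRCh38": (33, 7),
    "ONT with Clair GRCh37": (60, 4),
    "CCS with Clair GRCh37": (26, 14),
    "CCS with Clair GRCh38": (33, 7),
    "Illumina with Dragen GRCh37": (31, 9),
    "Illumina with Dragen GRCh38": (34, 6),
}

total_correct = sum(correct for correct, _ in curation.values())
total_sites = sum(correct + unsure for correct, unsure in curation.values())
fraction_correct = total_correct / total_sites

print(f"{total_correct}/{total_sites} = {fraction_correct:.1%}")  # 249/304 = 81.9%
```

Because curators were told to choose "unsure" whenever the evidence was not strong, the pooled fraction is better read as a lower bound on how often the benchmark is right at these sites.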

Editor's Notes

  • #11: Exclude tandem repeats approximately larger than the read length for each method. Homopolymers are excluded from 10x and PacBio CCS. Really long homopolymers are only included for GATK-based calls from PCR-free data, because the GATK gVCF has a low genotype quality score when no reads totally encompass the homopolymer. Trust homopolymers most from PCR-free short reads.
  • #14: Ongoing work includes checking if many are in regions that might be in potential CNVs as they could be errors in v3.3.2
  • #15: false-negatives (FN) : variants present in the truth set, but missed in the query.
  • #17: 3_79181930 Add this from what lindsey sent on slack
  • #21: Combine GRCh37 and GRCh38
  • #22: Left is an inversion. Right is likely a LINE-mediated inversion. If there is an inversion near repetitive elements, then exclude the repetitive elements as well. Show just two LINEs and the inversion they flank.
  • #23: Left is likely a tandem duplication, large insertion, or complex insertion. Right is an inversion followed by a deletion that is in the SV benchmark, likely a complex SV.
  • #36: Update this table – Includes Billy’s new results 10x-Aquila_37 16 24 16 10x-Aquila_38 22 18 17