Genome in a Bottle: So you’ve
sequenced a genome – how well did
you do?
February 2015
Justin Zook, Marc Salit, and the Genome
in a Bottle Consortium
Whole genome sequencing technologies
disagree about 100,000’s of variants
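[Figure: Venn diagram of SNP calls from Platform #1, Platform #2, and Platform #3, labeled as # SNPs (% of SNPs detected by any platform)]
• Detected by all three platforms: 3,198,316 (80.05%)
• Remaining regions (detected by only one or two platforms): 125,574 (3.14%); 230,311 (5.76%); 121,440 (3.04%); 208,038 (5.21%); 71,944 (1.80%); 39,604 (0.99%)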
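The percentages on this slide can be checked against the counts. A minimal sketch in Python, assuming the seven counts are the seven disjoint regions of a three-platform Venn diagram and that the largest one is the three-way intersection (the variable names are illustrative, not from the slides):

```python
# SNP counts from the slide, one per region of the three-platform Venn diagram.
region_counts = [3_198_316, 125_574, 230_311, 121_440, 208_038, 71_944, 39_604]

# SNPs detected by at least one platform (assumes the regions are disjoint).
total_any_platform = sum(region_counts)

# Percentage of that total falling in each region, rounded as on the slide.
percentages = [round(100 * n / total_any_platform, 2) for n in region_counts]

# SNPs not called by all three platforms (assumes the largest region is the
# three-way intersection).
discordant = total_any_platform - max(region_counts)

print(total_any_platform, percentages, discordant)
```

Under these assumptions the computed percentages match the slide, and the discordant count is in the hundreds of thousands, consistent with the slide title.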
Bioinformatics programs also disagree
O’Rawe et al. Genome Medicine 2013, 5:28
NIST-hosted
Genome in a Bottle Consortium
• Infrastructure for performance
assessment of NGS
– support science-based regulatory
oversight
• No widely accepted set of metrics
to characterize the fidelity of
variant calls from NGS…
• Genome in a Bottle Consortium is
developing standards to address
this…
– well-characterized human genomes
as Reference Materials (RMs)
• characterized and disseminated by NIST
– tools and methods to use these RMs
• Global Alliance for Genomics and
Health Benchmarking Team
http://genomeinabottle.org
Genome in a Bottle
Consortium Development
• NIST met with sequencing
technology developers to assess
standards needs
– Stanford, June 2011
• Open, exploratory workshop
– ASHG, Montreal, Canada
– October 2011
• Small, invitational workshop at
NIST to develop consortium for
human genome reference
materials
– FDA, NCBI, NHGRI, NCI, CDC, Wash
U, Broad, technology developers,
clinical labs, CAP, PGP, Partners,
ABRF, others
– developed draft work plan
– April 2012
• Open, public meetings of GIAB
– August 2012 at NIST
– March 2013 at Xgen
– August 2013 at NIST
– January 2014 at Stanford
– August 2014 at NIST
– January 2015 at Stanford
• Website
– www.genomeinabottle.org
Others working in this space…
Well-characterized genomes
• Illumina Platinum Genomes
• CDC GeT-RM
• Korean Genome Project
• Human Longevity, Inc.
• Hydatidiform mole haploid
cell line
• Genome Reference
Consortium
Performance Metrics
• Global Alliance for
Genomics and Health
Benchmarking Team
• NCBI/CDC GeT-RM Browser
• GCAT website
NIST Plays a Role in the First FDA Authorization for
Next-Generation Sequencer
November 20, 2013
Measurement Process
Sample
gDNA isolation
Library Prep
Sequencing
Alignment/Mapping
Variant Calling
Confidence Estimates
Downstream Analysis
• gDNA reference
materials will be
developed to
characterize
performance of a part
of process
– materials will be
certified for their
variants against a
reference sequence,
with confidence
estimates
generic measurement process
Analytical
steps
Pre-Analytical
steps
Clinical
Interpretation
• NIST worked with GIAB
to select genomes
• Current genomes
– NA12878 HapMap
sample as Pilot sample
• part of 17-member
pedigree
– 2 trios from PGP
• Ashkenazim
• Asian
Putting “Genomes” in Bottles
[Figure: CEPH Utah Pedigree 1463. Grandparents: 12889, 12890, 12891, 12892. Parents: 12877, 12878. 11 children: 12879, 12880, 12881, 12882, 12883, 12884, 12885, 12886, 12887, 12888, 12893]
NIST Human Genome RMs in the
pipeline
• All are 10 µg samples of DNA
isolated from multistage large
growth cell cultures
– all are intended to act as stable,
homogeneous references
suitable for use in regulated
applications
– all genomes also available from
Coriell repository
• Pilot Genome
– ~8400 tubes
• Ashkenazim Jewish Trio
– ~10000 son; ~2500 each parent
• Asian Trio
– ~10000 son; parents not yet
planned as NIST RM
Goals for Data to Accompany RM
• ~0 false positive AND false negative calls in
confident regions
• Include as much of the genome as possible in
the confident regions (i.e., don’t just take the
intersection)
• Avoid bias towards any particular platform
– take advantage of strengths of each platform
• Avoid bias towards any particular
bioinformatics algorithms
Pilot Genome: Integrate 14
Datasets from 5 platforms
Integration Methods to Establish
Benchmark Variant Calls
[Figure: histograms of Annotation #1 (e.g., coverage) and Annotation #2 (e.g., strand bias) for Dataset #1, Dataset #2, and Dataset #3, with Site A, Site B, and Site C marked and a region of potential bias indicated]
Dataset       Site A   Site B   Site C
Dataset #1    0/0      0/0      1/1
Dataset #2    0/1      0/1      1/1
Dataset #3    0/0      0/1      1/1
Integration   0/0      0/1      Uncertain
Candidate variants
Concordant variants
Find characteristics of bias
Arbitrate using evidence of bias
Confidence Level
Integration Methods to Establish
Benchmark Variant Calls
Candidate variants
Concordant variants
Find characteristics of bias
Arbitrate using evidence of
bias
Confidence Level
Zook et al., Nature Biotechnology, 2014.
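The arbitration idea on these slides can be illustrated with a small sketch. This is not the published integration method from Zook et al.; it is a simplified rule written for illustration. The genotypes are the ones in the slide's example table, but the per-site bias flags are hypothetical (the slide text does not list them), and the function name `integrate_site` is an assumption.

```python
def integrate_site(calls, biased):
    """Arbitrate one site (illustrative rule, not the published algorithm).

    calls  : dict mapping dataset name -> genotype string such as "0/1"
    biased : set of dataset names showing evidence of bias at this site
    """
    # Keep only calls from datasets with no evidence of bias at this site.
    trusted = {gt for name, gt in calls.items() if name not in biased}
    if len(trusted) == 1:
        return trusted.pop()   # the unbiased datasets agree
    return "Uncertain"         # no unbiased dataset, or they still disagree


# Genotypes from the slide; bias flags are hypothetical.
sites = {
    "Site A": ({"Dataset #1": "0/0", "Dataset #2": "0/1", "Dataset #3": "0/0"},
               {"Dataset #2"}),
    "Site B": ({"Dataset #1": "0/0", "Dataset #2": "0/1", "Dataset #3": "0/1"},
               {"Dataset #1"}),
    "Site C": ({"Dataset #1": "1/1", "Dataset #2": "1/1", "Dataset #3": "1/1"},
               {"Dataset #1", "Dataset #2", "Dataset #3"}),
}

integrated = {name: integrate_site(calls, biased)
              for name, (calls, biased) in sites.items()}
print(integrated)
```

With these assumed flags the sketch reproduces the Integration row of the table: Site C stays uncertain even though all datasets agree, because none of them is free of suspected bias.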
Assigning confidence to genotypes
High-confidence sites
• Sequencing/bioinformatics
methods agree or we
understand the biases
causing disagreement
• At least some methods have
no evidence of bias
• Inherited as expected
Less confident sites
• In a region known to be
difficult for current
technologies
• State reasons for lower
confidence
• If a site is near a low
confidence site, make it low
confidence
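The last rule, demoting sites that sit near a low-confidence site, could be sketched as follows. The 50 bp window and the function name `propagate_low_confidence` are assumptions made for illustration; the slides do not state a distance.

```python
def propagate_low_confidence(sites, window=50):
    """Demote high-confidence sites that lie near a low-confidence site.

    sites  : list of (position, is_high_confidence) on a single chromosome
    window : distance in bp that counts as "near" (hypothetical value)
    """
    low_positions = [pos for pos, high in sites if not high]
    result = []
    for pos, high in sites:
        near_low = any(abs(pos - low) <= window for low in low_positions)
        # A site stays high confidence only if it was high and has no
        # low-confidence neighbor within the window.
        result.append((pos, high and not near_low))
    return result


sites = [(100, True), (130, False), (400, True)]
demoted = propagate_low_confidence(sites)
print(demoted)
```

Here the site at position 100 is demoted because it is 30 bp from the low-confidence site at 130, while the site at 400 keeps its status.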
Challenges with assessing
performance
• Not all variant types are
equal
• Not all regions of the genome
are equal
• Labeling difficult variants
as uncertain leads to
higher apparent accuracy
when assessing
performance
• Genotypes fall in 3+
categories (not
positive/negative)
– standard diagnostic
accuracy measures not
well posed
Challenge in variant comparison: Complex
variants have multiple correct representations
[Figure: alignments from BWA, ssaha2, CGTools, and Novoalign shown against the reference (Ref); labels: T insertion, TCTCT insertion]
                              FP SNPs       FP MNPs      FP indels
Traditional comparison        0.38% (610)   100% (915)   6.5% (733)
Comparison with realignment   0.15% (249)   4.2% (38)    2.6% (298)
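One way to see why a record-by-record comparison overcounts false positives is to apply two call sets to the reference and compare the resulting sequences. The sketch below is a generic illustration, not the realignment method used for the numbers above; the reference string, the variants, and the function names are hypothetical.

```python
def apply_variants(reference, variants):
    """Apply non-overlapping variants to a reference string.

    variants: list of (pos, ref_allele, alt_allele) with 0-based positions.
    """
    pieces = []
    cursor = 0
    for pos, ref_allele, alt_allele in sorted(variants):
        if reference[pos:pos + len(ref_allele)] != ref_allele:
            raise ValueError(f"REF allele does not match reference at {pos}")
        pieces.append(reference[cursor:pos])  # unchanged sequence before variant
        pieces.append(alt_allele)             # substituted allele
        cursor = pos + len(ref_allele)
    pieces.append(reference[cursor:])         # unchanged tail
    return "".join(pieces)


def same_haplotype(reference, calls_a, calls_b):
    """True if both call sets produce the same sequence."""
    return apply_variants(reference, calls_a) == apply_variants(reference, calls_b)


reference = "GATCTCTCAG"  # hypothetical reference with a short repeat

# The same 2 bp insertion placed at either end of the repeat.
left_aligned = [(1, "A", "ATC")]
right_aligned = [(7, "C", "CTC")]

# One MNP versus the same change written as two adjacent SNPs.
mnp = [(3, "CT", "GA")]
two_snps = [(3, "C", "G"), (4, "T", "A")]

print(same_haplotype(reference, left_aligned, right_aligned))  # True
print(same_haplotype(reference, mnp, two_snps))                # True
```

In both pairs the records differ, so an exact-match comparison would report a false positive and a false negative, yet the sequences they describe are identical.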
Global Alliance for Genomics and Health
Benchmarking Task Team
• Formed June 2014 to develop
methods and tools for comparing
variant calls to a benchmark
• Developed standardized definitions
for performance metrics like TP, FP,
and FN.
• Initial focus on germline SNPs/indels
• Developing benchmarking tools
• Comparison engine
• Pluggable web interface with
modules for:
• Reporting/calculation of metrics
• Visualization/user interface
• Working with Genome in a Bottle
Consortium to host data and calls
from their well-characterized
genomes
www.bioplanet.com/gcat
[Figure: Example User Interface]
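As a rough illustration of what a comparison engine computes, the sketch below counts TP, FP, and FN inside confident regions and derives sensitivity and precision from them. The slides do not give the Benchmarking Task Team's definitions, so the conventional formulas are used here as an assumption; the function name, the tuple layout, and the example data are hypothetical, and variants are matched by exact record, which ignores the representation problem described earlier.

```python
def benchmark(query, truth, confident_regions):
    """Compare query calls to benchmark calls inside confident regions.

    query, truth      : sets of (chrom, pos, ref, alt) tuples
    confident_regions : list of (chrom, start, end), 0-based half-open
    """
    def in_confident(variant):
        chrom, pos = variant[0], variant[1]
        return any(c == chrom and start <= pos < end
                   for c, start, end in confident_regions)

    # Only sites inside confident regions are assessed.
    query = {v for v in query if in_confident(v)}
    truth = {v for v in truth if in_confident(v)}

    tp = len(query & truth)   # called and in the benchmark
    fp = len(query - truth)   # called but not in the benchmark
    fn = len(truth - query)   # in the benchmark but not called
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    precision = tp / (tp + fp) if tp + fp else float("nan")
    return {"TP": tp, "FP": fp, "FN": fn,
            "sensitivity": sensitivity, "precision": precision}


truth = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr1", 900, "G", "A")}
query = {("chr1", 100, "A", "G"), ("chr1", 250, "T", "C"), ("chr1", 900, "G", "C")}
regions = [("chr1", 0, 500)]

result = benchmark(query, truth, regions)
print(result)
```

The calls at position 900 fall outside the confident region, so they count toward neither FP nor FN in this sketch.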
Stratifying Performance
• Measure performance for
different types of variants in
different sequence contexts
– Types of variants
• SNPs
• indels of different sizes
• complex variants
• structural variants
– Sequence contexts
• Homopolymers
• STRs
• Duplications
– Functional context
• Exome vs genome, etc
– Data characteristics
• Coverage
• Mapping quality
• Challenge of smaller gene
panels vs genome
sequencing
– one RM may not have a
sufficient number of
examples of different classes
of variants or sequence
contexts
– likely need more samples
with specific types of variants
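Stratification can be sketched as grouping benchmark variants by variant type and sequence context before computing a metric. The record layout, the stratum labels, and the function name below are hypothetical, and sensitivity is used only as an example metric.

```python
from collections import defaultdict


def stratified_sensitivity(truth_variants, called_ids):
    """Sensitivity per (variant type, sequence context) stratum.

    truth_variants : list of dicts with keys "id", "type", "context"
    called_ids     : set of benchmark variant ids the pipeline detected
    """
    totals = defaultdict(int)
    found = defaultdict(int)
    for variant in truth_variants:
        stratum = (variant["type"], variant["context"])
        totals[stratum] += 1
        if variant["id"] in called_ids:
            found[stratum] += 1
    return {stratum: found[stratum] / totals[stratum] for stratum in totals}


# Hypothetical benchmark variants and pipeline results.
truth_variants = [
    {"id": "v1", "type": "SNP", "context": "unique"},
    {"id": "v2", "type": "SNP", "context": "unique"},
    {"id": "v3", "type": "indel", "context": "homopolymer"},
    {"id": "v4", "type": "indel", "context": "homopolymer"},
]
called_ids = {"v1", "v2", "v3"}

by_stratum = stratified_sensitivity(truth_variants, called_ids)
print(by_stratum)
```

With only a handful of variants per stratum the estimates are very coarse, which is the concern the slide raises about a single RM and small gene panels.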
NCBI/CDC GeT-RM Browser
• http://www.ncbi.nlm.nih.gov/variation/tools/get-rm/
• Allows visualization of questionable calls
Initial uses of high-confidence NIST-
GIAB genotypes for NA12878
• NIST has released
several versions of high-
confidence genotypes
for its pilot RM
• These data are
presently being used for
benchmarking
– prior to release of RMs
– SNPs & indels
• ~77% of the genome
Using Genome in a Bottle calls to
benchmark clinical exome sequencing
at Mount Sinai School of Medicine
“We evaluate a set of
NA12878 technical replicates
against GIAB for each new
pipeline version.”
Benchmarking somatic variant calling
at Qiagen
Implications of Technical Accuracy in
Medical Genome Sequencing
• Collaboration with Euan
Ashley group at Stanford
• What is accuracy for
functional variants?
• How much of the exome
falls in high confidence
regions?
• “Black list” in databases
• Sensitivity
– WExS (95%) < WGS (98%)
• especially splicing
– genome < nonsyn < syn
– Most exome FNs caused by
low coverage
– Most WGS FNs caused by
filtering
• Only 81% of ClinVar
pathogenic or likely
pathogenic SNPs fall in
high-confidence regions
– Lots of work to do!
Overview of NIST RM Development
Timeline: Q4 2014, Q1 2015, Q2 2015, Q3 2015, Q4 2015 (milestones for each genome listed in chronological order)
• HG-001/NA12878 (“Pilot” Genome)
– Release NIST RM8398; preliminary large deletions
– Refined structural variants
• HG-002 to HG-004 (Ashkenazim trio)
– Illumina, Complete Genomics, Ion, BioNano, homogeneity/stability
– Preliminary SNPs/indels; 120x-150x PacBio data; “moleculo”; mate-pair; CG-LFR
– Refined SNPs/indels; preliminary SVs
– Refined structural variants
– NIST RMs 8391/8392 release
• HG-005 (son in Asian trio)
– Illumina, Complete Genomics, Ion, BioNano, homogeneity/stability
– “moleculo”; mate-pair; CG-LFR
– Preliminary SNPs/indels
– Refined SNPs/indels; refined structural variants
– NIST RM8393 release
Ashkenazim Jewish PGP RM Trio
Datasets (characteristics; coverage; availability; good for…)
• Illumina paired-end: 150x150 bp; ~300x/individual; fastq on ftp; SNPs/indels/some SVs
• Illumina long mate pair: ~6000 bp insert; ~40x/individual; Feb-Mar 2015; SVs
• Illumina “moleculo”: custom library; ~30x by long fragments; Feb-Mar 2015; SVs/phasing/assembly
• Complete Genomics: 100x/individual; on ftp; SNPs/indels/some SVs
• Complete Genomics LFR: ??; SNPs/indels/phasing
• Ion Proton: exome; 1000x/individual; on SRA; SNPs/indels in exome
• BioNano Genomics: Feb 2015; SVs/assembly
• PacBio: ~10 kb reads; ~120-150x on AJ trio; finished ~Mar 2015; SVs/phasing/assembly/STRs
Asian PGP trio
• Similar sequencing to
Ashkenazim trio except
for PacBio
• Only son will be NIST
RM
Future Directions
Germline mutations
• Difficult regions/variants
– Long-read technologies
– Forming an analysis group
• Tools for assessing
performance
– How to stratify performance
and understand biases?
Somatic mutations
• Pilot interlaboratory study
to assess comparability of
spike-ins
• Commercial members
developing FFPE cell lines
• Participants interested in
mixing different RMs
How to get involved
• Use our integrated
SNP/indel genotypes for
NA12878 and give us
feedback
– Cells and DNA currently
available from Coriell
– NIST RM available April
2015
• Join our new Analysis
group
– Use Long-read
technologies
– Structural Variant calls
– De novo assembly
– Help create the best-ever
characterized trio
• Attend our biannual
workshops (January in CA,
August in MD)
• Develop tools/metrics
with Global Alliance for
Genomics and Health
Benchmarking Team
Acknowledgments
• FDA – Elizabeth Mansfield,
HPC staff
• HSPH
• GCAT - David Mittelman,
Jason Wang
• Francisco De La Vega
• Illumina - Mike Eberle
• Personalis - Deanna Church
• NCBI – Chunlin Xiao
• Celera - Andrew Grupe
• Genome in a Bottle
– www.genomeinabottle.org
– New members welcome!
– Sign up for email newsletters
– jzook@nist.gov
