Assessing the v4𝛂 truth set
with DeepVariant
Andrew Carroll - awcarroll@google.com
Genomics team in Google Brain
Genomics
DeepVariant: Deep Learning for Variant Calling
Conv Network - 26 layers
Data encoded as "tensors"
Output: probability distribution over { HOM_REF, HET, HOM_VAR }
{ 0.001, 0.994, 0.005 }
{ 0.001, 0.990, 0.009 }
{ 0.000, 0.001, 0.999 }
{ 0.600, 0.399, 0.001 }
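To make the output format on this slide concrete, here is a minimal sketch of turning each probability distribution into a genotype call. It assumes the three values are ordered as { HOM_REF, HET, HOM_VAR } and that the call is simply the most probable class; both are assumptions for illustration, and `call_genotype` is a hypothetical helper, not part of DeepVariant.

```python
# Class order taken from the slide: { HOM_REF, HET, HOM_VAR }.
GENOTYPES = ("HOM_REF", "HET", "HOM_VAR")


def call_genotype(probs):
    """Return (genotype, probability) for the most probable class.

    Assumption: the call is a plain argmax over the three class
    probabilities. Real post-processing may apply further logic.
    """
    if len(probs) != len(GENOTYPES):
        raise ValueError("expected one probability per genotype class")
    best = max(range(len(probs)), key=lambda i: probs[i])
    return GENOTYPES[best], probs[best]


# The four example distributions shown on the slide.
SLIDE_OUTPUTS = [
    (0.001, 0.994, 0.005),
    (0.001, 0.990, 0.009),
    (0.000, 0.001, 0.999),
    (0.600, 0.399, 0.001),
]

calls = [call_genotype(p) for p in SLIDE_OUTPUTS]
for probs, (genotype, prob) in zip(SLIDE_OUTPUTS, calls):
    print(probs, "->", genotype, f"(p={prob:.3f})")
```

The last example (0.600 vs 0.399) shows why the full distribution is worth keeping: the top class wins, but with far less confidence than in the other three.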
ConvNets
DeepVariant: Deep Learning for
Variant Calling
HET VAR
HOM VAR
REF
Training:
Show example + label
to update weights
Label Example Updated Weights
Training Strategy: Use chr1-19 for training and chr21-22 for tuning.
Never train on chr20 in any sample. Never train on HG002 (the only exception, as of now, is PacBio CCS).
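The hold-out rules above can be sketched as a small assignment function. This is an illustration only: `assign_split`, the technology strings, and the sample name HG001 used in the usage lines are hypothetical, and the treatment of chromosomes the slide does not mention (for example chrX) as "unused" is an assumption.

```python
# Chromosome roles as stated on the slide.
TRAIN_CHROMS = {f"chr{i}" for i in range(1, 20)}  # chr1-19
TUNE_CHROMS = {"chr21", "chr22"}
HOLDOUT_CHROM = "chr20"
HOLDOUT_SAMPLE = "HG002"
# Assumption: this string labels the one technology exempt from the HG002 rule.
HG002_EXCEPTION = "PacBio CCS"


def assign_split(chrom, sample, technology):
    """Return 'train', 'tune', 'holdout', or 'unused' for one example."""
    if chrom == HOLDOUT_CHROM:
        return "holdout"  # chr20 is never trained on, in any sample
    if sample == HOLDOUT_SAMPLE and technology != HG002_EXCEPTION:
        return "holdout"  # HG002 is held out except for PacBio CCS
    if chrom in TRAIN_CHROMS:
        return "train"
    if chrom in TUNE_CHROMS:
        return "tune"
    return "unused"  # assumption: chromosomes not named on the slide


print(assign_split("chr1", "HG001", "Illumina"))     # train
print(assign_split("chr20", "HG002", "PacBio CCS"))  # holdout
print(assign_split("chr5", "HG002", "PacBio CCS"))   # train
```

Checking chr20 first means the PacBio CCS exception for HG002 can never leak the evaluation chromosome into training.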
Training and Applying DeepVariant on CCS

Input Data: 8M chip, 32x coverage, 11kb insert, In GIAB FTP
Model Trained On: Mix of 1M+8M, inserts, 9x-30x coverage (Release Model)
Results:
Type   Metric     v3.3.2   v4
SNP    Recall     99.95%   99.70%
SNP    Precision  99.85%   99.71%
SNP    F1         99.90%   99.71%
Indel  Recall     98.22%   97.83%
Indel  Precision  98.35%   98.22%
Indel  F1         98.29%   98.03%

Input Data: 8M chip, 100x coverage, Mix of inserts
Model Trained On: 8M chip, 100x coverage, Mix of inserts
Results:
Type   Metric     v3.3.2   v4
SNP    Recall     99.96%   99.40%
SNP    Precision  99.84%   99.92%
SNP    F1         99.90%   99.66%
Indel  Recall     99.35%   99.26%
Indel  Precision  99.30%   99.48%
Indel  F1         99.33%   99.37%
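As a sanity check on the tables above, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the recall and precision values on the slide; `f1_score` and the row labels are hypothetical names, and because the inputs are already rounded to two decimals the result can differ from the reported F1 in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (label, recall %, precision %, F1 % as reported on the slide)
SLIDE_ROWS = [
    ("32x  SNP   v3.3.2", 99.95, 99.85, 99.90),
    ("32x  SNP   v4",     99.70, 99.71, 99.71),
    ("32x  Indel v3.3.2", 98.22, 98.35, 98.29),
    ("32x  Indel v4",     97.83, 98.22, 98.03),
    ("100x SNP   v3.3.2", 99.96, 99.84, 99.90),
    ("100x SNP   v4",     99.40, 99.92, 99.66),
    ("100x Indel v3.3.2", 99.35, 99.30, 99.33),
    ("100x Indel v4",     99.26, 99.48, 99.37),
]

recomputed = {
    label: f1_score(precision, recall)
    for label, recall, precision, _ in SLIDE_ROWS
}
for label, _, _, reported in SLIDE_ROWS:
    print(f"{label}: recomputed {recomputed[label]:.3f}, reported {reported:.2f}")
```

All eight rows agree with the reported F1 to within rounding of the inputs.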
Observations
SNP accuracy is not improved by more coverage (more on this in a bit)
Indel accuracy in v4 is higher, and is also much improved by more coverage
Experiment: Train with the v4 truth set, then evaluate on v3.3.2
Result: Indel F1 improves; SNP F1 decreases
All evidence here points to the indel changes in v4 being quite positive, but the SNP changes being mixed
Error Analysis on 100x model:
Summary of Error Analysis:
INDELS:
Indel errors are beyond my ability to assess (homopolymer junctions, diverse evidence in reads)
SNPs:
When DeepVariant makes a SNP call that is not in v4𝛂, DeepVariant is almost always correct
128 putative FPs on chr20: I assess 2 as real FPs, 4 as unclear, and 122 as v4𝛂 errors
When DeepVariant does not call a SNP that is present in v4𝛂, v4𝛂 was correct in every case I looked at
There is a characteristic signature of nearby variants in these errors
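The chr20 false-positive breakdown is easier to weigh as fractions. A minimal sketch using only the counts from the slide (the dictionary keys are labels chosen here for illustration):

```python
# Manual assessment of the 128 putative SNP false positives on chr20.
putative_fp = 128
assessment = {
    "real FP": 2,
    "unclear": 4,
    "v4 alpha truth-set error": 122,
}

# The three categories should account for every putative FP.
assert sum(assessment.values()) == putative_fp

fractions = {label: count / putative_fp for label, count in assessment.items()}
for label, fraction in fractions.items():
    print(f"{label}: {assessment[label]} of {putative_fp} ({fraction:.1%})")
```

About 95% of the putative false positives are attributed to the truth set rather than to the caller.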
DeepVariant FNs/Unlikely Errors in v4
100x PacBio CCS (8M)
28x PacBio CCS (1M)
84x 10X HP-trio
DeepVariant FPs/Likely Errors in v4
100x PacBio CCS (8M)
28x PacBio CCS (1M)
84x 10X HP-trio
Conclusions
● Truth labels for indels look improved across every metric investigated
● Truth label errors dominate the real remaining errors for DeepVariant in SNPs on v4
● DeepVariant is able to call true signal in this data very well, and is left trying to figure out the mislabel signature
● This seems to be associated with nearby variants in regions that are hard to map with Illumina
● This label error currently limits our ability to use v4 in training (v4 still needs some work for SNPs)
● If the label error is corrected, this may resolve the remaining FP+FN (since DeepVariant will no longer learn the nearby-variant signature). Potential accuracy might be very high, based on the FP count
Unexpected Finding - SV Calls in CCS
SV Calls in Data: (e.g. 232 bp HET insertion)
1 605031 . A ATACATGGAGGGGAACAACACACACCAGGGCCTCTCAGGGGGACAGGGGGTAGGAGACCATCAGGACAAACACGTGGATACATGGAGGGGAACAACACACACCAGGGCCTCTCAGGGGGACAGGGGGTAGGAGACCATCAGGACAAACACGTGGATACATGGAGGGGAACAACACACACCAGGGCCTCTCAGGGGGACAGGGGGTAGGAGACCATCAGGACAAACACGTGGG 22.3 PASS . GT:GQ:DP:AD:VAF:PL 0/1:22:63:39,19:0.301587:22,0,50
Type  Recall   Precision  F1
ALL   39.5 %   94.3 %     55.7 %
INS   54.3 %   93.2 %     68.6 %
DEL   20.3 %   98.6 %     33.7 %
Evaluated by Aaron Wenger (PacBio) on Truvari v0.6 SV set
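To read the example call above, the sketch below splits the record into the standard VCF columns and unpacks the sample field using the FORMAT keys. `parse_record` is a hypothetical helper written for this one record, not a general VCF parser; real VCF files are tab-delimited, and splitting on any whitespace is an assumption made here because the slide text uses spaces.

```python
VCF_COLUMNS = ("CHROM", "POS", "ID", "REF", "ALT", "QUAL",
               "FILTER", "INFO", "FORMAT", "SAMPLE")

# ALT allele from the slide, joined from the four wrapped lines.
ALT = (
    "ATACATGGAGGGGAACAACACACACCAGGGCCTCTCAGGGGGACAGGGGGTAGGAGACCATCAGGAC"
    "AAACACGTGGATACATGGAGGGGAACAACACACACCAGGGCCTCTCAGGGGGACAGGGGGTAGGAGA"
    "CCATCAGGACAAACACGTGGATACATGGAGGGGAACAACACACACCAGGGCCTCTCAGGGGGACAGG"
    "GGGTAGGAGACCATCAGGACAAACACGTGGG"
)
RECORD = " ".join([
    "1", "605031", ".", "A", ALT, "22.3", "PASS", ".",
    "GT:GQ:DP:AD:VAF:PL", "0/1:22:63:39,19:0.301587:22,0,50",
])


def parse_record(line):
    """Parse one single-sample VCF data line into a small summary dict."""
    fields = dict(zip(VCF_COLUMNS, line.split()))
    sample = dict(zip(fields["FORMAT"].split(":"), fields["SAMPLE"].split(":")))
    ref_depth, alt_depth = (int(x) for x in sample["AD"].split(","))
    return {
        "pos": int(fields["POS"]),
        "genotype": sample["GT"],  # 0/1 is a heterozygous call
        "length_change": len(fields["ALT"]) - len(fields["REF"]),
        "depth": int(sample["DP"]),
        "ref_depth": ref_depth,
        "alt_depth": alt_depth,
        "reported_vaf": float(sample["VAF"]),
    }


call = parse_record(RECORD)
# The VAF field should equal alt-supporting reads divided by total depth.
recomputed_vaf = call["alt_depth"] / call["depth"]
print(call["genotype"], call["length_change"], round(recomputed_vaf, 6))
```

The recomputed allele fraction (19 alt-supporting reads out of a depth of 63) matches the VAF field in the record.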
Thanks
