Variant Annotation and
Interpretation:
Tools and Public Data
Functional Genomics Symposium, Qatar
December 12, 2015
Gabe Rudy
@gabeinformatics
VP Product Management and Engineering
Golden Helix
My Background
 Golden Helix
- Founded in 1998
- Genetic association software
- Analytic services
- Over ten-thousand users worldwide
- Over 800 customer citations in journals
 Products I Build with My Team
- SNP & Variation Suite (SVS) - Research
- VarSeq – Clinical & NGS Research
- GenomeBrowse (Free!) - All
 What I Do (Coding, Bioinformatics)
- Build tools, build pipelines of tools
- Blog
- Participate in GA4GH, HGVS Discussions,
NCBI EVAC
Topics
• ACMG Guidelines for Variant Interpretation
• Necessity of visualization
• Public data and tools for annotations
• Accurate gene annotations, choice of “clinical transcripts”
• Variant representation, “left-shifting” and HGVS nomenclature
• Warehousing variants
ACMG Guidelines
 Five-tier terminology system:
- “pathogenic,” “likely pathogenic,” “uncertain significance,” “likely
benign,” and “benign”
- Mendelian and mitochondrial variants
- Variant assessment guidelines as combined from 11 labs
- Report variant with condition and inheritance pattern
- c.1521_1523delCTT (p.Phe508del),
pathogenic, cystic fibrosis, autosomal recessive
- Likely pathogenic, likely benign mean 90% certainty
- Provide genomic coordinates (g.)
- Transcript selection up to lab to define "clinically relevant"
You Need Visualization, Not Just a Table to Interpret
 Recovery of Frameshift (in Supercentenarian)
Visualization of Variants to Aid Interpretation
 Variants + Genomic Context
- Where it is in gene
- Annotations that match, don’t match
- Other variants in cohort / warehouse
- Locality and rare/common variants
- Locality of pathogenic variants
 Interpreting Multiple Transcripts
 Alignment Evidence
- BAM files provide more than is in VCF
- Phasing of same-read mutations
- Examine sites of related samples with
no variants called
Visualization
 Free Genome Browsers:
- IGV
- Popular desktop by Broad
- UCSC
- Web-based, most extensive
annotations
- GenomeBrowse
- Designed to be publication ready
- Smooth zoom and navigation
- Built in all Golden Helix curated
annotations (stream or
download)
Annotation with Public Data
 Pop databases
- Don't assume “population” == healthy controls
- ExAC, EVS, 1kG, dbSNP
 Disease databases:
- OMIM, ClinVar, HGMD
 In-Silico Prediction
- Whether missense change is damaging
- 65–80% accurate when examining known disease variants
- Expect over-sensitivity, but can be a low-pass filter to call “likely benign”
- Expect correlation between tools, as they often use similar underlying pieces of evidence
- Splicing: predicting the effect of variants on splicing
 RefSeqGenes and Human Reference
Annotations are Hard!
 HGVS is a standard that is not
computable
- Tries to serve different goals
- Many representations of same variant
- Difficult when used as identifier, but only
alternative is genomic representation (g.)
 Transcripts
- Transcript set choice extremely important
- Hard to curate with meaningful tx attributes.
 Public Data Curation
- ClinVar: multi-record lines, bits in VCF/XML
- NHLBI: MAF vs AAF, splitting “glob” fields
- 1kG: No genotype counts
- ExAC: Multi-allelic splitting, left-align
- ClinVitae (and COSMIC): only HGVS
- dbNSFP: Abbreviations and aggregate
scores
 Versioning and Issues
- ClinVar missing ~5K pathogenic in VCF
- dbSNP patches without version changes
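The "MAF vs AAF" curation issue can be illustrated with a minimal sketch. It assumes a biallelic site with hypothetical allele counts; the function names are illustrative and do not come from any of the databases above. The alternate allele frequency (AAF) is always tied to the ALT allele, while the minor allele frequency (MAF) reports whichever allele is rarer, so the two differ when the alternate allele is the common one.

```python
def alt_allele_frequency(ref_count, alt_count):
    """Frequency of the alternate allele, regardless of which allele is rarer."""
    return alt_count / (ref_count + alt_count)


def minor_allele_frequency(ref_count, alt_count):
    """Frequency of the rarer allele at a biallelic site."""
    return min(ref_count, alt_count) / (ref_count + alt_count)


# Hypothetical site where the reference allele is the rare one.
ref_count, alt_count = 20, 80
print("AAF:", alt_allele_frequency(ref_count, alt_count))
print("MAF:", minor_allele_frequency(ref_count, alt_count))
```

Filtering on a "rare variant" threshold gives different answers depending on which of the two a catalog reports, so it helps to convert every source to one convention before annotating.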
Population Catalogs
 1000 Genomes (WGS, Exome, SNP Array)
- Many releases, most recent now
standardized, still incrementally updated
- 2,500 genomes – Phase3
 “ESP” (NHLBI 6,500 Exomes) (a.k.a EVS)
- Had many releases, now V2-SSA137 0.0.30
- European American / African American only
 ExAC (Broad 61,486 Exomes v0.3)
- Many sub-populations
 Supercentenarians (110+ yo, 17 WGS)
- Available as raw Complete Genomics data
- Requires normalizing to match Illumina NGS
InSilico Predictions
 Non-synonymous functional
predictions
- SIFT, PolyPhen2, LRT, MutationTaster,
MutationAssessor, FATHMM
 Conservation
- GERP++, PhyloP, phastCons
 All-In-One Scores
- CADD, VAAST, VEST3, DANN, FATHMM-MKL, MetaSVM and MetaLR
- Use machine learning, “feature selection”,
train and predict on public databases
- Can predict synonymous and intergenic variants
 dbNSFP 3.0 – 82M precomputed scores
- N of 6 Voting on prediction algorithms
 RNA Splicing Effect (dbscSNV)
- 5+ splice algorithms, can pre-compute
- −3 to +8 at the 5’, −12 to +2 at the 3’
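The "N of 6 voting" idea can be sketched as a count of how many tools call a variant damaging. This is a minimal sketch under stated assumptions: the tool names are taken from the list above, but which six tools are voted on, the per-tool calls, and the handling of missing scores are all hypothetical here.

```python
def damaging_votes(calls):
    """Return (number of damaging calls, number of tools that gave a call).

    `calls` maps tool name to True (damaging), False (tolerated),
    or None when the tool has no precomputed score for the variant.
    """
    scored = {tool: call for tool, call in calls.items() if call is not None}
    return sum(scored.values()), len(scored)


# Hypothetical calls for a single missense variant.
example_calls = {
    "SIFT": True,
    "PolyPhen2": True,
    "LRT": False,
    "MutationTaster": True,
    "MutationAssessor": None,  # no score available
    "FATHMM": True,
}

votes, scored = damaging_votes(example_calls)
print(f"{votes} of {scored} tools predict damaging")
```

Reporting the denominator alongside the vote count matters because the tools share underlying evidence and do not all score every variant.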
Disease Databases
 ClinVar
- Voluntary submissions from labs
- Use 5-tier classification (variant + phenotype
pairs)
- Star-rating of variants
- Lab owns submission, can revoke and
monitor status
 ClinVitae (Invitae curated, not updated)
 OMIM
- Gene to Phenotype documentation
- Expertly curated, hand updated
- Changes dynamically
- Small list of cited / implicated variants
 HGMD
- Commercially supported
- Best linkage of (possible) publication to
variant/genes
- Classifications not directly trusted
 Your own Lab (more later)
Web-Based Annotation Tools
 NCBI Variant Reporter
- HGVS Annotation
- PubMed, ClinVar links
 SeattleSeq
- NHLBI supported
- Some public annotations
 Ensembl VEP
- Same as running VEP locally
 Scripps Genome ADVISER
- Out-of-date annotations
- Scripps Wellderly Frequencies
- Splice Site Predictions
- Basic Java GUI for filtering
 Mutalyzer – HGVS only
Variant Annotation Tools
 snpEff
- Open source, commercial use allowed
- Tx Annotation, HGVS output
- Limited public annotations
 ANNOVAR
- Academic/Commercial split
- Many public annotations
- Non-standard Tx prioritization
 Ensembl VEP
- Ensembl tx only, HGVS output
- Limited public annotations
 VarSeq
- Commercially supported
- Largest public annotation repo
- RefSeq/Ensembl tx, HGVS
- Clinical Tx, many export formats
- Integrated data transformations
Reference Sequence and Transcripts
 RefSeqGenes – mRNA sequence archive, with mappings to genomes
- Provided mappings to Locus Reference Genomic (LRG) database
- Use genome mappings by NCBI (through genome annotation builds). NOT UCSC
- “Clinically Relevant” metric:
- LRG if available
- Longest if tied
 Ensembl – defined directly against the human genome
- More inclusive of genes discovered with high-throughput methods
- Gencode subset – similar to RefSeqGenes in size / definition
 Each has unique Accessions and Version Numbers
- Newer releases target GRCh38
- GRCh37 mappings not being updated (unfortunately)
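The "Clinically Relevant" metric above (LRG if available, longest if tied) can be sketched as a sort key. This is a minimal sketch: the `Transcript` fields, the accessions, and the use of coding length as the measure of "longest" are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    accession: str      # hypothetical accession.version
    in_lrg: bool        # True if the transcript is part of an LRG record
    coding_length: int  # assumed measure of "longest"


def pick_clinical_transcript(transcripts):
    """Prefer LRG transcripts, then break ties by the longest coding length."""
    return max(transcripts, key=lambda t: (t.in_lrg, t.coding_length))


candidates = [
    Transcript("TX_A.1", in_lrg=False, coding_length=3000),
    Transcript("TX_B.2", in_lrg=True, coding_length=1800),
    Transcript("TX_C.1", in_lrg=True, coding_length=2100),
]
print(pick_clinical_transcript(candidates).accession)
```

The tuple key means a long non-LRG transcript never outranks an LRG one; a lab that defines "clinically relevant" differently would change only the key.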
Reference Sequence Versus Gene Sequence
EMG1 on GRCh37
 “Gap” of the mRNA coding sequence versus reference seq:
 Handled differently by 3 different “gene alignments”
Reference Sequence Versus Gene Sequence
EMG1 on GRCh38
 Reference sequence patched, no gap
 Alignments agree
RefSeq Accession Not Sufficient for Var-Tx Interaction
RefSeq defines transcripts as mRNA sequence
 NCBI “Annotation Releases” (like v105) provides alignments using “Splign”
 UCSC pulls RefSeq mRNA and aligns it independently using “BLAT”
 They can choose equally valid but different alignments for the same accession
 This alignment of NM_052814.3 places the exon at dramatically different loci.
 Will result in different annotations of any variant overlapping these exons
Variant Representation and Normalization
 Allelic Primitives
- AG/CT -> A/C & G/T
- AT/G -> A/- & T/G
- May have different annotations
 Left Align
- NGS standard, not consistent
historically
- May be needed after primitives
- HGVS -> 3’ shift (right for forward)
 Multi-Allelic (2 Non-Ref Alleles)
- Each non-ref has own annotations
- Pop level should be “split” for counts
 HGVS, Transcript Projection
- Dependent on Tx->Genome Mapping
- hgvs-eval: Benchmarking tool in
progress
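Two of the normalization steps above can be sketched in a few lines: splitting a multi-allelic record and breaking an equal-length substitution into allelic primitives. This is a minimal sketch with hypothetical coordinates and function names. It does not handle the unequal-length case (such as AT/G), which needs an alignment-based decomposition, and it ignores the per-allele count fields that also have to be split in a real VCF.

```python
def split_multiallelic(chrom, pos, ref, alts):
    """Emit one (chrom, pos, ref, alt) record per non-reference allele."""
    return [(chrom, pos, ref, alt) for alt in alts]


def mnp_to_primitives(chrom, pos, ref, alt):
    """Break an equal-length substitution into single-base changes."""
    if len(ref) != len(alt):
        raise ValueError("unequal lengths need an alignment-based decomposition")
    return [
        (chrom, pos + offset, r, a)
        for offset, (r, a) in enumerate(zip(ref, alt))
        if r != a
    ]


# Reference G with alternates T and C, as in a multi-allelic site.
print(split_multiallelic("21", 500, "G", ["T", "C"]))
# AG/CT becomes A/C and G/T at adjacent positions.
print(mnp_to_primitives("1", 100, "AG", "CT"))
```

After either step each resulting record can be annotated on its own, which is why the primitives may receive different annotations than the original combined call.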
Left-Align Annotations
 Using a Smith-Waterman algorithm to left-align variants from public databases shows non-obvious differences
 NGS alignment and variant calling are always left-aligned
 Left-align your databases so they can be used for annotation
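Left-alignment of a single variant can be sketched without a full aligner. The version below is not the Smith-Waterman approach named on the slide; it is the simpler trim-and-extend normalization commonly used for VCF records, shown here on a hypothetical reference string with 1-based positions.

```python
def left_align(pos, ref, alt, reference):
    """Left-align and trim one variant against a reference sequence string.

    `pos` is the 1-based position of the first base of `ref`.
    """
    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, pull in one base from the left.
        if not ref or not alt:
            if pos <= 1:
                raise ValueError("cannot extend left of the sequence start")
            pos -= 1
            base = reference[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Drop shared leading bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Hypothetical reference with a CA repeat at positions 4-9.
reference = "TTGCACACAGG"
# A deletion of one "CA" unit reported at the right end of the repeat.
print(left_align(7, "ACA", "A", reference))
```

The same deletion reported at either end of the repeat normalizes to one record, which is what allows a called variant to match the corresponding database entry.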
Left-Align Delta F508 to Make it Match
Called in Both Locations – Affect Frequencies
Allelic Split + Left Align. Discover Existing Freq
Multi Allelic
 The Supercentenarian annotation found records for both alternates, and looks
like this:
 Trio Analysis, Variant is a G/T/C (Reference G, Alternates of T/C):
Variant Warehouse
"Clinical laboratories should implement an internal system to track
all sequence variants identified in each gene and clinical assertions
when reported. This is important for tracking genotype–phenotype
correlations and the frequency of variants in affected and normal
populations."
Why Warehouse?
 A place to archive full VCFs of every
sequenced sample (by assay/test)
 Query and retrieve subsets of data
at any time
 Ask the Variant Warehouse:
- Have I ever seen this variant in my
previous test samples?
- At what frequency? (counts as well)
- Does this gene contain other rare variants
in my cohort?
- Did I provide a pathogenicity assessment
for this variant? Has that changed?
- Has ClinVar changed since that
assessment was initially made?
- Have I put this variant into a clinical report
for any previous samples?
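The first two warehouse questions (have I seen this variant, and at what frequency) can be sketched with an in-memory SQLite database. This is a minimal sketch: the table layout, sample identifiers, and coordinates are hypothetical, and the frequency assumes diploid autosomal genotypes with every sample assayed at the site.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE samples (sample_id TEXT PRIMARY KEY, assay TEXT);
    CREATE TABLE observations (
        sample_id TEXT, chrom TEXT, pos INTEGER, ref TEXT, alt TEXT,
        alt_copies INTEGER  -- 1 = heterozygous, 2 = homozygous alternate
    );
""")
conn.executemany("INSERT INTO samples VALUES (?, ?)",
                 [(f"S{i}", "exome") for i in range(1, 5)])
conn.executemany("INSERT INTO observations VALUES (?, ?, ?, ?, ?, ?)", [
    ("S1", "21", 1000, "G", "C", 1),
    ("S2", "21", 1000, "G", "C", 2),
    ("S3", "21", 2000, "A", "T", 1),
])


def seen_before(conn, chrom, pos, ref, alt):
    """Return (carrier samples, alternate allele count, allele frequency)."""
    carriers, alt_alleles = conn.execute(
        "SELECT COUNT(DISTINCT sample_id), COALESCE(SUM(alt_copies), 0) "
        "FROM observations WHERE chrom = ? AND pos = ? AND ref = ? AND alt = ?",
        (chrom, pos, ref, alt),
    ).fetchone()
    total = conn.execute("SELECT COUNT(*) FROM samples").fetchone()[0]
    freq = alt_alleles / (2 * total) if total else 0.0
    return carriers, alt_alleles, freq


print(seen_before(conn, "21", 1000, "G", "C"))
```

Matching on chromosome, position, reference, and alternate only works if every stored variant was normalized the same way before loading. Assessments and report history would live in further tables keyed on the same four columns.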
NM_002626.4:c.1877G>C in PFKL
 NP_002617.3:p.Arg626Pro missense mutation
 Predicted damaging by 4/5 functional predictions
 VEST3: 0.948, GERP++: 4.59
 ExAC and 1kG have a G>A, but G>C is novel
 Variants in region are extremely rare (G>C ExAC 4 of 122,364 alleles) – 0.003%
 No ClinVar variants for gene
 OMIM entry has no known disease association
 PubMed search shows few recent articles: Most recent 1998 paper showed
- phosphofructokinase (PFKL) overexpressed in Down syndrome (DS)
- Transgenic PFKL mice had an abnormal glucose metabolism with reduced clearance
rate from blood and enhanced metabolic rate in brain.
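The ExAC frequency quoted above follows directly from the allele count and allele number on the slide. A short check, using only those two figures (the variable names are illustrative):

```python
allele_count = 4          # alternate alleles observed, from the slide
allele_number = 122_364   # total alleles genotyped at the site, from the slide

frequency = allele_count / allele_number
individuals = allele_number // 2  # diploid site: two alleles per person

print(f"allele frequency: {frequency * 100:.4f}%")
print(f"individuals genotyped: {individuals}")
```

The percentage rounds to the roughly 0.003% shown on the slide, and the allele number corresponds to 61,182 individuals genotyped at this position.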
35 LoF Variants, None Homozygous
Questions?
