Next Generation Sequencing Bioinformatics:
From Data To Precision Medicine
April 11, 2017
Gabe Rudy, Vice President of
Product and Engineering
My Background
 Golden Helix
- Founded in 1998
- Genetic association software
- Clinical lab variant analysis software
- Thousands of users worldwide
- Over 1000 customer citations in journals
 Products I Build with My Team
- VarSeq
- Annotate and filter variants in gene panels, exomes and
genomes for clinical labs and researchers.
- SNP & Variation Suite (SVS)
- SNP, CNV, NGS tertiary analysis
- Import and deal with all flavors of upstream data
- GenomeBrowse (Free!)
- Visualization of everything with genomic coordinates.
All standardized file formats.
CS Lecture 2017 04-11 from Data to Precision Medicine
Sequencers: Versatile tools for science
Genomics is Big Data
 5,000 public data repositories
 Broad Institute:
- Process 40K samples/year
- 1000 people
- 51 High Throughput Sequencers
- 10+ PB of storage
 1 Genome in Data
- ~300GB Compressed Sequence Data
- ~150MB Compressed Variant Data
- Seq data went through 5-6 steps
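To put the figures above in proportion, here is a back-of-the-envelope calculation. It is only a sketch: it assumes decimal units and, purely for illustration, that every one of the 40K samples is a whole genome of about 300GB, which the slide does not state.

```python
# Back-of-the-envelope arithmetic from the figures on this slide.
# Assumptions (not from the slide): decimal units (1 GB = 1000 MB,
# 1 PB = 1,000,000 GB) and every sample being a ~300 GB whole genome.
GB_PER_PB = 1_000_000
MB_PER_GB = 1_000

sequence_gb = 300          # compressed sequence data per genome
variants_mb = 150          # compressed variant data per genome
samples_per_year = 40_000  # Broad Institute throughput

# How much smaller the variant data is than the sequence data.
reduction = sequence_gb * MB_PER_GB / variants_mb

# Raw sequence volume per year under the whole-genome assumption.
yearly_pb = samples_per_year * sequence_gb / GB_PER_PB

print(f"sequence -> variants reduction: {reduction:.0f}x")
print(f"sequence data per year if all samples were genomes: {yearly_pb:.0f} PB")
```

Under these assumptions the secondary-analysis steps shrink the data by roughly three orders of magnitude, which is why tertiary analysis can run on far more modest hardware.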
Next Generation Sequencing Analysis
Primary
Analysis
Secondary
Analysis
Tertiary
Analysis
“Sense Making”
 Analysis of hardware generated data, software built by vendors
 Use FPGAs and GPUs to handle real-time optical or electrical signals
from sequencing hardware
 Filtering/clipping of “reads” and their qualities
 Alignment/Assembly of reads
 Recalibrating, de-duplication, variant calling on aligned reads
 QA and filtering of variant calls
 Annotation (querying) variants to databases, filtering on results
 Merging/comparing multiple samples (multiple files)
 Visualization of variants in genomic context
 Statistics on matrices
BWA+GATK Best Practices Pipeline
Agenda
1. Bioinformatics 101: Pipelines and File Formats
2. Data Access Patterns: Databases or Flat Files?
3. Big Data Tables: Tricks from Data Warehousing
4. A Genomic Index: Specialized R-Trees, Bins, NC-Lists
5. Questions
Genomic Data Lives in 1D Coordinate Space
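Because every read, variant and annotation is an interval on a chromosome, most queries reduce to "what overlaps this range?". The sketch below shows an interval overlap test together with the hierarchical binning function described in the SAM specification, one of the "bins" approaches named in the agenda. The feature names and coordinates are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def overlaps(a_beg: int, a_end: int, b_beg: int, b_end: int) -> bool:
    """True if two zero-based half-open intervals [beg, end) share a base."""
    return a_beg < b_end and b_beg < a_end


def reg2bin(beg: int, end: int) -> int:
    """Smallest bin that fully contains the zero-based half-open [beg, end).

    Bins are nested: 16 kb, 128 kb, 1 Mb, 8 Mb, 64 Mb, then one 512 Mb root.
    """
    end -= 1
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)
    return 0


# Hypothetical features on one chromosome: (name, beg, end).
features = [
    ("geneA_exon1", 10_000, 10_200),
    ("geneA_exon2", 15_500, 17_000),
    ("geneB", 200_000, 900_000),
]

# Group features by bin so a query only has to inspect a few buckets.
by_bin: Dict[int, List[str]] = defaultdict(list)
for name, beg, end in features:
    by_bin[reg2bin(beg, end)].append(name)

print(dict(by_bin))
```

Small features land in small bins and large features in large ones, so a range query checks only the bins that can overlap the range instead of scanning every record.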
FASTQ
 Contains 3 things per read:
- Sequence identifier (unique)
- Sequence bases [len N]
- Base quality scores [len N]
 Often “gzip” compressed (fq.gz)
 If not demultiplexed, the first 4 or 6 bp
are the “barcode” index, used to
split lanes out by sample.
 Filtering may include:
- Removing adapters & primers
- Clip poor quality bases at ends
- Remove flagged low-quality reads
@HWI-ST845:4:1101:16436:2254#0/1
CAAACAGGCATGCGAGGTGCCTTTGGAAAGCCCCAGGGCACTGTGGCCAG
+
Y[SQORPMPYRSNP_][_babBBBBBBBBBBBBBBBBBBBBBBBBBB
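The record layout and the "clip poor quality bases at ends" step can be sketched as follows. This is a minimal illustration, not a production parser: it assumes strict 4-line records and a Phred+64 quality offset (which the characters in the example above suggest, though many current files use Phred+33), and the names `FastqRead`, `read_fastq` and `clip_low_quality_end` are made up for this example. The toy record is invented.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRead(NamedTuple):
    name: str
    bases: str
    quals: List[int]  # one Phred score per base


def read_fastq(handle: TextIO, offset: int = 64) -> Iterator[FastqRead]:
    """Yield one FastqRead per 4-line record (header, bases, '+', qualities)."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        bases = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line carries no data here
        qual_line = handle.readline().rstrip("\n")
        if len(bases) != len(qual_line):
            raise ValueError(f"length mismatch in record {header}")
        quals = [ord(ch) - offset for ch in qual_line]
        yield FastqRead(header[1:], bases, quals)


def clip_low_quality_end(read: FastqRead, min_q: int = 3) -> FastqRead:
    """Drop trailing bases whose quality is below min_q (assumed threshold)."""
    keep = len(read.bases)
    while keep > 0 and read.quals[keep - 1] < min_q:
        keep -= 1
    return FastqRead(read.name, read.bases[:keep], read.quals[:keep])


# Toy record (invented): six bases at Q40, one at Q39, three at Q2.
toy = io.StringIO("@read1\nACGTACGTAC\n+\nhhhhhhgBBB\n")
reads = [clip_low_quality_end(r) for r in read_fastq(toy)]
print(reads[0].bases, reads[0].quals)
```

The `offset` parameter is the part most likely to need changing for real data; decoding with the wrong offset shifts every score by 31.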
Aligners
 1981 Smith and Waterman
- Dynamic programming algorithm
- Finds optimal local alignment of two sequences
- Seq of length m and n, O(mn) time required
 Hashing-Based Aligners (2008)
- SOAP, Eland, MAQ
- ~14GB RAM to use with human
 Burrows Wheeler Transform Aligners (2009)
- BWA, Bowtie, SOAP2
- Order of magnitude less RAM and Time
 Hybrid Aligners (2012/13)
- RTG, BWA-MEM, Bowtie 2, Isaac
- Seed and extend
- Handle longer reads (>100bp) with larger gaps
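The O(mn) cost of Smith-Waterman comes from filling one table cell per pair of positions. A minimal sketch follows, assuming a simple linear gap penalty and arbitrary scores (match +2, mismatch -1, gap -2) chosen only for illustration; real aligners use affine gaps and heavy optimization.

```python
from typing import Tuple


def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1,
                   gap: int = -2) -> Tuple[int, str, str]:
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the table: every cell looks at three neighbours, hence O(mn).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTACGTAA", "GGACGTCC"))
```

With a 100 bp read against a 3 Gbp reference the table would have about 3 × 10^11 cells per read, which is why the hashing, BWT and hybrid aligners listed above restrict dynamic programming to a few candidate locations.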
BWT
Backtracking – query ‘ggta’ with 1 mismatch
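The idea behind the two slides above can be sketched in a few lines: build the Burrows-Wheeler transform of the reference, then match the query from its last character backwards, spending a mismatch "budget" when trying alternative bases. This is a simplified illustration, not BWA's implementation: it sorts suffixes naively, stores a full occurrence table, and does no pruning of the backtracking. The reference string is invented; the query 'ggta' is the one from the slide.

```python
from typing import Dict, List, Tuple

Index = Tuple[List[int], Dict[str, int], Dict[str, List[int]]]


def build_index(text: str) -> Index:
    """Suffix array, C table and occurrence table for text + '$'."""
    text += "$"  # sentinel that sorts before every letter
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # char preceding each sorted suffix
    alphabet = sorted(set(text))
    # c_table[ch] = number of characters in text smaller than ch
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += text.count(ch)
    # occ[ch][k] = occurrences of ch in bwt[:k]
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for k, seen in enumerate(bwt):
        for ch in alphabet:
            occ[ch][k + 1] = occ[ch][k] + (seen == ch)
    return sa, c_table, occ


def find(query: str, index: Index, max_mismatches: int) -> List[int]:
    """Start positions where query matches with at most max_mismatches."""
    sa, c_table, occ = index
    hits = set()

    def recurse(i: int, lo: int, hi: int, budget: int) -> None:
        if lo >= hi:          # empty suffix-array interval: dead end
            return
        if i < 0:             # whole query consumed: report positions
            hits.update(sa[lo:hi])
            return
        for ch in c_table:
            if ch == "$":
                continue
            cost = 0 if ch == query[i] else 1
            if cost <= budget:
                recurse(i - 1,
                        c_table[ch] + occ[ch][lo],
                        c_table[ch] + occ[ch][hi],
                        budget - cost)

    recurse(len(query) - 1, 0, len(sa), max_mismatches)
    return sorted(hits)


index = build_index("acggtaggcaggaa")
print(find("ggta", index, 0))  # exact matches only
print(find("ggta", index, 1))  # allow one mismatch
```

Each step narrows an interval of the suffix array using only two table lookups, so exact matching takes time proportional to the query length, independent of the reference size.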
SAM/BAM
 Spec defined by samtools author
Heng Li, aka Li H, aka lh3.
 SAM is text version (easy for any
program to output)
 BAM is binary/compressed version
with indexing support
 Alignment is described (in the CIGAR field) in terms of
matches, insertions, deletions,
gaps and clipping
 Can have any custom flags set by
analysis program (and many do)
Key Fields
 Chr, position
 Mapping quality
 CIGAR
 Name/position of mate
 Total template length
 Sequence
 Quality
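As an illustration of how the CIGAR field relates position, sequence and template length, the sketch below parses a CIGAR string and computes how many read bases and reference bases it covers. The operator sets follow the SAM specification; the function names and the example CIGAR are invented for this example.

```python
import re
from typing import List, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")  # operators that advance along the reference
CONSUMES_QUERY = set("MIS=X")      # operators that advance along the read


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split e.g. '5S40M' into [(5, 'S'), (40, 'M')], rejecting junk."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return ops


def query_length(cigar: str) -> int:
    """Number of read bases the CIGAR accounts for (length of Sequence)."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


def reference_end(pos: int, cigar: str) -> int:
    """1-based inclusive end on the reference, given the 1-based start pos."""
    span = sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)
    return pos + span - 1


# 5 soft-clipped bases, a 2 bp insertion and a 3 bp deletion.
cigar = "5S40M2I8M3D45M"
print(query_length(cigar), reference_end(1000, cigar))
```

Soft clips and insertions add to the read length but not to the reference span, while deletions do the opposite, which is why the two numbers differ.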
Variant Callers
 Samtools
- “mpileup” command computes BAQ, performs local realignment
- Many filters can be applied to get high-quality variants
 GATK
- More than just a variant caller, but UnifiedGenotyper is widely used
- Also provides pre-calling tools like local InDel realignment and quality
score recalibration
 FreeBayes
 Custom tools specific to platform:
- CASAVA includes a variant caller for Illumina whole-genome data
- Ion Torrent has a caller that handles InDels better for their tech
 Commercial:
- Real Time Genomics
- Arpeggi
VCF
 Specification defined by the 1000 Genomes
Project (now v4.1)
 Commonly compressed with bgzip and indexed with
tabix (allows a genome browser to read it
directly)
 Contains arbitrary data per “site” (INFO
fields) and per sample
 Single-Sample VCF:
- Contains only the variants for the sample.
 Multi-Sample VCF:
- Whenever one sample has a variant, all samples get
a “genotype” (often “ref”)
 Caveat:
- VCF requires a reference base to be specified, so
insertions are “encoded” 1bp differently than they
are annotated
- Various opinions on how to encode CNV/SV
- gVCF is a VCF file with lines for tracts of ref match
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Sa
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth"
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequen
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral All
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membershi
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 members
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have d
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Q
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype
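The `##INFO`, `##FILTER` and `##FORMAT` lines above are what make the "arbitrary data per site and per sample" self-describing. A minimal sketch of reading them follows; the function name `parse_structured_meta` is invented, and the parser deliberately returns `None` for anything that is not a complete `##KEY=<...>` line (such as the header lines that are cut off on the slide) rather than guessing at the missing text.

```python
import re
from typing import Dict, Optional, Tuple

META_LINE = re.compile(r"^##(\w+)=<(.*)>$")
# key=value pairs; a value is either a quoted string or runs up to a comma
KEY_VALUE = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|[^,]*)')


def parse_structured_meta(line: str) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (kind, fields) for a complete '##KIND=<...>' line, else None."""
    match = META_LINE.match(line.rstrip("\n"))
    if match is None:
        return None
    kind, body = match.groups()
    fields = {}
    for key, value in KEY_VALUE.findall(body):
        # Strip the surrounding quotes from quoted values.
        fields[key] = value[1:-1] if value.startswith('"') else value
    return kind, fields


lines = [
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    '##FILTER=<ID=q10,Description="Quality below 10">',
    '##fileformat=VCFv4.0',                      # plain key=value, not structured
    '##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Sa',  # truncated
]
for line in lines:
    print(parse_structured_meta(line))
```

A reader can use the `Number` and `Type` fields collected this way to decide how to split and convert each INFO or per-sample value on the data lines.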
Visualization
 Genome browsers:
- Validate variant calls
- Look at gene annotations,
problematic regions, population
catalogs
- Compare samples where no
variant was called
 Free Genome Browsers:
- IGV
- Popular desktop browser from the Broad Institute
- UCSC
- Web-based, most extensive
annotations
- GenomeBrowse
- Designed to be publication ready
- Smooth zoom and navigation
Sample Variant Analysis Workflow
Workflow steps, in order:
- VCF file goes in
- Filter out common and low-quality variants
- Filter by inheritance or zygosity state
- Reduce to non-synonymous
- Prioritize remaining variants (?)
 Many NGS tertiary analysis
workflows follow a system of
annotation-based filtering
 Common to have a long list of
candidate variants
 Variants need to be prioritized
for validation experiments
 Prioritizing those candidates
is extremely important, but can
be a very difficult process
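The annotation-based filtering described above can be pictured as a chain of predicates, each one shrinking the candidate list. The sketch below is hypothetical throughout: the `Variant` fields, the thresholds (1% frequency, quality 20), the set of protein-changing effects and the toy variants are all assumptions made for illustration, not values from the talk.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Variant:
    name: str
    pop_freq: float  # highest allele frequency found in population catalogs
    qual: float      # variant caller quality
    zygosity: str    # "het", "hom" or "hemi"
    effect: str      # e.g. "missense", "synonymous", "stop_gained"


PROTEIN_CHANGING = {"missense", "stop_gained", "frameshift"}


def run_workflow(variants: Sequence[Variant], max_freq: float = 0.01,
                 min_qual: float = 20.0,
                 zygosities: Tuple[str, ...] = ("hom", "hemi")):
    """Apply the filter chain; return survivors and the count after each step."""
    steps: List[Tuple[str, Callable[[Variant], bool]]] = [
        ("rare and high quality",
         lambda v: v.pop_freq <= max_freq and v.qual >= min_qual),
        ("zygosity", lambda v: v.zygosity in zygosities),
        ("non-synonymous", lambda v: v.effect in PROTEIN_CHANGING),
    ]
    remaining = list(variants)
    counts = [("input", len(remaining))]
    for label, keep in steps:
        remaining = [v for v in remaining if keep(v)]
        counts.append((label, len(remaining)))
    return remaining, counts


toy_variants = [
    Variant("v1", 0.20, 50.0, "hom", "missense"),     # too common
    Variant("v2", 0.00, 5.0, "hom", "missense"),      # low quality
    Variant("v3", 0.00, 50.0, "het", "missense"),     # wrong zygosity
    Variant("v4", 0.00, 50.0, "hemi", "synonymous"),  # no protein change
    Variant("v5", 0.00, 50.0, "hemi", "missense"),    # survives
]
survivors, counts = run_workflow(toy_variants)
print(counts)
print([v.name for v in survivors])
```

Reporting the count after every step makes it easy to see which filter is doing the work; the survivors still need the prioritization step, which no simple predicate replaces.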
Annotation with Public Data
 Public Annotation Data and Tools
- Most produced through academic research or consortia
- Centralized hosting on NCBI, Ensembl, UCSC
 Important categories:
- Population catalogs: how common is a variant?
- Gene/Transcript: is a variant in a gene and how does it change the gene?
- In-silico predictions: How likely is a variant to impair the gene's function?
- Knowledgebase: What do we know about particular variants/genes in human diseases?
Population Catalogs
 1000 Genomes (WGS, Exome, SNP Array)
- Many releases, most recent now
standardized, still incrementally updated
- 2,500 genomes – Phase3
 “ESP” (NHLBI 6,500 Exomes) (a.k.a. EVS)
- Had many releases, now V2-SSA137 0.0.30
- European American / African American only
 ExAC (Broad 61,486 Exomes v0.3)
- Many sub-populations
 Supercentenarians (110+ yo, 17 WGS)
- Available as raw Complete Genomics data
- Requires normalizing to match Illumina NGS
In-Silico Predictions
 Non-synonymous functional
predictions
- SIFT, PolyPhen-2, LRT, MutationTaster,
MutationAssessor, FATHMM
 Conservation
- GERP++, PhyloP, phastCons
 All-In-One Scores
- CADD, VAAST, VEST3, DANN, FATHMM-MKL,
MetaSVM and MetaLR
- Use machine learning, “feature selection”,
train and predict on public databases
- Can also predict for synonymous and intergenic variants
 dbNSFP 3.0 – 82M precomputed scores
- N of 6 Voting on prediction algorithms
 RNA Splicing Effect (dbscSNV)
- 5+ splice algorithms, can pre-compute
- Covers −3 to +8 at the 5’ splice site and −12 to +2 at the 3’ splice site
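The "N of 6 voting" idea can be sketched as a simple tally over per-tool calls. How each tool's raw score is turned into a damaging/tolerated call is tool-specific and not shown here; the function name `vote` and the example calls are invented, and a missing prediction is represented as `None`.

```python
from typing import Mapping, Optional, Tuple


def vote(calls: Mapping[str, Optional[bool]]) -> Tuple[int, int]:
    """Return (tools calling the variant damaging, tools that scored it).

    calls maps a tool name to True (damaging), False (tolerated)
    or None (the tool produced no prediction for this variant).
    """
    scored = [call for call in calls.values() if call is not None]
    return sum(scored), len(scored)


# Hypothetical calls for one missense variant.
example_calls = {
    "SIFT": True,
    "PolyPhen-2": True,
    "LRT": False,
    "MutationTaster": True,
    "MutationAssessor": None,  # no score available
    "FATHMM": False,
}
damaging, scored = vote(example_calls)
print(f"{damaging} of {scored} scoring tools call this variant damaging")
```

Keeping the number of tools that actually scored the variant alongside the vote matters, because "3 of 3" and "3 of 6" are very different levels of agreement.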
Disease Knowledgebases
 ClinVar
- Voluntary submissions from labs
- Uses 5-tier classification (variant + phenotype
pairs)
- Star-rating of variants
- Lab owns submission, can revoke and
monitor status
 ClinVitae (Invitae curated, not updated)
 OMIM
- Gene to Phenotype documentation
- Expert curation of the literature, updated by hand
- Changes dynamically
- Small list of cited / implicated variants
 HGMD
- Commercially supported
- Best linkage of (possible) publication to
variant/genes
- Classifications not directly trusted
 Your own Lab (more later)
My Exome Case Study 1:
Hemizygous OTC Pathogenic
X:38226614 - G/A
• Novel in all Population Catalogs… except ExAC’s ~60K exomes!
X:38226614 - G/A
• Recent Addition to ClinVar:
• 2013-05-09 G/A - Untested with Disease Unspecified
• 2014-03-03 G/A – Changed to “Pathogenic with not_provided”
1 Citation:
X:38226614 - G/A
• Cited PubMed article was on ResearchGate, Hiroki Morizono contacted
• Provided full text and lots of interesting backstory on OTC
• “If you are able to eat all the steak you want, you may have the mutation; it would
appear to be a hypomorphic allele (and a very mild one at that)”
• “Is possible that the late onset case that [was] identified may have been someone
who was having a very bad day, and several things went poorly for them.”
• “The R40H mutation, there was a grandfather or granduncle who was affected who
ate whatever he wanted, and seemed unaffected while the proband had several
episodes.”
X:38226614 - G/A
• Most likely partial penetrance, with potential risk of being triggered by a shock event
• The Glycine is conserved down to Opossum (Platypus and Zebrafish have an Alanine)
Live Analysis
Questions?
The Central Dogma of Molecular Biology
“The central dogma of molecular biology deals with
the detailed residue-by-residue transfer of
sequential information. It states that such
information cannot be transferred back from protein
to either protein or nucleic acid.”
-- Francis Crick, 1958
 In other words:
- DNA is transcribed to RNA
- RNA is translated to create proteins
- Unidirectional process
 Protein is where damaging effects of a DNA
mutation will be observed
 Functional prediction algorithms are based
almost entirely on protein sequences
Image from Wikimedia Commons, Dhorspool
Transcription
 Transcription is the process by which an RNA transcript is created from DNA
within the cell nucleus before moving to the cytoplasm
 Includes splicing exons
together to create
meaningful
transcripts
 The complete collection
of mRNA transcripts in
a given cell or tissue is
often called the
“transcriptome”
Image from genome.gov
Translation
 mRNA transcripts are converted to
amino acid sequences via the
translation process
 Think of it as a different language;
nucleic acids versus amino acids
Images from genome.gov and WikiMedia Commons, by ladyofhats
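For readers from a computing background, the two steps can be sketched as string operations. This is a simplification: it works from the coding strand (so transcription is just T to U), ignores splicing, and uses the standard genetic code. The DNA sequence is invented; the single G to A change in it turns an arginine codon into a histidine codon, the same kind of R to H substitution mentioned in the case study, but it is not the actual OTC sequence.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def transcribe(coding_dna: str) -> str:
    """Coding-strand DNA to mRNA: same sequence with U in place of T."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read codons from position 0 until a stop codon; one letter per residue."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3].replace("U", "T")]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


reference = "ATGCGCGGCTAA"  # invented: Met-Arg-Gly-stop
mutant = "ATGCACGGCTAA"     # same sequence with G>A at the fifth base

print(translate(transcribe(reference)))
print(translate(transcribe(mutant)))
```

A one-base change in the DNA becomes a one-residue change in the protein, which is the level at which most functional prediction tools operate.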
Amino Acid Properties
 Amino Acids are distinguished by their
respective residues (aka side-chains
or R-groups)
 Residues are classified by polarity,
volume, hydrophobicity and other
physicochemical properties
Images from WikiMedia Commons,
by YassineMrabet and DanCojocari
Levels of Protein Structure
 Primary Structure
- Linear sequence of amino acids
 Secondary Structure
- Interaction between amino acids via hydrogen
bonding results in regular substructures called
alpha helices and beta sheets
 Tertiary Structure
- The final three-dimensional form of an amino
acid chain
- Is influenced by attractions between
secondary structures
 Quaternary Structure
- Several tertiary structures may interact to
form quaternary structures
Image from WikiMedia Commons, ladyofhats
From Structure to Function
 Proteins include various types of functional domains, binding sites and other
surface features
- This determines how the protein interacts with other molecules
 Replacing certain amino acids may have drastic effects on the protein structure
- Thereby affecting the protein function
http://www.vanderbilt.edu/vicb/DiscoveriesArchives/g_protein_receptor.html
 If we know how the protein
structure is affected by an
amino acid substitution, we
can make a good guess
about functional
consequences.
 The problem is that we
don’t know the wild-type
3D structure of most
proteins.
Using Primary Structure as Proxy for Tertiary
 83% of disease-causing mutations affect
stability of proteins (Wang and Moult, 2001)
 90% of disease-causing mutations can be
detected using structure and stability
 Many human proteins have numerous
homologs:
- Paralogs: Separated by a gene duplication
event
- Orthologs: Separated by speciation
 We don’t know the exact structure of most
proteins, but we can compare amino acid
sequences to identify domains and motifs
conserved by evolution
 Disease-causing mutations are
overrepresented at conserved sites in the
primary structure (Miller and Kumar, 2001)
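A very simple way to quantify "conserved by evolution" is to line up orthologous protein sequences and, for each position, measure how many species share the human residue. The sketch below does only that; it assumes the sequences are already aligned and of equal length, the four sequences are invented, and real predictors use far more elaborate scoring than this fraction.

```python
from typing import List, Sequence


def column_conservation(alignment: Sequence[str]) -> List[float]:
    """Fraction of the other sequences matching the first one at each column.

    alignment[0] is treated as the reference (e.g. human); all sequences
    must be pre-aligned and have the same length.
    """
    reference, others = alignment[0], alignment[1:]
    if any(len(seq) != len(reference) for seq in others):
        raise ValueError("sequences must be aligned to the same length")
    return [sum(seq[i] == residue for seq in others) / len(others)
            for i, residue in enumerate(reference)]


# Invented toy alignment: reference first, then three hypothetical orthologs.
toy_alignment = [
    "MKGLR",
    "MKGLK",
    "MRGLR",
    "MKALR",
]
scores = column_conservation(toy_alignment)
for residue, score in zip(toy_alignment[0], scores):
    print(residue, f"{score:.2f}")
```

A substitution at a column scoring 1.00 is more suspicious than one at a column where other species already tolerate different residues, which is the intuition behind using primary structure as a proxy.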