CS Lecture 2017 04-11 from Data to Precision Medicine

Next Generation Sequencing Bioinformatics:
From Data To Precision Medicine
April 11, 2017
Gabe Rudy, Vice President of
Product and Engineering

My Background
 Golden Helix
- Founded in 1998
- Genetic association software
- Clinical lab variant analysis software
- Thousands of users worldwide
- Over 1000 customer citations in journals
 Products I Build with My Team
- VarSeq
- Annotate and filter variants in gene panels, exomes and
genomes for clinical labs and researchers.
- SNP & Variation Suite (SVS)
- SNP, CNV, NGS tertiary analysis
- Import and deal with all flavors of upstream data
- GenomeBrowse (Free!)
- Visualization of everything with genomic coordinates.
All standardized file formats.

Sequencers: Versatile tools for science

Genomics is Big Data
 5,000 public data repositories
 Broad Institute:
- Process 40K samples/year
- 1000 people
- 51 High Throughput Sequencers
- 10+ PB of storage
 1 Genome in Data
- ~300GB Compressed Sequence Data
- ~150MB Compressed Variant Data
- Seq data went through 5-6 steps

Next Generation Sequencing Analysis
Primary Analysis
 Analysis of hardware generated data, software built by vendors
 Use FPGAs and GPUs to handle real-time optical or electrical signals
from sequencing hardware
Secondary Analysis
 Filtering/clipping of “reads” and their qualities
 Alignment/Assembly of reads
 Recalibrating, de-duplication, variant calling on aligned reads
Tertiary Analysis: “Sense Making”
 QA and filtering of variant calls
 Annotation (querying) variants to databases, filtering on results
 Merging/comparing multiple samples (multiple files)
 Visualization of variants in genomic context
 Statistics on matrices

BWA+GATK Best Practices Pipeline

Agenda
1 Bioinformatics 101: Pipelines and File Formats
2 Data Access Patterns: Databases or Flat Files?
3 Big Data Tables: Tricks from Data Warehousing
4 A Genomic Index: Specialized R-Trees, Bins, NC-Lists
5 Questions

Genomic Data Lives in 1D Coordinate Space

FASTQ
 Contains 3 things per read:
- Sequence identifier (unique)
- Sequence bases [len N]
- Base quality scores [len N]
 Often “gzip” compressed (fq.gz)
 If not demultiplexed, first 4 or 6bp
is the “barcode” index. Used to
split lanes out by sample.
 Filtering may include:
- Removing adapters & primers
- Clip poor quality bases at ends
- Remove flagged low-quality reads
@HWI-ST845:4:1101:16436:2254#0/1
CAAACAGGCATGCGAGGTGCCTTTGGAAAGCCCCAGGGCACTGTGGCCAG
+
Y[SQORPMPYRSNP_][_babBBBBBBBBBBBBBBBBBBBBBBBBBB
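The four-line record layout described above can be read with a few lines of Python. This is a minimal sketch, not a production parser: the function name `read_fastq` and the `offset` parameter are illustrative, and it assumes each record occupies exactly four lines (no wrapped sequences). Quality characters are converted to integers by subtracting an ASCII offset, commonly 33; the record shown on this slide appears to use the older Illumina offset of 64.

```python
import io


def read_fastq(handle, offset=33):
    """Yield (identifier, bases, qualities) from a four-line-per-record FASTQ.

    `offset` is the ASCII offset of the quality encoding (an assumption the
    caller must supply; 33 is common, older Illumina files used 64).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file (or a blank line)
        bases = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line, ignored here
        quality_text = handle.readline().rstrip("\n")
        qualities = [ord(ch) - offset for ch in quality_text]
        yield header[1:], bases, qualities  # drop the leading '@'


# Hypothetical two-base record, decoded with the older offset of 64.
example = io.StringIO("@read1\nGG\n+\nBB\n")
for name, bases, qualities in read_fastq(example, offset=64):
    print(name, bases, qualities)
```

For a compressed `fq.gz` file, the same function can be given a handle from `gzip.open(path, "rt")`.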

Aligners
 1981 Smith and Waterman
- Dynamic programming algorithm
- Finds optimal local alignment of two sequences
- Seq of length m and n, O(mn) time required
 Hashing-Based Aligners (2008)
- SOAP, Eland, MAQ
- ~14GB RAM to use with human
 Burrows Wheeler Transform Aligners (2009)
- BWA, Bowtie, SOAP2
- Order of magnitude less RAM and Time
 Hybrid Aligners (2012/13)
- RTG, BWA-MEM, Bowtie2, Isaac
- Seed and expand
- Handle longer reads (>100bp) with larger gaps
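The O(mn) cost of Smith-Waterman noted above comes from filling an (m+1) by (n+1) score matrix, one cell per pair of positions. A minimal sketch that returns only the best local alignment score, assuming a linear gap penalty and illustrative scoring values (neither comes from the slides):

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of sequences a and b.

    Scoring values are illustrative assumptions. Every cell of the
    (len(a)+1) x (len(b)+1) matrix is visited once, hence O(mn) time.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            score[i][j] = max(0, diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            best = max(best, score[i][j])
    return best


print(smith_waterman_score("GGTA", "AGGTAC"))
```

Aligning millions of reads against a 3 billion base reference this way is impractical, which is the motivation for the indexed aligners listed above.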

Backtracking – query ‘ggta’ with 1 mismatch
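A rough sketch of how backtracking over a Burrows-Wheeler index can find a query with a limited number of mismatches. This is a toy version under stated assumptions: the suffix array is built by naive sorting, occurrence counts are computed by scanning the string, and the names `build_index` and `search` are illustrative. Real aligners use compact precomputed tables and pruning heuristics.

```python
def build_index(text):
    """Return (suffix array, BWT string, C table) for text plus a '$' sentinel."""
    text += "$"
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)
    smaller, total = {}, 0
    for ch in sorted(set(text)):
        smaller[ch] = total  # number of characters strictly smaller than ch
        total += text.count(ch)
    return suffix_array, bwt, smaller


def search(index, query, max_mismatches):
    """Start positions where query matches with at most max_mismatches substitutions."""
    suffix_array, bwt, smaller = index
    hits = set()

    def extend(i, lo, hi, budget):
        if lo >= hi:
            return  # no suffix in the text starts with this string
        if i < 0:
            hits.update(suffix_array[lo:hi])
            return
        for ch in smaller:
            if ch == "$":
                continue
            cost = 0 if ch == query[i] else 1
            if cost > budget:
                continue  # backtrack: this substitution exceeds the budget
            new_lo = smaller[ch] + bwt[:lo].count(ch)
            new_hi = smaller[ch] + bwt[:hi].count(ch)
            extend(i - 1, new_lo, new_hi, budget - cost)

    extend(len(query) - 1, 0, len(bwt), max_mismatches)
    return sorted(hits)


index = build_index("aggtacgta")  # hypothetical reference string
print(search(index, "ggta", 1))
```

The query is consumed from its last character to its first, narrowing a range of the suffix array at each step; a mismatch is tried by following a different character and spending one unit of the budget.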

SAM/BAM
 Spec defined by samtools author
Heng Li, aka Li H, aka lh3.
 SAM is text version (easy for any
program to output)
 BAM is binary/compressed version
with indexing support
 Alignment described by a code (CIGAR) of
matches, insertions, deletions,
gaps and clipping
 Can have any custom flags set by
analysis program (and many do)
Key Fields
 Chr, position
 Mapping quality
 CIGAR
 Name/position of mate
 Total template length
 Sequence
 Quality
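The CIGAR field can be decoded with a small amount of code. The sketch below (helper names are illustrative) splits a CIGAR string into operations and uses the SAM convention that M, D, N, = and X consume reference bases while M, I, S, = and X consume read bases:

```python
import re

CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")
CONSUMES_READ = set("MIS=X")


def parse_cigar(cigar):
    """Split a CIGAR string such as '5S40M2I8M' into (length, operation) pairs."""
    return [(int(length), op) for length, op in CIGAR_TOKEN.findall(cigar)]


def reference_end(position, cigar):
    """Last reference base (1-based, inclusive) covered by the alignment."""
    span = sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)
    return position + span - 1


def read_length(cigar):
    """Number of read bases implied by the CIGAR (soft clips included)."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_READ)


# Hypothetical alignment starting at position 100.
print(reference_end(100, "5S40M2I8M3D45M"), read_length("5S40M2I8M3D45M"))
```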

Variant Callers
 Samtools
- “mpileup” command computes BAQ, performs local realignment
- Many filters can be applied to get high-quality variants
 GATK
- More than just a variant caller, but UnifiedGenotyper is widely used
- Also provides pre-calling tools like local InDel realignment and quality
score recalibration
 FreeBayes
 Custom tools specific to platform:
- CASAVA includes a variant caller for Illumina whole-genome data
- Ion Torrent has a caller that handles InDels better for their tech
 Commercial:
- Real Time Genomics
- Arpeggi

VCF
 Specification defined by the 1000 genomes
group (now v4.1)
 Commonly compressed and indexed with
bgzip/tabix (allows for reading directly by a
Genome Browser)
 Contains arbitrary data per “site” (INFO
fields) and per sample
 Single-Sample VCF:
- Contains only the variants for the sample.
 Multi-Sample VCF:
- Whenever one sample has a variant, all samples get
a “genotype” (often “ref”)
 Caveat:
- VCF requires a reference base to be specified, so
insertions are “encoded” 1bp differently than they
are annotated
- Various opinions on how to encode CNV/SV
- gVCF is a VCF file with lines for tracts of ref match
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Sa
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth"
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequen
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral All
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membershi
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 members
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have d
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Q
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype
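The insertion caveat above can be made concrete. VCF writes an insertion with the preceding reference base as an anchor, so the recorded position is one base before the inserted sequence. A minimal sketch (the function name and the "-" placeholder for an empty allele are assumptions, and annotation sources differ in their exact conventions) that strips shared leading bases to get the annotation-style form:

```python
def strip_anchor_base(position, ref, alt):
    """Remove bases shared at the start of REF and ALT, shifting the position.

    Returns (position, ref, alt) with '-' standing in for an empty allele.
    """
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        position += 1
    return position, ref or "-", alt or "-"


# Hypothetical records: an insertion, a deletion and a plain SNV.
print(strip_anchor_base(100, "A", "AGT"))
print(strip_anchor_base(100, "ATG", "A"))
print(strip_anchor_base(100, "G", "A"))
```

Matching a VCF record against an annotation database only works when both sides use the same representation, so a step like this usually runs before any lookup.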

Visualization
 Genome browsers:
- Validate variant calls
- Look at gene annotations,
problematic regions, population
catalogs
- Compare samples where no
variant called
 Free Genome Browsers:
- IGV
- Popular desktop by Broad
- UCSC
- Web-based, most extensive
annotations
- GenomeBrowse
- Designed to be publication ready
- Smooth zoom and navigation

Sample Variant Analysis Workflow
Filter out common and
low-quality variants
Filter by inheritance
or zygosity state
Reduce to non-
synonymous
Prioritize
Remaining
Variants
?
VCF file goes in
 Many NGS tertiary analysis
workflows follow a system of
annotation-based filtering
 Common to have a long list of
candidate variants
 Variants need to be prioritized
for validation experiments
 Prioritizing those candidates
is extremely important, but can
be a very difficult process
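The filter chain on this slide can be sketched as a single pass over annotated variants. Everything specific here is an assumption for illustration: the dictionary keys (`pop_af`, `quality`, `zygosity`, `effect`), the thresholds and the set of effect labels are hypothetical, not the fields of any particular tool.

```python
NON_SYNONYMOUS = {"missense", "nonsense", "frameshift", "splice_site"}


def candidate_variants(variants, max_pop_af=0.01, min_quality=30, zygosity=None):
    """Apply the slide's filters in order and return the surviving variants."""
    kept = []
    for variant in variants:
        if variant["pop_af"] > max_pop_af:
            continue  # common in population catalogs
        if variant["quality"] < min_quality:
            continue  # low-quality call
        if zygosity is not None and variant["zygosity"] != zygosity:
            continue  # does not fit the expected inheritance pattern
        if variant["effect"] not in NON_SYNONYMOUS:
            continue  # synonymous or non-coding
        kept.append(variant)
    return kept


# Hypothetical annotated variants.
variants = [
    {"id": "v1", "pop_af": 0.20, "quality": 99, "zygosity": "het", "effect": "missense"},
    {"id": "v2", "pop_af": 0.00, "quality": 12, "zygosity": "hom", "effect": "missense"},
    {"id": "v3", "pop_af": 0.00, "quality": 80, "zygosity": "hom", "effect": "synonymous"},
    {"id": "v4", "pop_af": 0.001, "quality": 75, "zygosity": "hom", "effect": "nonsense"},
]
print([v["id"] for v in candidate_variants(variants, zygosity="hom")])
```

What remains after these filters is the candidate list that still has to be prioritized.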

Annotation with Public Data
 Public Annotation Data and Tools
- Most produced through academic research or consortia
- Centralized hosting on NCBI, Ensembl, UCSC
 Important categories:
- Population catalogs: how common is a variant?
- Gene/Transcript: is a variant in a gene and how does it change the gene?
- In-silico predictions: How likely is a variant to impair the gene's function?
- Knowledgebase: What do we know about particular variants/genes in human diseases?

Population Catalogs
 1000 Genomes (WGS, Exome, SNP Array)
- Many releases, most recent now
standardized, still incrementally updated
- 2,500 genomes – Phase3
 “ESP” (NHLBI 6,500 Exomes) (a.k.a EVS)
- Had many releases, now V2-SSA137 0.0.30
- European American / African American only
 ExAC (Broad 61,486 Exomes v0.3)
- Many sub-populations
 Supercentenarians (110+ yo, 17 WGS)
- Available as raw Complete Genomics data
- Requires normalizing to match Illumina NGS

InSilico Predictions
 Non-synonymous functional
predictions
- SIFT, Polyphen2, LRT, MutationTaster,
MutationAssessor, FATHMM
 Conservation
- GERP++, PhyloP, phastCons
 All-In-One Scores
- CADD, VAAST, VEST3, DANN, FATHMM-MKL,
MetaSVM and MetaLR
- Use machine learning, “feature selection”,
train and predict on public databases
- Can predict synonymous and intergenic variants
 dbNSFP 3.0 – 82M precomputed scores
- N of 6 Voting on prediction algorithms
 RNA Splicing Effect (dbscSNV)
- 5+ splice algorithms, can pre-compute
- −3 to +8 at the 5’, −12 to +2 at the 3’
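The "N of 6" voting idea can be sketched as a simple tally over per-algorithm calls. The "D" (damaging) and "T" (tolerated) labels and the dictionary layout are assumptions for illustration, not the actual dbNSFP column format; algorithms with no prediction for a variant are left out of the denominator.

```python
def damaging_votes(predictions):
    """Return (damaging votes, algorithms that made a call) for one variant.

    `predictions` maps algorithm name to 'D', 'T' or None (no prediction).
    """
    called = {name: call for name, call in predictions.items() if call is not None}
    votes = sum(1 for call in called.values() if call == "D")
    return votes, len(called)


# Hypothetical calls for a single missense variant.
example = {
    "SIFT": "D",
    "Polyphen2": "D",
    "LRT": "T",
    "MutationTaster": "D",
    "MutationAssessor": None,
    "FATHMM": "T",
}
votes, called = damaging_votes(example)
print(f"{votes} of {called} algorithms predict damaging")
```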

Disease Knowledgebases
 ClinVar
- Voluntary submissions from labs
- Use 5-tier classification (variant + phenotype
pairs)
- Star-rating of variants
- Lab owns submission, can revoke and
monitor status
 ClinVitae (Invitae curated, not updated)
 OMIM
- Gene to Phenotype documentation
- Expert curation of literature, hand updated
- Changes dynamically
- Small list of cited / implicated variants
 HGMD
- Commercially supported
- Best linkage of (possible) publication to
variant/genes
- Classifications not directly trusted
 Your own Lab (more later)

My Exome Case Study 1:
Hemizygous OTC Pathogenic

X:38226614 - G/A
• Novel in all Population Catalogs… except ExAC’s ~60K exomes!

X:38226614 - G/A
• Recent Addition to ClinVar:
• 2013-05-09 G/A - Untested with Disease Unspecified
• 2014-03-03 G/A – Changed to “Pathogenic with not_provided”
1 Citation:

X:38226614 - G/A
• Cited PubMed article was on ResearchGate, Hiroki Morizono contacted
• Provided full text and lots of interesting backstory on OTC
• “If you are able to eat all the steak you want, you may have the mutation; it would
appear to be a hypomorphic allele (and a very mild one at that)”
• “Is possible that the late onset case that [was] identified may have been someone
who was having a very bad day, and several things went poorly for them.”
• “The R40H mutation, there was a grandfather or granduncle who was affected who
ate whatever he wanted, and seemed unaffected while the proband had several
episodes.”

X:38226614 - G/A
• Most likely partial penetrance, with potential risk of triggering with shock event
• The Glycine is conserved down to Opossum (Platypus, Zebrafish have an Alanine)

The Central Dogma of Molecular Biology
“The central dogma of molecular biology deals with
the detailed residue-by-residue transfer of
sequential information. It states that such
information cannot be transferred back from protein
to either protein or nucleic acid.”
-- Francis Crick, 1958
 In other words:
- DNA is transcribed to RNA
- RNA is translated to create proteins
- Unidirectional process
 Protein is where damaging effects of a DNA
mutation will be observed
 Functional prediction algorithms are based
almost entirely on protein sequences
Image from Wikimedia Commons, Dhorspool

Transcription
 Transcription is the process by which an RNA transcript is created from DNA
within the cell nucleus before moving to the cytoplasm
 Includes splicing exons
together to create
meaningful
transcripts
 The complete collection
of mRNA transcripts in
a given cell or tissue is
often called the
“transcriptome”
Image from genome.gov

Translation
 mRNA transcripts are converted to
amino acid sequences via the
translation process
 Think of it as a different language;
nucleic acids versus amino acids
Images from genome.gov and WikiMedia Commons, by ladyofhats
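The "different language" idea can be shown directly: the standard genetic code maps each three-base codon to one amino acid. A minimal sketch that builds the standard codon table from its conventional compact string form and translates an mRNA until the first stop codon. The example sequences are hypothetical and chosen only to show how a single base change alters one amino acid.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def translate(mrna):
    """Translate an mRNA (or coding DNA) string, stopping at the first stop codon."""
    dna = mrna.upper().replace("U", "T")
    protein = []
    for start in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[start:start + 3]]
        if amino_acid == "*":
            break  # stop codon
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUGCGUUAA"))  # start codon, CGU (Arg), stop
print(translate("AUGCAUUAA"))  # same with G changed to A: CAU (His)
```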

Amino Acid Properties
 Amino Acids are distinguished by their
respective residues (aka side-chains
or R-groups)
 Residues are classified by polarity,
volume, hydrophobic and other
physicochemical properties
Images from WikiMedia Commons,
by YassineMrabet and DanCojocari

Levels of Protein Structure
 Primary Structure
- Linear sequence of amino acids
 Secondary Structure
- Interaction between amino acids via hydrogen
bonding results in regular substructures called
alpha helices and beta sheets
 Tertiary Structure
- The final three-dimensional form of an amino
acid chain
- Is influenced by attractions between
secondary structures
 Quaternary Structure
- Several tertiary structures may interact to
form quaternary structures
Image from WikiMedia Commons, ladyofhats

From Structure to Function
 Proteins include various types of functional domains, binding sites and other
surface features
- This determines how the protein interacts with other molecules
 Replacing certain amino acids may have drastic effects on the protein structure
- Thereby affecting the protein function
http://www.vanderbilt.edu/vicb/DiscoveriesArchives/g_protein_receptor.html
 If we know how the protein
structure is affected by an
amino acid substitution, we
can make a good guess
about functional
consequences.
 The problem is that we
don’t know the wild-type
3D structure of most
proteins.

Using Primary Structure as Proxy for Tertiary
 83% of disease-causing mutations affect
stability of proteins (Wang and Moult, 2001)
 90% of disease-causing mutations can be
detected using structure and stability
 Many human proteins have numerous
homologs:
- Paralogs: Separated by a gene duplication
event
- Orthologs: Separated by speciation
 Don’t know the exact structure of most
proteins, but we can compare amino acid
sequences to identify domains and motifs
conserved by evolution
 Disease causing mutations are
overrepresented at conserved sites in the
primary structure (Miller and Kumar, 2001)
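Comparing homologous sequences to find conserved sites can be sketched as a per-column score over an existing alignment. This is a deliberately simple measure (the fraction of sequences sharing the most common residue), not the method used by GERP++, PhyloP or phastCons, and the aligned sequences below are made up for illustration. Gap characters are counted like any other residue.

```python
from collections import Counter


def column_conservation(alignment):
    """Fraction of sequences sharing the most common residue at each column.

    `alignment` is a list of equal-length, already aligned sequences.
    """
    length = len(alignment[0])
    if any(len(sequence) != length for sequence in alignment):
        raise ValueError("aligned sequences must all have the same length")
    scores = []
    for column in zip(*alignment):
        _, count = Counter(column).most_common(1)[0]
        scores.append(count / len(column))
    return scores


# Hypothetical aligned protein fragments from three species.
orthologs = ["MKGA", "MKGS", "MRGA"]
for position, score in enumerate(column_conservation(orthologs), start=1):
    print(position, round(score, 2))
```

A substitution at a column scoring 1.0 changes a residue that every compared species has kept, which is the kind of site where disease-causing mutations are overrepresented.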
