Whole exome sequencing (WES|WXS) and its data analysis
Feb 28, 2023
Haibo Liu
Senior Bioinformatician
UMass Medical School, Worcester, MA
Email: haibol2017@gmail.com
Eukaryotic Exome
The human exome contains about 180,000 exons. These constitute about 1% of
the human genome (roughly 30 Mb).
Exome sequencing
• An NGS method that selectively sequences the protein-coding regions (exons) of the genome
• Provides a cost-effective alternative to WGS
• Produces a smaller, more manageable data set for faster, easier data analysis (4–5 Gb WES vs ~90 Gb WGS)
• Identifies both somatic and germline variants
  • Single nucleotide polymorphisms (SNPs)
  • Small insertions and deletions (indels)
  • Loss of heterozygosity (LOH)
  • Copy number variants (CNVs) and structural variants (SVs)
  • Microsatellite instability (MSI) status
Performance of WES in clinical studies
Workflow of WES
Genotyping by Microarray, WES, and WGS
(cost figures not up to date; data analysis cost not included)
Experimental design of WES
• Tissue sampling
  • Somatic mutations
    • Tumor (tumor purity and freshness are critical)
    • Normal tissue or blood sample
  • Germline mutations
    • Blood or any other tissue
• Sample size and sample population
  • Cohort (disease vs. healthy)
  • Trio or related family (non-carrier, carrier, and patient)
• Capture methods
• Sequencing strategies
  • Platform, paired-end or single-end (PE|SE), UMI, read length, sequencing depth
Rescuing DNA preparation from FFPE samples
Exome capture: Target-enrichment strategies
• Array-based capture
https://en.wikipedia.org/wiki/Exome_sequencing
Capture toolkits
• Twist Exome 2.0 (Twist Bioscience)
• Nextera Rapid Capture Exomes (Illumina)
• xGen WES (IDT)
• SureSelect (Agilent)
• KAPA HyperExome (Roche)
• SeqCap (NimbleGen)
• …
UMI for detecting low-frequency mutations in prenatal or cancer research
The Cell3™ Target library preparation behind our whole exome enrichment incorporates error suppression
technology. This includes unique molecular indexes (UMIs) and unique dual indexes (UDIs), to remove both
PCR and sequencing errors and index hopping events. This error suppression technique, combined with our
excellent uniformity of coverage, allows you to confidently and accurately call mutations down to 0.1% VAF
and enables generation of sequencing libraries from as little as 1 ng cfDNA input.
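As a rough illustration of how UMIs can suppress PCR and sequencing errors, the sketch below groups reads by UMI and alignment start and takes a per-position majority vote within each family. The read tuples and the function `umi_consensus` are hypothetical and not part of any vendor kit; real tools also account for errors in the UMI itself, base qualities, and paired-end reads.

```python
from collections import Counter, defaultdict

def umi_consensus(reads, min_family_size=2):
    """Collapse reads sharing (UMI, start) into one consensus sequence.

    reads: iterable of (umi, start, sequence) tuples. Sequences within a
    family are assumed to have equal length (a simplification).
    """
    families = defaultdict(list)
    for umi, start, seq in reads:
        families[(umi, start)].append(seq)

    consensus = {}
    for key, seqs in families.items():
        if len(seqs) < min_family_size:
            continue  # too few reads to out-vote an error
        # Majority vote at each position of the family
        consensus[key] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*seqs)
        )
    return consensus

reads = [
    ("AACG", 100, "ACGTAC"),
    ("AACG", 100, "ACGTAC"),
    ("AACG", 100, "ACTTAC"),  # one read carries a likely PCR/sequencing error
    ("GGTA", 100, "ACGTAC"),  # singleton family, dropped
]
print(umi_consensus(reads))
```

The minimum family size is an illustrative threshold; a larger value gives stronger error suppression at the cost of discarding more reads.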
Comparison of different library preparation methods
Sequencing depth
Quality control in WES
Raw data QC → BAM QC → Variant QC
Raw data QC
• QC tools
  • FastQC/MultiQC
  • NGS QC Toolkit (https://github.com/mjain-lab/NGSQCToolkit)
  • QC-Chain (contamination detection)
  • PRINSEQ
  • QC3
• Important QC metrics
  • Base quality
  • Nucleotide distribution along cycles
  • GC content distribution
  • Duplication rate
  • Adaptor content
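Two of the raw-data metrics listed above, base quality along cycles and GC content, can be computed directly from a FASTQ file. The following is a minimal sketch assuming uncompressed, four-line FASTQ records with Phred+33 qualities; `fastq_records` and `raw_qc` are illustrative names, not functions of FastQC or any other tool named here.

```python
import io

def fastq_records(handle):
    """Yield (sequence, quality string) from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline()
        if not header:
            break
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        qual = handle.readline().strip()
        yield seq, qual

def raw_qc(handle, phred_offset=33):
    """Return (mean base quality per cycle, GC fraction per read)."""
    qual_sums, counts, gc_fractions = [], [], []
    for seq, qual in fastq_records(handle):
        for i, ch in enumerate(qual):
            if i == len(qual_sums):  # first read reaching this cycle
                qual_sums.append(0)
                counts.append(0)
            qual_sums[i] += ord(ch) - phred_offset
            counts[i] += 1
        if seq:
            gc_fractions.append(sum(b in "GC" for b in seq.upper()) / len(seq))
    mean_q = [s / n for s, n in zip(qual_sums, counts)]
    return mean_q, gc_fractions

demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nII5!\n")
mean_q, gc = raw_qc(demo)
print(mean_q, gc)
```

A falling mean quality toward later cycles is the pattern that usually motivates 3' quality trimming.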
Read trimming
• Trimmomatic, cutadapt, fastp (auto adaptor detection), …
• Quality/adaptor trimming
• Don’t trim the 5’ end: duplicate marking (MarkDuplicates) relies on the 5’ alignment positions of reads
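A minimal sketch of 3'-only quality trimming, which leaves the 5' end of each read untouched. The function name `trim_3prime` and the Q20 threshold are assumptions made for illustration; the trimmers named above implement more elaborate schemes such as sliding windows and adaptor matching.

```python
def trim_3prime(seq, qual, min_q=20, phred_offset=33):
    """Remove trailing bases below min_q; the 5' end is never modified."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - phred_offset < min_q:
        end -= 1
    return seq[:end], qual[:end]

# Low-quality tail is removed
print(trim_3prime("ACGTAC", "IIII#!"))
# A low-quality base at the 5' end is kept
print(trim_3prime("ACGTAC", "#IIIII"))
```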
From raw fastq to analysis-ready BAM
Aligner
• BWA-MEM
• Bowtie2, Novoalign, GMAP
Selection of reference genomes
• Completeness
• Decoy-augmented genome (1000 Genomes analysis pipeline)
  • EBV (human herpesvirus 4 type 1, accession NC_007605) and decoy sequences derived from HuRef, human BAC and fosmid clones, and NA12878 (~36 Mb)
• T2T-CHM13v1.1, a complete, telomere-to-telomere human reference genome
BAM QC
• Important QC metrics
  • % of reads that map to the reference
  • % of reads that map to the baits
  • Coverage depth distribution (target regions)
  • Coverage unevenness and Cohort Coverage Sparseness
  • Insert size distribution
  • Duplicate rate
• Tools
  • Alfred
  • QC3
  • Picard metrics collection tools (e.g., CollectHsMetrics)
  • covReport
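To make the on-target and coverage metrics concrete, here is a small sketch that works on plain interval tuples instead of a BAM file. The function `bam_qc`, its input layout, and the depth threshold are assumptions for illustration; the per-base dictionary is only practical for toy inputs, and the tools listed above should be used on real data.

```python
def bam_qc(reads, targets, min_depth=20):
    """reads: list of (chrom, start, end); targets: dict chrom -> list of
    (start, end). All intervals are 0-based and half-open."""
    # Per-base depth, tracked for target bases only
    depth = {(chrom, pos): 0
             for chrom, intervals in targets.items()
             for start, end in intervals
             for pos in range(start, end)}
    on_target = 0
    for chrom, start, end in reads:
        hit = False
        for pos in range(start, end):
            if (chrom, pos) in depth:
                depth[(chrom, pos)] += 1
                hit = True
        on_target += hit  # a read counts once if it touches any target base
    depths = list(depth.values())
    return {
        "on_target_rate": on_target / len(reads),
        "mean_target_depth": sum(depths) / len(depths),
        "frac_target_ge_min_depth": sum(d >= min_depth for d in depths) / len(depths),
    }

targets = {"chr1": [(100, 110)]}
reads = [("chr1", 95, 105), ("chr1", 100, 110),
         ("chr1", 500, 510), ("chr2", 100, 110)]
metrics = bam_qc(reads, targets, min_depth=2)
print(metrics)
```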
Cohort Coverage Sparseness (CCS) and Unevenness (UE) scores for a detailed assessment of the distribution of sequence-read coverage
https://www.nature.com/articles/s41598-017-01005-x
Local and global non-uniformity of different capture toolkits
Differences among capture toolkits
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4092227/
Exome probe design is one of the major culprits
• Most of the observed bias in modern WES stems from mappability limitations of short reads and exome probe design rather than sequence composition.
https://www.nature.com/articles/s41598-020-59026-y
Alfred QC metrics
https://academic.oup.com/bioinformatics/article/35/14/2489/5232224
Alignment Metric                DNA-Seq (WGS)  DNA-Seq (Capture)  RNA-Seq  ChIP-Seq/ATAC-Seq  Chart Type
Mapping Statistics              ✔              ✔                  ✔        ✔                  Table
Duplicate Statistics            ✔              ✔                  ✔        ✔                  Table
Sequencing Error Rates          ✔              ✔                  ✔        ✔                  Table
Base Content Distribution       ✔              ✔                  ✔        ✔                  Grouped Line Chart
Read Length Distribution        ✔              ✔                  ✔        ✔                  Line Chart
Base Quality Distribution       ✔              ✔                  ✔        ✔                  Line Chart
Coverage Histogram              ✔              ✔                  ✔        ✔                  Line Chart
Insert Size Distribution        ✔              ✔                  ✔        ✔                  Grouped Line Chart
InDel Size Distribution         ✔              ✔                  ✔        ✔                  Grouped Line Chart
InDel Context                   ✔              ✔                  ✔        ✔                  Bar Chart
GC Content                      ✔              ✔                  ✔        ✔                  Grouped Line Chart
On-Target Rate                  -              ✔                  -        -                  Line Chart
Target Coverage Distribution    -              ✔                  -        -                  Line Chart
TSS Enrichment                  -              -                  -        ✔                  Table
DNA pitch / Nucleosome pattern  -              -                  -        ✔                  Grouped Line Chart
https://www.gear-genomics.com/docs/alfred/webapp/#features
CovReport
From BAM to VCF
GATK:
• Pad (slop) exon intervals by 200 bp
• Run the analysis separately for each chromosome
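The interval preparation described above (padding exons by 200 bp, then working chromosome by chromosome) might look like the following sketch. `pad_and_merge` is a hypothetical helper; it does not clip intervals at chromosome ends, which a real pipeline would do using the reference sequence lengths.

```python
from collections import defaultdict

def pad_and_merge(exons, pad=200):
    """exons: iterable of (chrom, start, end), 0-based half-open.
    Returns dict chrom -> sorted list of padded, merged intervals."""
    by_chrom = defaultdict(list)
    for chrom, start, end in exons:
        by_chrom[chrom].append((max(0, start - pad), end + pad))

    merged = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        out = [intervals[0]]
        for start, end in intervals[1:]:
            last_start, last_end = out[-1]
            if start <= last_end:  # overlapping or adjacent after padding
                out[-1] = (last_start, max(last_end, end))
            else:
                out.append((start, end))
        merged[chrom] = out
    return merged

exons = [("chr1", 1000, 1100), ("chr1", 1300, 1400),
         ("chr1", 5000, 5100), ("chr2", 100, 200)]
per_chrom = pad_and_merge(exons)
for chrom, intervals in per_chrom.items():
    print(chrom, intervals)  # each chromosome can be processed as its own job
```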
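Variant callers
• Somatic short variants: GATK (Mutect2)
• Germline short variants: GATK (HaplotypeCaller), FreeBayes/SAMtools, DeepVariant
• Structural variants, CNVs, and larger indels: BreakSeq, LUMPY, Hydra, DELLY, CNVnator, Pindel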
GATK Best Practices for population-based germline variant calling
GATK Mutect2 Best Practices for population-based somatic variant calling
Discrepancy of variants called by different callers
Integrated variant calling
• Integration of multiple tools’ results
• ISMA (integrative somatic mutation analysis)
• Ensemble machine learning methods
• BAYSIC
• SomaticSeq
• NeoMutate
• SMuRF
(Bartha and Győrffy, 2019)
(Nanni et al., 2019)
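The simplest form of integrating several callers is a vote: keep a variant when at least a given number of callers report it. The sketch below shows only that idea; it is not how BAYSIC, SomaticSeq, NeoMutate, or SMuRF work internally, and the caller names and the function `consensus_calls` are placeholders.

```python
from collections import Counter

def consensus_calls(callsets, min_callers=2):
    """callsets: dict caller name -> set of (chrom, pos, ref, alt).
    Returns variants reported by at least min_callers callers."""
    votes = Counter()
    for variants in callsets.values():
        votes.update(set(variants))  # one vote per caller per variant
    return {variant for variant, n in votes.items() if n >= min_callers}

callsets = {
    "caller_a": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")},
    "caller_b": {("chr1", 100, "A", "G"), ("chr2", 50, "G", "A")},
    "caller_c": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")},
}
print(sorted(consensus_calls(callsets)))
```

Raising `min_callers` trades sensitivity for precision; variant representation must be normalized across callers before comparing tuples like these.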
Sample-level Variant QC
• Tools
  • GATK CollectVariantCallingMetrics, VCFtools, PLINK/Seq, QC3
• Important QC metrics
  • Ti/Tv ratio, nonsynonymous/synonymous ratio, heterozygous/nonreference-homozygous (het/nonref-hom) ratio, mean depth
  • Genotype missing rate
  • Genotype concordance to related data (different platforms)
  • Cross-sample DNA contamination (VerifyBamID)
  • Identity-by-descent (IBD) analysis (PLINK): related samples
  • PCA (EIGENSTRAT): population stratification (ancestry)
  • Sex check (PLINK)
Ti/Tv ratio and het/nonref-hom ratio
• The Ti/Tv ratio varies greatly by genome region and functionality, but not by ancestry.
• The het/nonref-hom ratio varies greatly by ancestry, but not by genome region or functionality.
• Extreme guanine + cytosine content (either high or low) is negatively associated with the magnitude of the Ti/Tv ratio.
• When performing QC assessment using these two measures, care must be taken to apply the correct thresholds based on ancestry and genome region.
• A Ti/Tv ratio that is too low suggests a high false-positive rate; one that is too high suggests bias.
https://academic.oup.com/bioinformatics/article/31/3/318/2366248
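Both ratios are simple counts over one sample's calls. A minimal sketch, assuming biallelic SNVs given as (ref, alt) pairs and VCF-style GT strings; the helper names are illustrative and no QC thresholds are implied.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T)
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def titv_ratio(snvs):
    """snvs: list of (ref, alt) single-base substitutions."""
    ti = sum((ref, alt) in TRANSITIONS for ref, alt in snvs)
    tv = len(snvs) - ti
    return ti / tv if tv else float("inf")

def het_nonref_hom_ratio(genotypes):
    """genotypes: GT strings for one sample at biallelic sites."""
    het = hom_alt = 0
    for gt in genotypes:
        alleles = sorted(gt.replace("|", "/").split("/"))
        if alleles == ["1", "1"]:
            hom_alt += 1
        elif alleles == ["0", "1"]:
            het += 1
        # hom-ref ("0/0") and missing ("./.") calls are ignored
    return het / hom_alt if hom_alt else float("inf")

snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
genotypes = ["0/1", "1/1", "0|1", "0/0", "1|0", "./."]
print(titv_ratio(snvs), het_nonref_hom_ratio(genotypes))
```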
Example report
Potential error sources in the next-generation sequencing workflow
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1659-6
Origin of variant artifacts
• Artifacts introduced by sample/library preparation
• Low-quality base calls (read-end artifacts and other low-quality bases)
• Alignment artifacts
  • Local misalignment near indels
  • Erroneous alignments in low-complexity regions
  • Paralogous alignments of reads not well represented in the reference
• Strand orientation bias artifacts (Strand Orientation Bias Detector (SOBDetector), Fisher score): https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-13-666
Artifacts (https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-020-00791-w)
[Figure panels: low base quality, read end, strand bias, low-complexity misalignment, paralog misalignment]
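Strand bias is commonly scored with Fisher's exact test on a 2x2 table of reference/alternate read counts by strand, reported on a Phred scale. The sketch below is a plain two-sided test written with the standard library; it is a simplified stand-in and does not reproduce the exact computation of GATK's FisherStrand annotation or of SOBDetector.

```python
from math import comb, log10

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as observed
    p = sum(prob(x) for x in range(lo, hi + 1)
            if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)

def phred_strand_bias(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Phred-scaled p-value; larger values mean stronger strand bias."""
    p = fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev)
    return -10 * log10(max(p, 1e-300))

print(phred_strand_bias(10, 10, 5, 5))   # alt reads split evenly by strand
print(phred_strand_bias(20, 20, 10, 0))  # alt reads on one strand only
```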
Variant-level QC
• Important QC metrics
  • Genotype missing rate
  • Hardy-Weinberg equilibrium p-value (use with caution)
  • Mendelian error rate
  • Allele balance of heterozygous calls
• Variant quality score (GATK): filter SNPs and indels separately (https://gatk.broadinstitute.org/hc/en-us/articles/360035890471-Hard-filtering-germline-short-variants)
  • Hard filter
    • QualByDepth (QD)
    • FisherStrand (FS)
    • StrandOddsRatio (SOR)
    • RMSMappingQuality (MQ)
    • MappingQualityRankSumTest (MQRankSum)
    • ReadPosRankSumTest (ReadPosRankSum)
  • Machine learning-based filtering: Variant Quality Score Recalibration (VQSR)
Example SNP hard filter for GATK VariantFiltration (GATK3 argument names; GATK4 spells them --filter-expression and --filter-name):
--filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"
--filterName "my_snp_filter"
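The logic of the SNP filter expression above can be mirrored in a few lines, which helps when checking why a given record was flagged. The thresholds are copied from that expression; the function names and the choice to skip missing annotations are assumptions of this sketch and may not match how GATK treats missing values.

```python
# One predicate per annotation: True means the variant fails on it
SNP_FILTERS = {
    "QD": lambda v: v < 2.0,
    "FS": lambda v: v > 60.0,
    "MQ": lambda v: v < 40.0,
    "MQRankSum": lambda v: v < -12.5,
    "ReadPosRankSum": lambda v: v < -8.0,
}

def failed_annotations(info):
    """info: dict of INFO annotations for one SNP record.
    Annotations absent from the record are skipped (an assumption)."""
    return [name for name, fails in SNP_FILTERS.items()
            if name in info and fails(info[name])]

def filter_status(info, filter_name="my_snp_filter"):
    """Return the filter name if any criterion fails, else PASS."""
    return filter_name if failed_annotations(info) else "PASS"

good = {"QD": 15.2, "FS": 1.3, "MQ": 60.0, "MQRankSum": 0.1, "ReadPosRankSum": 0.4}
bad = {"QD": 1.1, "FS": 75.0, "MQ": 60.0}
print(filter_status(good), filter_status(bad), failed_annotations(bad))
```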
Variant-level filtering
• Tools
  • GATK, VCFtools, PLINK/Seq
• Sequencing data-based filtering
  • Exclude potential artifacts
• Database-based filtering
  • Exclude known variants that are present in public SNP databases, published studies, or in-house databases, on the assumption that common variants represent harmless variation
• Pedigree-based filtering
  • Each generation introduces up to 4.5 deleterious mutations, so a de novo mutation may well be causing the disease
• Function-based filtering
• Caution: filtering risks removing the pathogenic variant
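Database-based and pedigree-based filtering can be combined as two predicates over annotated variants. In this sketch the record layout (`pop_af`, `genotypes`, the trio member keys) and the 1% frequency cutoff are hypothetical choices for illustration, not fields or defaults of any tool listed above.

```python
def is_rare(variant, max_af=0.01):
    """Database-based filter: keep variants absent from, or rare in,
    a population database."""
    af = variant.get("pop_af")
    return af is None or af <= max_af

def is_de_novo(variant):
    """Pedigree-based filter: the child carries the alternate allele and
    both parents are homozygous reference."""
    gts = variant["genotypes"]
    return (gts["child"] in ("0/1", "1/1")
            and gts["father"] == "0/0"
            and gts["mother"] == "0/0")

variants = [
    {"id": "v1", "pop_af": 0.25,
     "genotypes": {"child": "0/1", "father": "0/1", "mother": "0/0"}},
    {"id": "v2", "pop_af": None,
     "genotypes": {"child": "0/1", "father": "0/0", "mother": "0/0"}},
    {"id": "v3", "pop_af": 0.0001,
     "genotypes": {"child": "0/1", "father": "0/0", "mother": "0/1"}},
]
candidates = [v["id"] for v in variants if is_rare(v) and is_de_novo(v)]
print(candidates)
```

Because each filter can discard the causal variant, it is worth recording which filter removed each record so the steps can be relaxed later.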
Allelic balance
https://www.cureffi.org/2012/09/19/exome-sequencing-pipeline-using-gatk/
• slivar: filtering on genotype quality, sequencing depth, allele balance, and population allele frequency: https://github.com/brentp/slivar
https://onlinelibrary.wiley.com/doi/full/10.1002/humu.23674
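Allele balance for a heterozygous call is the fraction of reads supporting the alternate allele, which should sit near 0.5 for a true germline heterozygote. A minimal sketch using the per-allele depths from a VCF AD field; the depth and balance thresholds are illustrative assumptions, not slivar defaults.

```python
def allele_balance(ad):
    """ad: (ref_depth, alt_depth) for a biallelic site."""
    ref, alt = ad
    total = ref + alt
    return alt / total if total else None

def het_call_ok(ad, min_depth=10, ab_range=(0.2, 0.8)):
    """Accept a heterozygous call only with enough depth and a
    plausible allele balance."""
    ab = allele_balance(ad)
    if ab is None or sum(ad) < min_depth:
        return False
    return ab_range[0] <= ab <= ab_range[1]

print(allele_balance((15, 15)))  # balanced heterozygote
print(het_call_ok((28, 2)))      # skewed toward the reference allele
print(het_call_ok((12, 8)))      # acceptable
```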
Variant annotation tools
• VAT: annotation of variants by functionality in a cloud computing environment
Variant annotation databases
Functional predictors/Prioritization tools (Hintzsche et al., 2016)
• SnpSift: http://pcingola.github.io/SnpEff/ss_introduction/
• VAAST: https://github.com/Yandell-Lab/VVP-pub
• VarSifter, VarSight
• gNome, KGGSeq
Latest, advanced AI tool for inferring the effect of missense mutations: AlphaMissense
(Cheng et al. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science, 19 Sep 2023, Vol 381, Issue 6664, DOI: 10.1126/science.adg7492)
(Hintzsche et al., 2016)
Tools and resources for linking variants to therapeutics
Variant visualization tools
• VIVA, vcfR
• OncoPrint
OncoPrint for visualizing cohort variants
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6895801/
Beyond variants
Summary
WES and its data analysis
WES data analysis pipelines
• DRAGEN (Illumina)
  • https://www.illumina.com/products/by-type/informatics-products/basespace-sequence-hub/apps/dragen-enrichment.html
• JWES
• Sentieon
  • A high-performance commercial solution (https://www.sentieon.com/products/)
  • Improves upon BWA, STAR, Minimap2, GATK, HaplotypeCaller, Mutect, and Mutect2 based pipelines and is deployable on any generic-CPU-based computing system
