Large-scale Genomic Analysis Enabled by Gordon

Kristopher Standish*^, Tristan M. Carland*,
Glenn K. Lockwood+^, Mahidhar Tatineni+^,
Wayne Pfeiffer+^, Nicholas J. Schork*^

* Scripps Translational Science Institute
+ San Diego Supercomputer Center
^ University of California San Diego

Project funding provided by Janssen R&D
Background

•  Janssen R&D performed whole-genome sequencing on 438 patients
   undergoing treatment for rheumatoid arthritis
•  Problem: correlate response or non-response to drug therapy with
   genetic variants
•  Solution combines multidisciplinary expertise:
   •  Genomic analytics from Janssen R&D and the Scripps Translational
      Science Institute (STSI)
   •  Data-intensive computing from the San Diego Supercomputer Center (SDSC)

Technical Challenges

•  Data volume: raw reads from 438 full human genomes
   •  50 TB of compressed data from Janssen R&D
   •  encrypted on 8× 6 TB SATA RAID enclosures
•  Compute: perform read mapping and variant calling on all genomes
   •  9-step pipeline to achieve high-quality read mapping
   •  5-step pipeline to do group variant calling for analysis
•  Project requirements:
   •  FAST turnaround (assembly in < 2 months)
   •  EFFICIENT (minimum core-hours used)
Read Mapping Pipeline: Looks Uniform from a Traditional HPC Perspective...

[Figure: each step drawn as thread-level parallelism vs. walltime;
dimensions drawn to scale]

1.  Map (BWA)
2.  sam to bam (SAMtools)
3.  Merge Lanes (SAMtools)
4.  Sort (SAMtools)
5.  Mark Duplicates (Picard)
6.  Target Creator (GATK)
7.  Indel Realigner (GATK)
8.  Base Quality Score Recalibration (GATK)
9.  Print Reads (GATK)
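For concreteness, here is a minimal sketch of how the nine steps chain
together for one sample. GATK 3-era and samtools 1.x command syntax is
assumed, and every file name, path, and thread count is hypothetical;
the real runs fanned these steps out across many samples at once.

```python
"""Minimal sketch of the 9-step read-mapping pipeline for one sample.

Assumptions: GATK 3-era and samtools 1.x command syntax; all paths,
file names, and resource settings here are hypothetical.
"""
import subprocess

REF = "ref.fa"                    # reference genome (hypothetical path)
GATK = "GenomeAnalysisTK.jar"     # GATK 3 jar (hypothetical path)
PICARD = "picard.jar"             # Picard combined jar (hypothetical path)
T = 16                            # threads for the multithreaded steps

steps = [
    # 1. Map one lane of paired-end reads (BWA)
    f"bwa mem -t {T} {REF} lane1_R1.fq lane1_R2.fq > lane1.sam",
    # 2. Convert SAM to BAM (SAMtools)
    "samtools view -bS lane1.sam > lane1.bam",
    # 3. Merge per-lane BAMs into one per-sample BAM (SAMtools)
    "samtools merge merged.bam lane1.bam lane2.bam",
    # 4. Coordinate-sort -- the IO-heavy step discussed two slides on
    f"samtools sort -@ {T} -o sorted.bam merged.bam",
    # 5. Mark PCR duplicates (Picard), then index for GATK
    f"java -jar {PICARD} MarkDuplicates I=sorted.bam O=dedup.bam M=dup.txt",
    "samtools index dedup.bam",
    # 6-7. Local realignment around indels (GATK)
    f"java -jar {GATK} -T RealignerTargetCreator -R {REF} -I dedup.bam "
    "-o targets.intervals",
    f"java -jar {GATK} -T IndelRealigner -R {REF} -I dedup.bam "
    "-targetIntervals targets.intervals -o realigned.bam",
    # 8-9. Base quality score recalibration, then emit recalibrated reads
    f"java -jar {GATK} -T BaseRecalibrator -R {REF} -I realigned.bam "
    "-knownSites dbsnp.vcf -o recal.table",
    f"java -jar {GATK} -T PrintReads -R {REF} -I realigned.bam "
    "-BQSR recal.table -o final.bam",
]

for cmd in steps:
    subprocess.run(cmd, shell=True, check=True)  # fail fast on any stage
```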
Read Mapping Pipeline: Non-Traditional Bottlenecks (DRAM & IO)

[Figure: the same nine steps, redrawn as memory requirement vs. walltime;
dimensions drawn to scale]
Sort Step: Bound by Disk IO and Capacity

Problem: 16 threads require...
•  25 GB DRAM
•  3.5 TB local disk
•  1.6 TB input data
which generate...
•  3,500 IOPS (metadata-rich)
•  1 GB/s read rate

Solution: BigFlash
•  64 GB DRAM/node
•  16× 300 GB SSDs (4.4 TB usable local flash)
•  1.6 GB/s from Lustre to SSDs over a dedicated InfiniBand I/O rail
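A sketch of the BigFlash recipe applied to this step, assuming a
hypothetical local SSD mount point and samtools 1.x syntax: stage the
input from Lustre to flash, sort with temporary files on flash so the
metadata-rich IOPS never touch the parallel file system, then copy the
result back.

```python
import subprocess

LUSTRE = "/lustre/scratch/project"   # shared scratch (hypothetical path)
FLASH = "/scratch/ssd"               # 4.4 TB local flash (hypothetical mount)
T = 16

cmds = [
    # Stage the 1.6 TB input BAM to local flash over the dedicated I/O rail
    f"cp {LUSTRE}/merged.bam {FLASH}/merged.bam",
    # Sort on flash; -m caps per-thread sort memory, so 16 threads at
    # ~1.5 GB each stay within the 25 GB DRAM figure quoted above
    f"samtools sort -@ {T} -m 1500M -T {FLASH}/tmp "
    f"-o {FLASH}/sorted.bam {FLASH}/merged.bam",
    # Ship the sorted BAM back to shared scratch for the next stage
    f"cp {FLASH}/sorted.bam {LUSTRE}/sorted.bam",
]
for cmd in cmds:
    subprocess.run(cmd, shell=True, check=True)
```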
Group Variant Calling Pipeline

[Figure: the five steps drawn as walltime vs. thread-level parallelism;
dimensions approximately drawn to scale]

•  Massive data reduction at the first step
•  Reduction in data parallelism
•  Subsequent steps (#2 - #5) offloaded to the campus cluster
   •  1-6 threads each
   •  10-30 min each
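The first step is the heavyweight one. Here is a sketch of a multi-sample
call under GATK 3's HaplotypeCaller syntax (the directory, memory, and
thread settings are hypothetical): passing all 438 recalibrated BAMs to a
single invocation yields one cohort VCF, which is the data reduction
shown above.

```python
import glob
import subprocess

# All 438 recalibrated BAMs from the read-mapping pipeline (hypothetical dir)
bams = sorted(glob.glob("final_bams/*.bam"))
inputs = " ".join(f"-I {bam}" for bam in bams)

# One multi-sample invocation collapses 438 BAMs into a single cohort VCF;
# -nct provides thread-level parallelism within the step (GATK 3 flag)
subprocess.run(
    "java -Xmx48g -jar GenomeAnalysisTK.jar -T HaplotypeCaller "
    f"-R ref.fa {inputs} -nct 6 -o cohort.vcf",
    shell=True, check=True,
)
```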
Footprint on Gordon: CPUs and Storage Used

•  257 TB of Lustre scratch used at peak
•  5,000 cores (30% of Gordon) in use at once
Time to Completion...

•  Overall:
   •  36 core-years of compute used in 6 weeks, equivalent to 310 cores
      running 24/7
   •  57 TB DRAM used (aggregate)
•  Read Mapping (9-step pipeline):
   •  5 weeks including time for learning on Gordon (16 days of compute
      in the public batch queue)
   •  Over 2.5 years of 24/7 compute on a single 8-core workstation
      (> 4 years realistically)
•  Variant Calling (GATK HaplotypeCaller):
   •  5 days and 3 hours on Gordon
   •  10.5 months of 24/7 compute on a 16-core workstation
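The headline equivalence checks out with simple arithmetic:

```python
# 36 core-years consumed over 6 weeks of wall-clock time: how many cores
# would have to run around the clock to match that?
core_years = 36
elapsed_years = 6 / 52                 # six weeks as a fraction of a year
print(core_years / elapsed_years)      # -> 312.0, consistent with ~310 cores
```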
Acknowledgements

Janssen Research & Development:
•  Chris Huang
•  Ed Jaeger
•  Sarah Lamberth
•  Lance Smith
•  Zhenya Cherkas
•  Martin Dellwo
•  Carrie Brodmerkel
•  Sandor Szalma
•  Mark Curran
•  Guna Rajagopal
