Large-scale Genomic Analysis Enabled by Gordon

Kristopher Standish*^, Tristan M. Carland*,
Glenn K. Lockwood+^, Mahidhar Tatineni+^,
Wayne Pfeiffer+^, Nicholas J. Schork*^

* Scripps Translational Science Institute
+ San Diego Supercomputer Center
^ University of California San Diego

Project funding provided by Janssen R&D
Background

•  Janssen R&D performed whole-genome sequencing on 438 patients
   undergoing treatment for rheumatoid arthritis
•  Problem: correlate response or non-response to drug therapy with
   genetic variants
•  Solution combines multidisciplinary expertise:
   •  Genomic analytics from Janssen R&D and the Scripps Translational
      Science Institute (STSI)
   •  Data-intensive computing from the San Diego Supercomputer Center (SDSC)

Technical Challenges

•  Data volume: raw reads from 438 full human genomes
   •  50 TB of compressed data from Janssen R&D
   •  encrypted on 8× 6 TB SATA RAID enclosures
•  Compute: perform read mapping and variant calling on all genomes
   •  9-step pipeline to achieve high-quality read mapping
   •  5-step pipeline to do group variant calling for analysis
•  Project requirements:
   •  FAST turnaround (assembly in < 2 months)
   •  EFFICIENT (minimum core-hours used)
Read Mapping Pipeline: Looks Uniform from a Traditional HPC Perspective...

[Figure: each step drawn as thread-level parallelism vs. walltime;
dimensions drawn to scale]

1.  Map (BWA)
2.  sam to bam (SAMtools)
3.  Merge Lanes (SAMtools)
4.  Sort (SAMtools)
5.  Mark Duplicates (Picard)
6.  Target Creator (GATK)
7.  Indel Realigner (GATK)
8.  Base Quality Score Recalibration (GATK)
9.  Print Reads (GATK)
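For concreteness, here is a minimal sketch of how the nine steps chain
together for one sample. GATK 3-era and samtools 1.x command syntax is
assumed, and every file name, path, and thread count is hypothetical;
the real runs fanned these steps out across many samples at once.

```python
"""Minimal sketch of the 9-step read-mapping pipeline for one sample.

Assumptions: GATK 3-era and samtools 1.x command syntax; all paths,
file names, and resource settings here are hypothetical.
"""
import subprocess

REF = "ref.fa"                    # reference genome (hypothetical path)
GATK = "GenomeAnalysisTK.jar"     # GATK 3 jar (hypothetical path)
PICARD = "picard.jar"             # Picard combined jar (hypothetical path)
T = 16                            # threads for the multithreaded steps

steps = [
    # 1. Map one lane of paired-end reads (BWA)
    f"bwa mem -t {T} {REF} lane1_R1.fq lane1_R2.fq > lane1.sam",
    # 2. Convert SAM to BAM (SAMtools)
    "samtools view -bS lane1.sam > lane1.bam",
    # 3. Merge per-lane BAMs into one per-sample BAM (SAMtools)
    "samtools merge merged.bam lane1.bam lane2.bam",
    # 4. Coordinate-sort -- the IO-heavy step discussed two slides on
    f"samtools sort -@ {T} -o sorted.bam merged.bam",
    # 5. Mark PCR duplicates (Picard), then index for GATK
    f"java -jar {PICARD} MarkDuplicates I=sorted.bam O=dedup.bam M=dup.txt",
    "samtools index dedup.bam",
    # 6-7. Local realignment around indels (GATK)
    f"java -jar {GATK} -T RealignerTargetCreator -R {REF} -I dedup.bam "
    "-o targets.intervals",
    f"java -jar {GATK} -T IndelRealigner -R {REF} -I dedup.bam "
    "-targetIntervals targets.intervals -o realigned.bam",
    # 8-9. Base quality score recalibration, then emit recalibrated reads
    f"java -jar {GATK} -T BaseRecalibrator -R {REF} -I realigned.bam "
    "-knownSites dbsnp.vcf -o recal.table",
    f"java -jar {GATK} -T PrintReads -R {REF} -I realigned.bam "
    "-BQSR recal.table -o final.bam",
]

for cmd in steps:
    subprocess.run(cmd, shell=True, check=True)  # fail fast on any stage
```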
Read Mapping Pipeline: Non-Traditional Bottlenecks (DRAM & IO)

[Figure: the same nine steps, redrawn as memory requirement vs. walltime;
dimensions drawn to scale]
Sort Step: Bound by Disk IO and Capacity

Problem: 16 threads require...
•  25 GB DRAM
•  3.5 TB local disk
•  1.6 TB input data
which generate...
•  3,500 IOPS (metadata-rich)
•  1 GB/s read rate

Solution: BigFlash
•  64 GB DRAM/node
•  16× 300 GB SSDs (4.4 TB usable local flash)
•  1.6 GB/s from Lustre to SSDs over a dedicated InfiniBand I/O rail
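A sketch of the BigFlash recipe applied to this step, assuming a
hypothetical local SSD mount point and samtools 1.x syntax: stage the
input from Lustre to flash, sort with temporary files on flash so the
metadata-rich IOPS never touch the parallel file system, then copy the
result back.

```python
import subprocess

LUSTRE = "/lustre/scratch/project"   # shared scratch (hypothetical path)
FLASH = "/scratch/ssd"               # 4.4 TB local flash (hypothetical mount)
T = 16

cmds = [
    # Stage the 1.6 TB input BAM to local flash over the dedicated I/O rail
    f"cp {LUSTRE}/merged.bam {FLASH}/merged.bam",
    # Sort on flash; -m caps per-thread sort memory, so 16 threads at
    # ~1.5 GB each stay within the 25 GB DRAM figure quoted above
    f"samtools sort -@ {T} -m 1500M -T {FLASH}/tmp "
    f"-o {FLASH}/sorted.bam {FLASH}/merged.bam",
    # Ship the sorted BAM back to shared scratch for the next stage
    f"cp {FLASH}/sorted.bam {LUSTRE}/sorted.bam",
]
for cmd in cmds:
    subprocess.run(cmd, shell=True, check=True)
```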
Group Variant Calling Pipeline

[Figure: the five steps drawn as walltime vs. thread-level parallelism;
dimensions approximately drawn to scale]

•  Massive data reduction at the first step
•  Reduction in data parallelism
•  Subsequent steps (#2 - #5) offloaded to the campus cluster
   •  1-6 threads each
   •  10-30 min each
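The first step is the heavyweight one. Here is a sketch of a multi-sample
call under GATK 3's HaplotypeCaller syntax (the directory, memory, and
thread settings are hypothetical): passing all 438 recalibrated BAMs to a
single invocation yields one cohort VCF, which is the data reduction
shown above.

```python
import glob
import subprocess

# All 438 recalibrated BAMs from the read-mapping pipeline (hypothetical dir)
bams = sorted(glob.glob("final_bams/*.bam"))
inputs = " ".join(f"-I {bam}" for bam in bams)

# One multi-sample invocation collapses 438 BAMs into a single cohort VCF;
# -nct provides thread-level parallelism within the step (GATK 3 flag)
subprocess.run(
    "java -Xmx48g -jar GenomeAnalysisTK.jar -T HaplotypeCaller "
    f"-R ref.fa {inputs} -nct 6 -o cohort.vcf",
    shell=True, check=True,
)
```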
Footprint on Gordon: CPUs and Storage Used

•  257 TB of Lustre scratch used at peak
•  5,000 cores (30% of Gordon) in use at once
Time to Completion...

•  Overall:
   •  36 core-years of compute used in 6 weeks, equivalent to 310 cores
      running 24/7
   •  57 TB DRAM used (aggregate)
•  Read Mapping (9-step pipeline):
   •  5 weeks including time for learning on Gordon (16 days of compute
      in the public batch queue)
   •  Over 2.5 years of 24/7 compute on a single 8-core workstation
      (> 4 years realistically)
•  Variant Calling (GATK HaplotypeCaller):
   •  5 days and 3 hours on Gordon
   •  10.5 months of 24/7 compute on a 16-core workstation
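The headline equivalence checks out with simple arithmetic:

```python
# 36 core-years consumed over 6 weeks of wall-clock time: how many cores
# would have to run around the clock to match that?
core_years = 36
elapsed_years = 6 / 52                 # six weeks as a fraction of a year
print(core_years / elapsed_years)      # -> 312.0, consistent with ~310 cores
```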
Acknowledgements

Janssen Research & Development:
•  Chris Huang
•  Ed Jaeger
•  Sarah Lamberth
•  Lance Smith
•  Zhenya Cherkas
•  Martin Dellwo
•  Carrie Brodmerkel
•  Sandor Szalma
•  Mark Curran
•  Guna Rajagopal
