How to Scale from Workstation through Cloud to HPC in Cryo-EM Processing

Presented by: Dr Lance Wilson

Presentation Outline
What will we discuss…
• Who am I? Where am I from?
• What is cryo-electron microscopy?
• How long does analysis take?
• What impact does hardware have?
• What options do you suggest?
• What processing opportunities exist?
• Next opportunity: Light sheet microscopy!

Who am I and where am I from?
The research context

MASSIVE
… is a data processing engine for Australian
science. It empowers researchers to unlock
impactful research discoveries within scientific data.
Integrative High Performance Computing
• usability by new HPC user communities over raw capacity;
• hardware suited to data processing;
• underpinning high performing wet, experimental and data-focused
laboratories, with growing data processing needs;
• workflows that increase return on investment in instruments and
experiments; and
• porosity and flexibility to serve specific requirements in growth usage
areas, such as the life sciences, machine learning etc.

FEI Titan Krios: Nationally funded project to develop environments for Cryo analysis
MMI Lattice Light Sheet: Nationally funded project to capture and preprocess LLS data
Synchrotron MX: Data Management, Integration between AS and MASSIVE M3
MASSIVE M3: Structural refinement and analysis
Professor Trevor Lithgow
ARC Australian Laureate Fellow
Discovery of new protein transport
machines in bacteria, understanding the
assembly of protein transport machines,
and dissecting the effects of antimicrobial
peptides on antibiotic-resistant
“superbugs”
Chamber details from the
nanomachine that secretes
the toxin that causes cholera.
Research and data by Dr. Iain Hay (Lithgow
lab)

MASSIVE and Characterisation
Virtual Laboratory
How do we partner with researchers…
“Here is your
CD of data…”
“Your data is moving up to a data
management system in the cloud
where you have access to a range
of tools and services to start your
data analysis”

Rich Online Workbenches
How do we complement/replace workstations…
Atom Probe, Structural Biology, Bioinformatics, Cytometry, Cryo-Electron Microscopy, Neutron Beam Imaging, General Imaging Tool, Light Microscopy, General Scientific, X-ray

Cryo-EM:
Titan Krios, Cryo-EM PC: collect movies, images
MyTardis: data storage, sharing
MyData App: raw frames
MyData App: corrected stills, picking
Publication: DOI, reuse
Strudel Desktop & Web: Ctffind, Relion, Frealign
Model

What is Cryo Electron Microscopy?
… and some research context

CryoEM: Computed Tomography + Photogrammetry

http://www.cmu.edu/me/xctf/xrayct/index.html
https://www.maximintegrated.com/en/app-notes/index.mvp/id/4682

How many projections does a clinical CT use?
1000!
http://www.upstate.edu/radiology/education/rsna/ct/rec

https://skfb.ly/TI89

3D reconstruction of the electron density of aE11 Fab’ polyC9 complex

What outcome are the
scientists looking for?
http://www.ebi.ac.uk/pdbe/entry/pdb/5k12

Disease resistance and drug targets
https://doi.org/10.1128/mBio.00395-17

Disease resistance and drug targets
https://doi.org/10.1002/cmdc.201900042

What is the time taken for analysis of a single dataset?

Paper Processing Time
Step                     Time
Protein Purification     2 months
Screening - Neg. Stain   2 months
Screening - Cryo         2 months
Cryo Collection          3 x 2 days
Motion correction        1 hr
CTF correction           5 min
2D classification        1 day
3D classification        1 day
Refinement               3 days
Molecular Model          3 days
Equipment: Tecnai T12, JEOL JEM-1400, Talos Arctica, Titan Krios - K2, Talos - Falcon 3
Legend: Time = Red, Equipment = Blue, Steps = Grey
Totals shown: 8 Months, 300 Steps, 3 Months

How much are GPUs used in preprocessing?

Important quote:
“The structure determination task is further complicated by the lack of information about the relative orientations of all particles and, in the case of structural variability in the sample, also their assignment to a structurally unique class.”

What are the major analysis steps?
Hint: All use GPUs and are done repeatedly.
• 2D Classification
• 3D Classification
• Refinement

What is 2D Classification?
2D classification results are a way to visualize how your 2D images will
cluster together into homogeneous class averages without any model
required.
Source: https://github.com/cianfrocco-lab/Old-school-processing/wiki/2D-classification

2D Classification
The ML2D (Maximum Likelihood) algorithm may be used to simultaneously align and classify single-particle images.
An intrinsic characteristic of the ML approach is that it does not assign images to one particular class or orientation. Instead, images are compared with all references in all possible orientations and probability weights are calculated for each possibility.
https://doi.org/10.1016/S0076-6879(10)82013-0
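
The probability-weighting idea can be sketched in a few lines of Python. This is a toy illustration only, not the ML2D or Relion implementation: it assumes white Gaussian noise with a known sigma, treats the four 90-degree rotations as the only "orientations", and the names rot90 and soft_assign are made up for this example.

```python
import math
import random

def rot90(m):
    """Rotate a square list-of-lists image 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*m)][::-1]

def soft_assign(image, references, sigma):
    """Return {(reference index, rotation index): probability weight}.

    Assumption for this sketch: Gaussian noise, so the log-likelihood of each
    hypothesis is minus the squared difference over 2*sigma^2.
    """
    log_w = {}
    for k, ref in enumerate(references):
        rotated = ref
        for r in range(4):
            sq = sum((a - b) ** 2
                     for row_i, row_r in zip(image, rotated)
                     for a, b in zip(row_i, row_r))
            log_w[(k, r)] = -sq / (2.0 * sigma ** 2)
            rotated = rot90(rotated)
    # Subtract the maximum before exponentiating to avoid underflow.
    top = max(log_w.values())
    w = {key: math.exp(v - top) for key, v in log_w.items()}
    total = sum(w.values())
    return {key: v / total for key, v in w.items()}

def blank(n=8):
    return [[0.0] * n for _ in range(n)]

ref_a = blank()                      # horizontal bar near the top
for i in (1, 2):
    for j in range(1, 7):
        ref_a[i][j] = 1.0
ref_b = blank()                      # centred square
for i in range(2, 6):
    for j in range(2, 6):
        ref_b[i][j] = 1.0

# Synthetic "particle": ref_a rotated once, plus noise.
rng = random.Random(0)
image = [[v + rng.gauss(0.0, 0.3) for v in row] for row in rot90(ref_a)]

weights = soft_assign(image, [ref_a, ref_b], sigma=0.3)
best = max(weights, key=weights.get)
```

Every (reference, rotation) pair keeps a weight, so no image is hard-assigned; `best` is only the most probable hypothesis among them.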

Particle picking for the next step of classification in 2D
Source: https://www.e-sciencecentral.org/articles/Table.php?xn=am/am-48-001&id=

Classification of selected particles in 2D
Source: https://www.e-sciencecentral.org/articles/Table.php?xn=am/am-48-001&id=

What is 3D Classification?
“The premise of 3D classification is to use an initial starting model
to sort your particles into 3 or more groups. By using the 3D
model, Relion uses maximum likelihood methods to create
homogenous classes of particles that will belong to a given
group.”
Source: https://github.com/cianfrocco-lab/Old-school-processing/wiki/Relion-3D-classification

What is the time taken for analysis of a single dataset?

What is the problem?
Antibiotics at crisis point
• Slow time from lead compound to clinic
• Increasing prevalence of resistant bacteria in the clinic
UK AMR review: “Deaths by infection will be larger than both cancer
and diabetes combined by 2050…..”
A big problem is Methicillin-resistant Staphylococcus aureus (MRSA)
and Vancomycin-resistant Enterococci (VRE)…

Australian patient
contracts MRSA
Linezolid
resistance
MDR-MRSA
Isolated
Genome sequenced
(Illumina)
70S ribosomes
Isolated
Ribosomes imaged
on FEI Titan Krios
• Rational redesign of existing drug pharmacophores
• Linezolid
• Virginiamycin (Synercid)
• More tractable chemical synthesis of fully synthetic
derivatives
• Thiostrepton
• Semi-synthetic modifications to help with drug solubility
Patient to Microscope
“Pragmatic Approach”

How long does an analysis take (and why?)?
2017
• ~1-4 TB raw data set/sample, ~2000-5000 files
• Pipeline analysis with internal & external tools
• Requires large-memory GPU (> 8 GB)
• Requires large system memory (> 64 GB)
• Requires 200-400 CPU cores
• Parallel file reads and writes
2019
• ~1-10 TB raw data set/sample, ~2000-10000 files
• Requires large-memory GPU (> 16 GB)
• Requires large system memory (> 18 GB)
• Requires 30-130 CPU cores
• Parallel file reads and writes, and high-ops local cache

How long does each of the steps take?
• 2,500 images
• 150,000 particles
• 260 pixels
Task                 Submitted?  GPU?  Nodes  Time
Import               No          -     -      < 1 min
Motion Correction    Yes         Yes   3      20 min
CTF estimation       Yes         No    1      20 min
Manual Picking       No          -     -      ?
Autopicking          Yes         Yes   2      40 min
Particle Extraction  Yes         No    1      10 min
2D Classification    Yes         Yes   2      10 min/iteration
3D Classification    Yes         Yes   1      10 min/iteration
3D Refine            Yes         Yes   2      5-10 min/iteration
Movie Refine         Yes         No    1      1 hour
Particle Polishing   Yes         No    1      1-2 hours
Mask Creation        No          -     -      5-30 min
Postprocessing       No          -     -      < 1 min

How long does an analysis take (and why?)?
9 DAYS
GPU acceleration
for 30 of 45

How long does an analysis take (and why?)?
2D Classification

Does GPU architecture affect performance?
~2x per generation (Kepler -> Volta)

Does job layout matter between GPU architectures?
Volta peaks at lower MPI ranks, leaves more GPU RAM available

How does GPU RAM affect scientific outcomes?
K3 cameras produce 5k images

Workstation GPUs vs HPC GPUs?
~30% faster, even faster for higher precision

How long does an analysis take (and why?)?
14 DAYS
GPU acceleration
for 11 of 30

How long does an analysis take (and why?)?
2D Classification

How long does each step take (and why?)?
➢ 6 days of compute
➢ 3 days without extraction step

How long does each step take (and why?)?
Extract particles

How long does each step take (and why?)?

How long does each step take (and why?)?
Zoom
change

How long does each step take (and why?)?
Two models for structure

How long does each step take (and why?)?
Improving resolution

How fast can we process this data?
… hardware choices, do they matter?

Hardware Configurations
Workstation: 8 Cores, 64GB RAM, 4 x GTX1070, storage SSD + NAS
Cloud (HPC): Dual Socket, 36 Cores, 384GB RAM, 3 x V100, storage SSD + Lustre
DGX-1V: Dual Socket, 40 Cores, 512GB RAM, 8 x V100, storage SSD

Can we process faster? (Is HPC the option)
Figure: 2D Classification Step Runtime for Different Hardware (1-4 m3g nodes and DGX vs workstation). Job runtime (min) by hardware configuration; legend: 4xm3g, 3xm3g, DGX1-V, Workstation. Bar values shown: 111.62, 29.75, 67.03, 43.85, 38.53, 39.10.

Can we process faster? (do options matter?)
Figure: Hardware and Optimisations. Job runtime (min) by optimisation option (User config; Pre-read particles off; Parallel disk access off; MPI result combine off; Local scratch on); legend: 1xm3g, DGX1-V.

Can we process faster? (do options matter?)
Figure: Hardware and Optimisations. Job runtime (min) by optimisation option (User config; Pre-read particles off; Parallel disk access off; MPI result combine off; Local scratch on).

What is the ratio of cost to performance?
Figure: Cost and Performance Ratios for Workstation vs HPC. Legend: Performance Ratio, Cost Ratio.
Hardware configuration   Series 1   Series 2
1 x GPU node             7.14       1.67
2 x GPU node             14.29      2.55
3 x GPU node             21.43      2.90
4 x GPU node             28.57      2.85
DGX-1V                   28.57      3.75

What retail options for compute exist?
Workstations + Software Support

How fast are the Relion developers' machines?
[lancew@m3-login1 job008]$ pwd
/scratch/br76/relion3tut/relion30_tutorial_precalculated_results/Class2D/job008
[lancew@m3-login1 job008]$ ls --full-time
total 2884
-rw-r--r-- 1 lancew br76 2777 2018-06-07 23:32:58.000000000 +1000 default_pipeline.star
-rw-r--r-- 1 lancew br76 840 2018-06-07 23:32:58.000000000 +1000 job_pipeline.star
-rw-r--r-- 1 lancew br76 448 2018-06-07 23:32:58.000000000 +1000 note.txt
-rw-r--r-- 1 lancew br76 0 2018-06-07 23:32:57.000000000 +1000 run.err
-rw-r--r-- 1 lancew br76 820224 2018-06-07 23:32:58.000000000 +1000 run_it025_classes.mrcs
-rw-r--r-- 1 lancew br76 740401 2018-06-07 23:32:57.000000000 +1000 run_it025_data.star
-rw-r--r-- 1 lancew br76 223301 2018-06-07 23:32:57.000000000 +1000 run_it025_model.star
-rw-r--r-- 1 lancew br76 5916 2018-06-07 23:32:58.000000000 +1000 run_it025_optimiser.star
-rw-r--r-- 1 lancew br76 458 2018-06-07 23:32:56.000000000 +1000 run_it025_sampling.star
-rw-r--r-- 1 lancew br76 1234 2018-06-07 23:32:57.000000000 +1000 run.job
-rw-r--r-- 1 lancew br76 307687 2018-06-07 23:32:58.000000000 +1000 run.out
-rw-r--r-- 1 lancew br76 820224 2018-06-07 23:32:58.000000000 +1000 run_unmasked_classes.mrcs

Why do these systems look attractive to researchers?
Source: https://www.exxactcorp.com/Relion-for-Cryo-EM-Solutions
DO MORE SCIENCE!
Have peace of mind, focus on what matters
most, knowing your system is backed by a 3
year warranty and support.
PLUG AND PLAY
Exxact systems are fully
turnkey, built to perform
right out of the box so
you avoid the drudgery of
configuration and setup.

Why do these systems look attractive to researchers?
Source: https://www.exxactcorp.com/Relion-for-Cryo-EM-Solutions
Software:
System: CentOS 7, CUDA Toolkit v9.2, Evince, FFTW3, Gnuplot, MPICH v3.2.1
Research: Unblur v1.0.2, EMAN v2.21, IMOD v4.9.9, Relion 3, Motioncorr, MotionCor2, Gctf v1.06, CTFFind4.1.10, ResMap v1.1.4, Summovie v1.0.2, Chimera v1.13
PRECONFIGURED
“Relion 3.0 beta and v2.1 With example job
submission scripts, benchmarks, a fully
validated test suite, and the latest software
patches for quick implementation.”

Why do these systems look attractive to researchers?
Source: http://www.linuxvixion.com/wp-content/uploads/2016/04/linuxvixion-molecular-dynamics-2016-english.pdf

Source: https://aws.amazon.com/ec2/instance-types/p3/
What are the cloud options? …and cost?
$25/hr x 14 days = $8400
Source: http://labphoto.tumblr.com/post/105112424006/cyanuric-triazide-or-2-4-6-triazido-1-3-5-triazine
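
The cloud cost figure is simple arithmetic on the two numbers from the slide ($25/hr and a 14-day analysis). The sketch below assumes the instance runs continuously for the whole period and ignores storage and data-transfer charges; the function name cloud_cost is made up for this example.

```python
def cloud_cost(hourly_rate_usd, days):
    """Cost of keeping one instance running around the clock for `days` days."""
    return hourly_rate_usd * 24 * days

# Figures from the slide: $25 per hour, 14 days of analysis.
total = cloud_cost(25, 14)
print(f"${total}")
```

The same function gives the cost of the shorter 9-day analysis mentioned earlier by changing `days`.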

What has changed in 2 years?
Mostly hardware … and a little ML

Options for Computation (2017)
Workstation
Pro: Full user control
Con: Limited by single machine
Cloud
Pro: Scales easily
Con: Cost, complexity, data movement
HPC
Pro: Huge resources
Con: Tightly controlled, shared

Options for Computation (2019)
Workstation
Pro: Full user control
Con: Limited by single machine
Cloud
Pro: Scales easily
Con: Cost, complexity, data movement
HPC
Pro: Huge resources, purpose designed
Con: Tightly controlled, shared
Containers

Figure: Relion - 3D Classification Step (2017). Processing time (mins) vs number of cores (threads: 4, 8, 16) for K80 and DGX.
K80 = 24 x CPU, 256GB RAM, 4 x K80 GPU
DGX-1 = 32 x CPU, 512GB RAM, 8 x P100

Figure: Relion - 3D Classification Step (2019). Processing time (mins) vs number of cores (threads: 4, 8, 16) for K80, DGX and DGX1-V.
DGX-1 = 32 x CPU, 512GB RAM, 8 x P100
DGX-1V = 40 x CPU, 512GB RAM, 8 x V100

Figure: Relion Class 3D - MPI Task Effects (2017). Processing time vs MPI tasks (3, 5, 9, 13, 17) for K80 and DGX.
DGX-1 = 32 x CPU, 512GB RAM, 8 x P100

Figure: Relion Class 3D - MPI Task Effects (2019). Processing time vs MPI tasks (3, 5, 9, 13, 17) for K80, DGX and DGX-1V.
DGX-1 = 32 x CPU, 512GB RAM, 8 x P100

Iteration Processing Time vs Number of DGX-1 Servers
DGX-1 = 32 x CPU, 512GB RAM, 8 x P100
Number of DGX-1 servers   Processing time for first iteration (mins)
1                         39.75
2                         19.92
3                         13.12
4                         9.88
Linear speed up !!!
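
The "linear speed up" claim can be checked directly from the four timings on this slide (39.75, 19.92, 13.12 and 9.88 minutes for 1 to 4 DGX-1 servers). The sketch below computes speedup relative to one server and parallel efficiency (speedup divided by server count); the function name scaling_table is made up for this example.

```python
# First-iteration processing time in minutes, keyed by number of DGX-1 servers.
times_min = {1: 39.75, 2: 19.92, 3: 13.12, 4: 9.88}

def scaling_table(times):
    """Return (servers, speedup, efficiency) rows relative to the 1-server time."""
    base = times[1]
    rows = []
    for n in sorted(times):
        speedup = base / times[n]
        rows.append((n, speedup, speedup / n))
    return rows

rows = scaling_table(times_min)
for n, speedup, efficiency in rows:
    print(f"{n} server(s): speedup {speedup:.2f}x, efficiency {efficiency:.1%}")
```

An efficiency close to 100% at every server count is what "linear" means here; these timings stay within about 1% of that.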

What has stayed the same in 2 years?
Mostly software … still a struggle

Software Dependencies
• Relion has two main parts
  • Internal programs
  • External programs
• List of dependencies
  • FFTW
  • FLTK
  • OpenMPI

How to accelerate a solution with HPC?
Using a real example (K2)

What are the optimisation options?
--j
The number of parallel threads to run on each CPU. We often use 4-6.
--dont_combine_weights_via_disc
By default large messages are passed between MPI processes through reading and writing of large files on the computer disk. By giving this option, the messages will be passed through the network instead. We often use this option.
--no_parallel_disc_io
By default, all MPI slaves read their own particles (from disk or into RAM). Use this option to have the master read all particles, and then send them all through the network. We do not often use this option.
--preread_images
By default, all particles are read from the computer disk in every iteration. Using this option, they are all read into RAM once, at the very beginning of the job instead. We often use this option if the machine has enough RAM (more than N*boxsize*boxsize*4 bytes) to store all N particles.
--scratch_dir
By default, particles are read every iteration from the location specified in the input STAR file. By using this option, all particles are copied to a scratch disk, from where they will be read (every iteration) instead. We often use this option if we don't have enough RAM to read in all the particles, but we have large enough fast SSD scratch disk(s) (e.g. mounted as /tmp).
Source: https://www3.mrc-lmb.cam.ac.uk/relion/index.php/Benchmarks_%26_computer_hardware
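
The RAM rule quoted for --preread_images (more than N*boxsize*boxsize*4 bytes) can be turned into a small decision helper. This is a sketch, not part of Relion: the function names are made up, the 150,000 particles and 260 pixels come from the timing slide earlier in this deck (260 is assumed here to be the box size), and the free-RAM figures are illustrative assumptions.

```python
def preread_bytes(n_particles, box_size):
    """RAM needed to hold all particles: N * boxsize * boxsize * 4 bytes."""
    return n_particles * box_size * box_size * 4

def choose_io_flags(n_particles, box_size, free_ram_bytes, scratch_dir=None, threads=4):
    """Pick between --preread_images and --scratch_dir using the rule above.

    Only flags named on the slide are used; the selection logic itself is
    a hypothetical helper for illustration.
    """
    flags = ["--j", str(threads), "--dont_combine_weights_via_disc"]
    if preread_bytes(n_particles, box_size) < free_ram_bytes:
        flags.append("--preread_images")
    elif scratch_dir is not None:
        flags += ["--scratch_dir", scratch_dir]
    return flags

needed = preread_bytes(150_000, 260)          # about 40.6 GB
# Assumed free RAM: 64 GB on one machine, 32 GB on another.
roomy = choose_io_flags(150_000, 260, free_ram_bytes=64 * 10**9)
tight = choose_io_flags(150_000, 260, free_ram_bytes=32 * 10**9, scratch_dir="/tmp")
print(needed, roomy, tight)
```

Free RAM is what matters, not installed RAM, so a 64 GB workstation running other jobs may still need the scratch-disk route.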

What are the optimisation options?
Disable MPI for results

What are the optimisation options?
Very little difference

What are the optimisation options?
14% Slower

How are the CPUs used? … How many are needed?

How are the CPUs used? … How many are needed?

How are the CPUs used? … How many are needed?
CPU usage divided between I/O and analysis

GPU Performance Measurements
How efficient is the software…
Latest generation GPU hardware enables very detailed performance metrics.
Utilization rates report how busy each GPU is over time, and can be used to determine how much an application is using the GPUs in the system.
GPU: Percent of time over the past sample period during which one or more kernels was executing on the GPU.
Memory: Percent of time over the past sample period during which global (device) memory was being read or written.
PCIe Rx and Tx Throughput in MB/s
https://developer.download.nvidia.com/compute/DCGM/docs/nvidia-smi-367.38.pdf
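
A minimal sketch of collecting the GPU and memory utilization rates described above, using the CSV output of nvidia-smi. The parsing functions are hypothetical helpers, and the sample text uses made-up numbers purely to exercise the parser; it is not a measurement from this talk. PCIe throughput is not included in this query.

```python
import csv
import io

# Command one could run on a machine with the NVIDIA driver installed.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,utilization.memory,memory.used,power.draw",
    "--format=csv,noheader,nounits",
]

FIELDS = ("index", "gpu_util_pct", "mem_util_pct", "mem_used_mib", "power_w")

def parse_gpu_csv(text):
    """Parse 'csv,noheader,nounits' output into one dict per GPU."""
    rows = []
    for rec in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not rec:
            continue
        row = dict(zip(FIELDS, (float(x) for x in rec)))
        row["index"] = int(row["index"])
        rows.append(row)
    return rows

def mean_gpu_utilisation(rows):
    """Average kernel-busy percentage across all GPUs in one sample."""
    return sum(r["gpu_util_pct"] for r in rows) / len(rows)

# Made-up sample in the same layout as the real command output.
sample = "0, 40, 10, 12000, 50.00\n1, 30, 8, 11800, 48.50\n"
gpus = parse_gpu_csv(sample)
print(mean_gpu_utilisation(gpus))
```

On a GPU node, `subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout` would supply the text in place of `sample`; sampling it periodically during a job gives utilization traces over time.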

What are the GPUs doing? … are they busy?

What are the GPUs doing? … are they busy?

What are the GPUs doing? … are they busy?
Relion3 Tutorial Dataset
37%

What are the GPUs doing? … are they busy?
1.7%

What are the GPUs doing with memory?
12GB

How fast is data copied to memory?

How much power is consumed? … global warming?
50W

How much cooling is needed? … global warming?

Where are the opportunities for reducing “Time to Science”?

Relion 3 – CPU vs GPU?
Where is development heading…
• Limitations of memory
• Increased performance from new CPUs
“However, the GPU acceleration is only available for cards from a single vendor, and it cannot use many of the large-scale computational resources available in existing centres, local clusters, or even researchers’ laptops. In addition, the memory available on typical GPUs limits the box sizes that can be used, which could turn into a severe bottleneck for large particles. For relion-3, we have developed a new general code path where CPU algorithms have been rewritten to mirror the GPU acceleration, which provides dual execution branches of the code that work very efficiently both on accelerators as well as the single-instruction, multiple-data vector units present on traditional CPUs.”
https://www.biorxiv.org/content/biorxiv/early/2018/09/19/421123.full.pdf

Opportunities for
speed up
❏ Cryosparc
❏ Cryolo
❏ Preprocessing
tools
Source: https://cryosparc.com/

Source: http://sphire.mpg.de/wiki/doku.php?id=downloads:cryolo_1&redirect=1

Cryo-EM Processing Tool
https://github.com/Characterisation-Virtual-Laboratory/Cryo-EM-Processing-Tool

Next Generation: 4D+
Volumetric Imaging
Lattice light sheet microscopy

Mitochondrial DNA
(green) escaping
mitochondria (red)
during cell death.
Dr Kate McArthur (Monash BDI) and Dr
Lachlan Whitehead (WEHI) and The
Advanced Imaging Centre at Janelia
Research Campus.
Credit: Steve Morton

Acknowledgments
Dr Matt Belousof
Hari Venogopal
Jafar Lie
Jay Van Schyndel
MASSIVE Partners
ARDC

THANK YOU
Final messages:
✓ HPC is the best option, if you have interactive queues.
✓ Cloud is good for one-off analysis.
✓ Workstations make great training/learning platforms.
For future questions: lance.wilson@monash.edu
