         Updated: February 4, 2013
Molecular Dynamics (MD) Applications

AMBER
  Features Supported: PMEMD explicit solvent & GB implicit solvent
  GPU Perf: > 100 ns/day (JAC NVE on 2x K20s)
  Release Status: Released; multi-GPU, multi-node
  Notes/Benchmarks: AMBER 12, GPU revision support 12.2; http://ambermd.org/gpus/benchmarks.htm#Benchmarks

CHARMM
  Features Supported: Implicit (5x) and explicit (2x) solvent via OpenMM
  GPU Perf: 2x C2070 equals 32-35x X5667 CPUs
  Release Status: Released; single & multi-GPU in a single node
  Notes/Benchmarks: Release C37b1; http://www.charmm.org/news/c37b1.html#postjump

DL_POLY
  Features Supported: Two-body forces, link-cell pairs, Ewald SPME forces, Shake VV
  GPU Perf: 4x
  Release Status: Release V 4.03; multi-GPU, multi-node
  Notes/Benchmarks: Source only, results published; http://www.stfc.ac.uk/CSE/randd/ccg/software/DL_POLY/25526.aspx

GROMACS
  Features Supported: Implicit (5x) and explicit (2x) solvent
  GPU Perf: 165 ns/day (DHFR on 4x C2075s)
  Release Status: Released; multi-GPU, multi-node
  Notes/Benchmarks: Release 4.6; first multi-GPU support

LAMMPS
  Features Supported: Lennard-Jones, Gay-Berne, Tersoff & many more potentials
  GPU Perf: 3.5-18x on Titan
  Release Status: Released; multi-GPU, multi-node
  Notes/Benchmarks: http://lammps.sandia.gov/bench.html#desktop and http://lammps.sandia.gov/bench.html#titan

NAMD
  Features Supported: Full electrostatics with PME and most simulation features
  GPU Perf: 4.0 ns/day (F1-ATPase on 1x K20X)
  Release Status: Released; 100M-atom capable; multi-GPU, multi-node
  Notes/Benchmarks: NAMD 2.9

GPU Perf compared against a multi-core x86 CPU socket. GPU Perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.
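The packages above differ widely in features, but the GPU-accelerated core is the same inner loop: evaluate pair forces, then integrate Newton's equations. As a purely illustrative sketch (not code from any listed package), here is a minimal Lennard-Jones velocity-Verlet step in Python; the O(N^2) pair loop is exactly the part production codes parallelize on the GPU:

```python
def lj_force(dx, dy, dz, eps=1.0, sigma=1.0):
    """Force components on particle i from particle j, 12-6 Lennard-Jones."""
    r2 = dx * dx + dy * dy + dz * dz
    inv_r2 = 1.0 / r2
    s6 = (sigma * sigma * inv_r2) ** 3
    # F = 24*eps*(2*s12 - s6)/r^2 * r_vec
    f_over_r = 24.0 * eps * (2.0 * s6 * s6 - s6) * inv_r2
    return f_over_r * dx, f_over_r * dy, f_over_r * dz

def velocity_verlet(pos, vel, dt, n_steps, mass=1.0):
    """Advance positions/velocities with the velocity-Verlet scheme."""
    n = len(pos)

    def forces():
        f = [[0.0, 0.0, 0.0] for _ in range(n)]
        for i in range(n):            # all-pairs loop: the GPU-parallel hot spot
            for j in range(i + 1, n):
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                dz = pos[i][2] - pos[j][2]
                fx, fy, fz = lj_force(dx, dy, dz)
                f[i][0] += fx; f[i][1] += fy; f[i][2] += fz
                f[j][0] -= fx; f[j][1] -= fy; f[j][2] -= fz
        return f

    f = forces()
    for _ in range(n_steps):
        for i in range(n):
            for k in range(3):
                vel[i][k] += 0.5 * dt * f[i][k] / mass   # half-kick
                pos[i][k] += dt * vel[i][k]              # drift
        f = forces()
        for i in range(n):
            for k in range(3):
                vel[i][k] += 0.5 * dt * f[i][k] / mass   # second half-kick
    return pos, vel

# Two particles beyond the potential minimum (r = 1.5 sigma) attract each other.
pos = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]]
vel = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
pos, vel = velocity_verlet(pos, vel, dt=0.001, n_steps=100)
```

Real codes replace the all-pairs loop with cell/neighbor lists (DL_POLY's "link-cell pairs" above) and add long-range electrostatics (PME), but the integration skeleton is the same.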
New/Additional MD Applications Ramping

Abalone
  Features Supported: Simulations
  GPU Perf: 4-29X (on 1060 GPU)
  Release Status: Released, Version 1.8.51; single GPU
  Notes: Agile Molecule, Inc.

Ascalaph
  Features Supported: Computation of non-valent interactions
  GPU Perf: 4-29X (on 1060 GPU)
  Release Status: Released, Version 1.1.4; single GPU
  Notes: Agile Molecule, Inc.

ACEMD
  Features Supported: Written for use only on GPUs
  GPU Perf: 150 ns/day (DHFR on 1x K20)
  Release Status: Released; single and multi-GPU
  Notes: Production bio-molecular dynamics (MD) software specially optimized to run on GPUs

Folding@Home
  Features Supported: Powerful distributed-computing molecular dynamics system; implicit solvent and folding
  GPU Perf: Depends upon number of GPUs
  Release Status: Released; GPUs and CPUs
  Notes: http://folding.stanford.edu; GPUs get 4X the points of CPUs

GPUGrid.net
  Features Supported: High-performance all-atom biomolecular simulations; explicit solvent and binding
  GPU Perf: Depends upon number of GPUs
  Release Status: Released; NVIDIA GPUs only
  Notes: http://www.gpugrid.net/

HALMD
  Features Supported: Simple fluids and binary mixtures (pair potentials, high-precision NVE and NVT, dynamic correlations)
  GPU Perf: Up to 66x on 2090 vs. 1 CPU core
  Release Status: Released, Version 0.2.0; single GPU
  Notes: http://halmd.org/benchmarks.html#supercooled-binary-mixture-kob-andersen

HOOMD-Blue
  Features Supported: Written for use only on GPUs
  GPU Perf: Kepler 2X faster than Fermi
  Release Status: Released, Version 0.11.2; single and multi-GPU on 1 node
  Notes: http://codeblue.umich.edu/hoomd-blue/; multi-GPU w/ MPI in March 2013

OpenMM
  Features Supported: Implicit and explicit solvent, custom forces
  GPU Perf: Implicit: 127-213 ns/day; explicit: 18-55 ns/day (DHFR)
  Release Status: Released, Version 4.1.1; multi-GPU
  Notes: Library and application for molecular dynamics on high-performance …
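The ns/day throughput figures quoted above (e.g. ACEMD's 150 ns/day on DHFR) translate directly into wall-clock cost for a target amount of simulated time. A small helper to make that arithmetic explicit (illustrative only; the function names are ours, not from any listed package):

```python
def wall_clock_days(target_ns, ns_per_day):
    """Wall-clock days needed to simulate target_ns at a sustained ns/day rate."""
    return target_ns / ns_per_day

def speedup(ns_per_day_a, ns_per_day_b):
    """Relative throughput of two configurations reporting ns/day."""
    return ns_per_day_a / ns_per_day_b

# A 1-microsecond (1000 ns) trajectory at 150 ns/day takes about a week:
days = wall_clock_days(1000, 150.0)   # ~6.7 days
```

This is why the tables report ns/day rather than raw FLOPS: it is the number a practitioner divides into the simulated time they need.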
Quantum Chemistry Applications

Abinit
  Features Supported: Local Hamiltonian, non-local Hamiltonian, LOBPCG algorithm, diagonalization/orthogonalization
  GPU Perf: 1.3-2.7X
  Release Status: Released, Version 7.0.5; multi-GPU support
  Notes: www.abinit.org

ACES III
  Features Supported: Integrating GPU scheduling into the SIAL programming language and SIP runtime environment
  GPU Perf: 10X on kernels
  Release Status: Under development; multi-GPU support
  Notes: http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/deumens_ESaccel_2012.pdf

ADF
  Features Supported: Fock matrix, Hessians
  GPU Perf: TBD
  Release Status: Pilot project completed, under development; multi-GPU support
  Notes: www.scm.com

BigDFT
  Features Supported: DFT; Daubechies wavelets; part of Abinit
  GPU Perf: 5-25X (1 CPU core to GPU kernel)
  Release Status: Released June 2009, current release 1.6.0; multi-GPU support
  Notes: http://inac.cea.fr/L_Sim/BigDFT/news.html, http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/BigDFT-Formalism.pdf and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/BigDFT-HPC-tues.pdf

Casino
  Features Supported: TBD
  GPU Perf: TBD
  Release Status: Under development, Spring 2013 release; multi-GPU support
  Notes: http://www.tcm.phy.cam.ac.uk/~mdt26/casino.html

CP2K
  Features Supported: DBCSR (sparse matrix multiply library)
  GPU Perf: 2-7X
  Release Status: Under development; multi-GPU support
  Notes: http://www.olcf.ornl.gov/wp-content/training/ascc_2012/friday/ACSS_2012_VandeVondele_s.pdf

GAMESS-US
  Features Supported: Libqc with Rys quadrature algorithm, Hartree-Fock; MP2 and CCSD in Q4 2012
  GPU Perf: 1.3-1.6X; 2.3-2.9x HF
  Release Status: Released; multi-GPU support
  Notes: Next release Q4 2012. http://www.msg.ameslab.gov/gamess/index.html
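CP2K's GPU effort centres on DBCSR, a blocked sparse matrix multiply library: the matrix is stored as small dense blocks, and each matching block pair becomes one small dense multiply, which is the unit of work a GPU can batch. A toy single-node sketch of that blocking idea in plain Python (our own illustration, not DBCSR's actual data structures or API):

```python
def dense_mm(A, B):
    """Small dense matrix product (matrices as lists of rows)."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

def block_add(C, D):
    """Accumulate dense block D into dense block C in place."""
    for i in range(len(C)):
        for j in range(len(C[0])):
            C[i][j] += D[i][j]

def blocked_sparse_mm(A_blocks, B_blocks):
    """Blocked sparse product: inputs map (block_row, block_col) -> dense block.
    Each matching (bi,bk) x (bk,bj) pair is one small dense multiply --
    the kind of batched kernel a GPU executes efficiently."""
    C_blocks = {}
    for (bi, bk), Ab in A_blocks.items():
        for (bk2, bj), Bb in B_blocks.items():
            if bk == bk2:
                prod = dense_mm(Ab, Bb)
                if (bi, bj) in C_blocks:
                    block_add(C_blocks[(bi, bj)], prod)
                else:
                    C_blocks[(bi, bj)] = prod
    return C_blocks

# Two 2x2-block matrices with 2x2 dense blocks, mostly empty.
I2 = [[1.0, 0.0], [0.0, 1.0]]
two = [[2.0, 0.0], [0.0, 2.0]]
A = {(0, 0): I2, (1, 1): two}
B = {(0, 0): two, (1, 0): I2}
C = blocked_sparse_mm(A, B)
```

The payoff of blocking is that sparsity is handled by the block index while the inner arithmetic stays dense and regular, which is what the 2-7X figure above is measuring.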
GAMESS-UK
  Features Supported: (ss|ss)-type integrals within calculations using Hartree-Fock ab initio methods and density functional theory; supports organics & inorganics
  GPU Perf: 8x
  Release Status: Release in 2012; multi-GPU support
  Notes: http://www.ncbi.nlm.nih.gov/pubmed/21541963

Gaussian
  Features Supported: Joint PGI, NVIDIA & Gaussian collaboration
  GPU Perf: TBD
  Release Status: Under development; multi-GPU support
  Notes: Announced Aug. 29, 2011; http://www.gaussian.com/g_press/nvidia_press.htm

GPAW
  Features Supported: Electrostatic Poisson equation, orthonormalizing of vectors, residual minimization method (RMM-DIIS)
  GPU Perf: 8x
  Release Status: Released; multi-GPU support
  Notes: https://wiki.fysik.dtu.dk/gpaw/devel/projects/gpu.html, Samuli Hakala (CSC Finland) & Chris O'Grady (SLAC)

Jaguar
  Features Supported: Investigating GPU acceleration
  GPU Perf: TBD
  Release Status: Under development; multi-GPU support
  Notes: Schrodinger, Inc.; http://www.schrodinger.com/kb/278

MOLCAS
  Features Supported: CUBLAS support
  GPU Perf: 1.1x
  Release Status: Released, Version 7.8; single GPU, additional GPU support coming in Version 8
  Notes: www.molcas.org

MOLPRO
  Features Supported: Density-fitted MP2 (DF-MP2), density-fitted local correlation methods (DF-RHF, DF-KS), DFT
  GPU Perf: 1.7-2.3X projected
  Release Status: Under development; multiple GPUs
  Notes: www.molpro.net; Hans-Joachim Werner
MOPAC2009
  Features Supported: Pseudodiagonalization, full diagonalization, and density matrix assembling
  GPU Perf: 3.8-14X
  Release Status: Under development; single GPU
  Notes: Academic port; http://openmopac.net

NWChem
  Features Supported: Triples part of Reg-CCSD(T), CCSD & EOMCCSD task schedulers
  GPU Perf: 3-10X projected
  Release Status: Release targeting March 2013; multiple GPUs
  Notes: Development GPGPU benchmarks: www.nwchem-sw.org and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Krishnamoorthy-ESCMA12.pdf

Octopus
  Features Supported: DFT and TDDFT
  GPU Perf: TBD
  Release Status: Released
  Notes: http://www.tddft.org/programs/octopus/

PEtot
  Features Supported: Density functional theory (DFT) plane-wave pseudopotential calculations
  GPU Perf: 6-10X
  Release Status: Released; multi-GPU
  Notes: First-principles materials code that computes the behavior of the electron structures of materials

Q-CHEM
  Features Supported: RI-MP2
  GPU Perf: 8x-14x
  Release Status: Released, Version 4.0
  Notes: http://www.q-chem.com/doc_for_web/qchem_manual_4.0.pdf
QMCPACK
  Features Supported: Main features
  GPU Perf: 3-4x
  Release Status: Released; multiple GPUs
  Notes: NCSA, University of Illinois at Urbana-Champaign; http://cms.mcc.uiuc.edu/qmcpack/index.php/GPU_version_of_QMCPACK

Quantum Espresso/PWscf
  Features Supported: PWscf package: linear algebra (matrix multiply), explicit computational kernels, 3D FFTs
  GPU Perf: 2.5-3.5x
  Release Status: Released, Version 5.0; multiple GPUs
  Notes: Created by Irish Centre for High-End Computing; http://www.quantum-espresso.org/index.php and http://www.quantum-espresso.org/

TeraChem
  Features Supported: "Full GPU-based solution"
  GPU Perf: 44-650X vs. GAMESS CPU version
  Release Status: Released, Version 1.5; multi-GPU/single node
  Notes: Completely redesigned to exploit GPU parallelism. YouTube: http://youtu.be/EJODzk6RFxE?hd=1 and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Luehr-ESCMA.pdf

VASP
  Features Supported: Hybrid Hartree-Fock DFT functionals including exact exchange
  GPU Perf: 2x (2 GPUs comparable to 128 CPU cores)
  Release Status: Available on request; multiple GPUs
  Notes: By Carnegie Mellon University; http://arxiv.org/pdf/1111.0716.pdf

WL-LSMS
  Features Supported: Generalized Wang-Landau method
  GPU Perf: 3x with 32 GPUs vs. 32 (16-core) CPUs
  Release Status: Under development; multi-GPU support
  Notes: Electronic Structure Determination Workshop 2012, NICS; http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Eisenbach_OakRidge_February.pdf
Viz, "Docking" and Related Applications Growing

Amira 5®
  Features Supported: 3D visualization of volumetric data and surfaces
  GPU Perf: 70x
  Release Status: Released, Version 5.3.3; single GPU
  Notes: Visualization from Visage Imaging. Next release, 5.4, will use the GPU for general-purpose processing in some functions. http://www.visageimaging.com/overview.html

BINDSURF
  Features Supported: Allows fast processing of large ligand databases
  GPU Perf: 100X
  Release Status: Available upon request to authors; single GPU
  Notes: High-throughput parallel blind virtual screening; http://www.biomedcentral.com/1471-2105/13/S14/S13

BUDE
  Features Supported: Empirical free-energy forcefield
  GPU Perf: 6.5-13.4X
  Release Status: Released; single GPU
  Notes: University of Bristol; http://www.bris.ac.uk/biochemistry/cpfg/bude/bude.htm

Core Hopping
  Features Supported: GPU-accelerated application
  GPU Perf: 3.75-5000X
  Release Status: Released, Suite 2011; single and multi-GPU
  Notes: Schrodinger, Inc.; http://www.schrodinger.com/products/14/32/

FastROCS
  Features Supported: Real-time shape similarity searching/comparison
  GPU Perf: 800-3000X
  Release Status: Released; single and multi-GPU
  Notes: OpenEye Scientific Software; http://www.eyesopen.com/fastrocs

PyMOL
  Features Supported: Lines: 460% increase; cartoons: 1246% increase; surface: 1746% increase; spheres: 753% increase; ribbon: 426% increase
  GPU Perf: 1700x
  Release Status: Released, Version 1.5; single GPU
  Notes: http://pymol.org/

VMD
  Features Supported: High-quality rendering, large structures (100 million atoms), analysis and visualization tasks, multiple GPU support for display of molecular …
  GPU Perf: 100-125X or greater on kernels
  Release Status: Released, Version 1.9
  Notes: Visualization from University of Illinois at Urbana-Champaign; http://www.ks.uiuc.edu/Research/vmd/
Bioinformatics Applications

BarraCUDA
  Features Supported: Alignment of short sequencing reads
  GPU Speedup: 6-10x
  Release Status: Version 0.6.2 – 3/2012; multi-GPU, multi-node
  Website: http://seqbarracuda.sourceforge.net/

CUDASW++
  Features Supported: Parallel Smith-Waterman database search
  GPU Speedup: 10-50x
  Release Status: Version 2.0.8 – Q1/2012; multi-GPU, multi-node
  Website: http://sourceforge.net/projects/cudasw/

CUSHAW
  Features Supported: Parallel, accurate long-read aligner for large genomes
  GPU Speedup: 10x
  Release Status: Version 1.0.40 – 6/2012; multiple GPUs
  Website: http://cushaw.sourceforge.net/

GPU-BLAST
  Features Supported: Protein alignment according to BLASTP
  GPU Speedup: 3-4x
  Release Status: Version 2.2.26 – 3/2012; single GPU
  Website: http://eudoxus.cheme.cmu.edu/gpublast/gpublast.html

GPU-HMMER
  Features Supported: Parallel local and global search of Hidden Markov Models
  GPU Speedup: 60-100x
  Release Status: Version 2.3.2 – Q1/2012; multi-GPU, multi-node
  Website: http://www.mpihmmer.org/installguideGPUHMMER.htm

mCUDA-MEME
  Features Supported: Scalable motif discovery algorithm based on MEME
  GPU Speedup: 4-10x
  Release Status: Version 3.0.12; multi-GPU, multi-node
  Website: https://sites.google.com/site/yongchaosoftware/mcuda-meme

SeqNFind
  Features Supported: Hardware and software for reference assembly, blast, SW, HMM, de novo assembly
  GPU Speedup: 400x
  Release Status: Released; multi-GPU, multi-node
  Website: http://www.seqnfind.com/

UGENE
  Features Supported: Fast short-read alignment
  GPU Speedup: 6-8x
  Release Status: Version 1.11 – 5/2012; multi-GPU, multi-node
  Website: http://ugene.unipro.ru/
                                                           GPU Perf compared against same or similar code running on single CPU machine
                Parallel linear regression on                                           Performance measured internally or independently
GPU Accelerated Computational Chemistry Applications

MD Average Speedups

[Bar chart: performance relative to CPU only (0-10x) for CPU, CPU + K10, CPU + K20, CPU + K20X, CPU + 2x K10, CPU + 2x K20, and CPU + 2x K20X]

The blue node contains Dual E5-2687W CPUs (8 cores per CPU). The green nodes contain Dual E5-2687W CPUs (8 cores per CPU) and 1 or 2 NVIDIA K10, K20, or K20X GPUs.

Average speedup calculated from 4 AMBER, 3 NAMD, 3 LAMMPS, and 1 GROMACS test cases. Error bars show the maximum and minimum speedup for each hardware configuration.
Molecular Dynamics (MD) Applications

Application | Features Supported | GPU Perf | Release Status | Notes/Benchmarks

AMBER | PMEMD explicit solvent & GB implicit solvent | > 100 ns/day (JAC NVE on 2x K20s) | Released, multi-GPU, multi-node | AMBER 12, GPU Support Revision 12.2; http://ambermd.org/gpus/benchmarks.htm#Benchmarks

CHARMM | Implicit (5x) and explicit (2x) solvent via OpenMM | 2x C2070 equals 32-35x X5667 CPUs | Released, single & multi-GPU in a single node | Release C37b1; http://www.charmm.org/news/c37b1.html#postjump

DL_POLY | Two-body forces, link-cell pairs, Ewald SPME forces, Shake VV | 4x | Release V4.03, multi-GPU, multi-node | Source only, results published; http://www.stfc.ac.uk/CSE/randd/ccg/software/DL_POLY/25526.aspx

GROMACS | Implicit (5x) and explicit (2x) solvent | 165 ns/day (DHFR on 4x C2075s) | Released, multi-GPU, multi-node | Release 4.6; first multi-GPU support

LAMMPS | Lennard-Jones, Gay-Berne, Tersoff & many more potentials | 3.5-18x on Titan | Released, multi-GPU, multi-node | http://lammps.sandia.gov/bench.html#desktop and http://lammps.sandia.gov/bench.html#titan

NAMD | Full electrostatics with PME and most simulation features | 4.0 ns/day (F1-ATPase on 1x K20X) | Released, 100M-atom capable, multi-GPU, multi-node | NAMD 2.9

GPU Perf compared against a multi-core x86 CPU socket. GPU Perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.
New/Additional MD Applications Ramping

Application | Features Supported | GPU Perf | Release Status | Notes

Abalone | Simulations | 4-29x (on 1060 GPU) | Released, Version 1.8.51, single GPU | Agile Molecule, Inc.

Ascalaph | Computation of non-valent interactions | 4-29x (on 1060 GPU) | Released, Version 1.1.4, single GPU | Agile Molecule, Inc.

ACEMD | Written for use only on GPUs | 150 ns/day (DHFR on 1x K20) | Released, single and multi-GPU | Production bio-molecular dynamics (MD) software specially optimized to run on GPUs

Folding@Home | Powerful distributed-computing molecular dynamics system; implicit solvent and folding | Depends upon number of GPUs | Released, GPUs and CPUs | http://folding.stanford.edu; GPUs get 4x the points of CPUs

GPUGrid.net | High-performance all-atom biomolecular simulations; explicit solvent and binding | Depends upon number of GPUs | Released, NVIDIA GPUs only | http://www.gpugrid.net/

HALMD | Simple fluids and binary mixtures (pair potentials, high-precision NVE and NVT, dynamic correlations) | Up to 66x on 2090 vs. 1 CPU core | Released, Version 0.2.0, single GPU | http://halmd.org/benchmarks.html#supercooled-binary-mixture-kob-andersen

HOOMD-Blue | Written for use only on GPUs | Kepler 2x faster than Fermi | Released, Version 0.11.2, single and multi-GPU on 1 node | http://codeblue.umich.edu/hoomd-blue/; multi-GPU w/ MPI in March 2013

OpenMM | Implicit and explicit solvent, custom forces | Implicit: 127-213 ns/day, explicit: 18-55 ns/day (DHFR) | Released, Version 4.1.1, multi-GPU | Library and application for molecular dynamics on high-performance

GPU Perf compared against a multi-core x86 CPU socket. GPU Perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.
Built from Ground Up for GPUs
Computational Chemistry

What: Study disease & discover drugs; predict drug and protein interactions.

Why: Speed of simulations is critical. Enables study of longer timeframes, larger systems, and more simulations.

How: GPUs increase throughput & accelerate simulations.

GPU READY APPLICATIONS: Abalone, ACEMD, AMBER, DL_POLY, GAMESS, GROMACS, LAMMPS, NAMD, NWChem, Q-CHEM, Quantum Espresso, TeraChem

AMBER 11 Application: 4.6x performance increase with 2 GPUs with only a 54% added cost*

•    AMBER 11 Cellulose NPT on 2x E5670 CPUs + 2x Tesla C2090s (per node) vs. 2x E5670 CPUs (per node)
•    Cost of CPU node assumed to be $9333. Cost of adding two (2) C2090s to a single node is assumed to be $5333.
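The price/performance claim above can be sanity-checked with a quick calculation. The figures below are the slide's own assumptions, not measurements; note that the footnoted prices work out to roughly 57% added cost, close to the slide's 54% figure (which may use a different price base):

```python
# Back-of-envelope check of the AMBER 11 price/performance claim.
cpu_node_cost = 9333.0    # assumed dual-CPU node cost (slide footnote)
gpu_addon_cost = 5333.0   # assumed cost of adding 2x Tesla C2090 (slide footnote)
speedup = 4.6             # reported speedup with 2 GPUs (Cellulose NPT)

added_cost = gpu_addon_cost / cpu_node_cost  # added cost as a fraction of base node price
perf_per_dollar = speedup * cpu_node_cost / (cpu_node_cost + gpu_addon_cost)

print(f"added cost: {added_cost:.0%}")                                  # 57%
print(f"performance per dollar: {perf_per_dollar:.1f}x CPU-only node")  # 2.9x
```

In other words, even after paying for the GPUs, each dollar spent on the GPU-equipped node buys roughly 2.9x the simulation throughput of the CPU-only node.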
AMBER 12
GPU Support Revision 12.2 (1/22/2013)
Kepler - Our Fastest Family of GPUs Yet

[Bar chart: Factor IX throughput, nanoseconds/day]
1 CPU Node: 3.42
1 CPU Node + M2090: 11.85 (3.5x)
1 CPU Node + K10: 18.90 (5.6x)
1 CPU Node + K20: 22.44 (6.6x)
1 CPU Node + K20X: 25.39 (7.4x)

Running AMBER 12 GPU Support Revision 12.1. The blue node contains Dual E5-2687W CPUs (8 cores per CPU). The green nodes contain Dual E5-2687W CPUs (8 cores per CPU) and either 1x NVIDIA M2090, 1x K10, 1x K20, or 1x K20X for the GPU.

GPU speedup/throughput increased from 3.5x (with M2090) to 7.4x (with K20X) when compared to a CPU-only node.
K10 Accelerates Simulations of All Sizes

[Bar chart: speedup compared to CPU only]
CPU, all molecules: 1x (baseline)
TRPcage (GB): 2.00
JAC NVE (PME): 5.50
Factor IX NVE (PME): 5.53
Cellulose NVE (PME): 5.04
Myoglobin (GB): 19.98
Nucleosome (GB): 24.00

Running AMBER 12 GPU Support Revision 12.1. The blue node contains Dual E5-2687W CPUs (8 cores per CPU). The green nodes contain Dual E5-2687W CPUs (8 cores per CPU) and 1x NVIDIA K10 GPU.

Gain 24x performance by adding just 1 GPU when compared to dual-CPU performance.
K20 Accelerates Simulations of All Sizes

[Bar chart: speedup compared to CPU only]
CPU, all molecules: 1.00 (baseline)
TRPcage (GB): 2.66
JAC NVE (PME): 6.50
Factor IX NVE (PME): 6.56
Cellulose NVE (PME): 7.28
Myoglobin (GB): 25.56
Nucleosome (GB): 28.00

Running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off. The blue node contains 2x Intel E5-2687W CPUs (8 cores per CPU). Each green node contains 2x Intel E5-2687W CPUs (8 cores per CPU) plus 1x NVIDIA K20 GPU.

Gain 28x throughput/performance by adding just one K20 GPU when compared to dual-CPU performance.

AMBER Benchmark Report, Revision 2.0, dated Nov. 5, 2012
K20X Accelerates Simulations of All Sizes

[Bar chart: speedup compared to CPU only]
CPU, all molecules: 1x (baseline)
TRPcage (GB): 2.79
JAC NVE (PME): 7.15
Factor IX NVE (PME): 7.43
Cellulose NVE (PME): 8.30
Myoglobin (GB): 28.59
Nucleosome (GB): 31.30

Running AMBER 12 GPU Support Revision 12.1. The blue node contains Dual E5-2687W CPUs (8 cores per CPU). The green nodes contain Dual E5-2687W CPUs (8 cores per CPU) and 1x NVIDIA K20X GPU.

Gain 31x performance by adding just one K20X GPU when compared to dual-CPU performance.

AMBER Benchmark Report, Revision 2.0, dated Nov. 5, 2012
K10 Strong Scaling over Nodes

[Line chart: Cellulose 408K atoms (NPT), nanoseconds/day vs. number of nodes (1, 2, 4), CPU only vs. with GPU]
1 node: GPU nodes 5.1x faster than CPU only
2 nodes: 3.6x faster
4 nodes: 2.4x faster

Running AMBER 12 with CUDA 4.2, ECC off. The blue nodes contain 2x Intel X5670 CPUs (6 cores per CPU). The green nodes contain 2x Intel X5670 CPUs (6 cores per CPU) plus 2x NVIDIA K10 GPUs.

GPUs significantly outperform CPUs while scaling over multiple nodes.
Kepler – Universally Faster

[Bar chart: speedups compared to CPU only (0-8x) for JAC, Factor IX, and Cellulose on CPU + K10, CPU + K20, and CPU + K20X]

Running AMBER 12 GPU Support Revision 12.1. The CPU-only node contains Dual E5-2687W CPUs (8 cores per CPU). The Kepler nodes contain Dual E5-2687W CPUs (8 cores per CPU) and 1x NVIDIA K10, K20, or K20X GPU.

The Kepler GPUs accelerated all simulations, up to 8x.
K10 Extreme Performance

[Bar chart: JAC 23K atoms (NVE), nanoseconds/day]
1 CPU node: 12.47
1 node + 2x K10: 97.99

Running AMBER 12 GPU Support Revision 12.1. The blue node contains Dual E5-2687W CPUs (8 cores per CPU). The green node contains Dual E5-2687W CPUs (8 cores per CPU) and 2x NVIDIA K10 GPUs.

Gain 7.8x performance (DHFR) by adding just 2 GPUs when compared to dual-CPU performance.
K20 Extreme Performance

[Bar chart: DHFR JAC 23K atoms (NVE), nanoseconds/day]
1 CPU node: 12.47
1 node + 2x K20: 95.59

Running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off. The blue node contains 2x Intel E5-2687W CPUs (8 cores per CPU). Each green node contains 2x Intel E5-2687W CPUs (8 cores per CPU) plus 2x NVIDIA K20 GPUs.

Gain > 7.5x throughput/performance (DHFR) by adding just 2 K20 GPUs when compared to dual-CPU performance.

AMBER Benchmark Report, Revision 2.0, dated Nov. 5, 2012
Replace 8 Nodes with 1 K20 GPU

[Bar chart: DHFR throughput and cost]
8 CPU nodes: 65.00 ns/day at $32,000
1 node + K20: 81.09 ns/day at $6,500

Running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off. The eight (8) blue nodes each contain 2x Intel E5-2687W CPUs (8 cores per CPU). Each green node contains 2x Intel E5-2687W CPUs (8 cores per CPU) plus 1x NVIDIA K20 GPU.

Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.

Cut down simulation costs to ¼ and gain higher performance.

AMBER Benchmark Report, Revision 2.0, dated Nov. 5, 2012
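The figures above can be folded into a single throughput-per-dollar number. The prices are the slide's illustrative list prices (actual pricing varies by vendor), so this is a sketch, not a procurement calculation:

```python
# Throughput-per-dollar comparison using the slide's DHFR figures.
cpu_cluster = {"ns_per_day": 65.00, "cost": 32_000}  # 8 dual-CPU nodes (slide pricing)
gpu_node    = {"ns_per_day": 81.09, "cost": 6_500}   # 1 node + 1x K20 (slide pricing)

def throughput_per_dollar(cfg):
    """Nanoseconds of simulation per day, per dollar of hardware."""
    return cfg["ns_per_day"] / cfg["cost"]

ratio = throughput_per_dollar(gpu_node) / throughput_per_dollar(cpu_cluster)
print(f"GPU node delivers {ratio:.1f}x the throughput per dollar")  # 6.1x
```

By this measure the single K20-equipped node delivers about 6x the simulation throughput per hardware dollar of the eight-node CPU cluster, while also running the benchmark faster in absolute terms.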
Replace 7 Nodes with 1 K10 GPU

[Chart: Performance on JAC NVE (nanoseconds/day) and cost: CPU-only cluster $32,000 vs. GPU-enabled node $7,000; DHFR]

Running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off.
The eight blue nodes each contain 2x Intel E5-2687W CPUs (8 cores per CPU).
The green node contains 2x Intel E5-2687W CPUs (8 cores per CPU) plus 1x NVIDIA K10 GPU.
Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.

Cut down simulation costs to ¼ and increase performance by 70%.
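The cost claim above can be restated as dollars per unit of throughput. This is a hedged sketch: the $32,000 and $7,000 node prices come from the slide, but the ns/day figures are illustrative placeholders (the slide reports "70% more performance" rather than exact throughput numbers).

```python
# Hedged sketch: cost-per-throughput comparison for the JAC NVE slide.
# Dollar figures are from the slide; ns/day values are assumed placeholders.
def cost_per_ns_day(node_cost_usd, ns_per_day):
    """Dollars spent per nanosecond/day of sustained throughput."""
    return node_cost_usd / ns_per_day

cpu_cluster = cost_per_ns_day(32_000, 40.0)       # 8 CPU-only nodes (assumed rate)
gpu_node    = cost_per_ns_day(7_000, 40.0 * 1.7)  # 1 node + K10, ~70% faster

print(f"CPU cluster: ${cpu_cluster:.0f} per ns/day")
print(f"GPU node:    ${gpu_node:.0f} per ns/day")
print(f"Cost-efficiency gain: {cpu_cluster / gpu_node:.1f}x")
```

Under these assumptions the GPU node delivers each ns/day of throughput at roughly an eighth of the cost, which is how "¼ the cost" and "70% more performance" compound.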
Extra CPUs decrease Performance

[Chart: Cellulose NVE, nanoseconds/day; 1 CPU + 2 GPUs vs. 2 CPUs + 2 GPUs; CPU only vs. CPU with dual K20s]

Running AMBER 12 GPU Support Revision 12.1.
The orange bars contain one E5-2687W CPU (8 cores per CPU).
The blue bars contain dual E5-2687W CPUs (8 cores per CPU).

When used with GPUs, dual CPU sockets perform worse than single CPU sockets.
Kepler - Greener Science

[Chart: Energy used in simulating 1 ns of DHFR (JAC), kJ (lower is better): CPU Only, CPU + K10, CPU + K20, CPU + K20X]

Running AMBER 12 GPU Support Revision 12.1.
The blue node contains dual E5-2687W CPUs (150W each, 8 cores per CPU).
The green nodes contain dual E5-2687W CPUs (8 cores per CPU) and 1x NVIDIA K10, K20, or K20X GPU (235W each).

Energy Expended = Power x Time

The GPU-accelerated systems use 65-75% less energy.
Recommended GPU Node Configuration for AMBER Computational Chemistry
Workstation or Single Node Configuration

  # of CPU sockets              2
  Cores per CPU socket          4+ (1 CPU core drives 1 GPU)
  CPU speed (GHz)               2.66+
  System memory per node (GB)   16
  GPUs                          Kepler K10, K20, K20X; Fermi M2090, M2075, C2075
  # of GPUs per CPU socket      1-2 (4 GPUs on 1 socket is good for 4 fast serial GPU runs)
  GPU memory preference (GB)    6
  GPU to CPU connection         PCIe 2.0 x16 or higher
  Server storage                2 TB

Scale to multiple nodes with the same single-node configuration.
Benefits of GPU-Accelerated AMBER Computing
Faster than CPU-only systems in all tests
Most major compute-intensive aspects of classical MD ported
Large performance boost with marginal price increase
Energy usage cut by more than half
GPUs scale well within a node and over multiple nodes
K20 GPU is our fastest and lowest-power high-performance GPU yet

Try GPU accelerated AMBER for free – www.nvidia.com/GPUTestDrive
NAMD 2.9
Kepler - Our Fastest Family of GPUs Yet

[Chart: ApoA1 (Apolipoprotein A1), nanoseconds/day: 1 CPU node 1.37; + M2090 2.63 (1.9x); + K10 3.45 (2.5x); + K20 3.57 (2.6x); + K20X 4.00 (2.9x)]

Running NAMD version 2.9.
The blue node contains dual E5-2687W CPUs (8 cores per CPU).
The green nodes contain dual E5-2687W CPUs (8 cores per CPU) and either 1x NVIDIA M2090, 1x K10, 1x K20, or 1x K20X for the GPU.

GPU speedup/throughput increased from 1.9x (with M2090) to 2.9x (with K20X) when compared to a CPU-only node.

NAMD Benchmark Report, Revision 2.0, dated Nov. 5, 2012
Accelerates Simulations of All Sizes

[Chart: Speedup compared to CPU only for ApoA1, F1-ATPase, and STMV: 2.4x-2.7x]

Running NAMD 2.9 with CUDA 4.0, ECC off.
The blue node contains 2x Intel E5-2687W CPUs (8 cores per CPU).
Each green node contains 2x Intel E5-2687W CPUs (8 cores per CPU) plus 1x NVIDIA K20 GPU.

Gain 2.5x throughput/performance by adding just 1 GPU when compared to dual-CPU performance.
Kepler – Universally Faster

[Chart: Speedup compared to CPU only for F1-ATPase, ApoA1, and STMV; average acceleration: 1x K10 2.4x; 1x K20 2.6x; 1x K20X 2.9x; 2x K10 4.3x; 2x K20 4.7x; 2x K20X 5.1x]

Running NAMD version 2.9.
The CPU-only node contains dual E5-2687W CPUs (8 cores per CPU).
The Kepler nodes contain dual E5-2687W CPUs (8 cores per CPU) and 1 or 2 NVIDIA K10, K20, or K20X GPUs.

The Kepler GPUs accelerate all simulations, up to 5x (average acceleration shown in bars).
Outstanding Strong Scaling with Multi-STMV

[Chart: 100 STMV on hundreds of nodes; nanoseconds/day vs. node count (32-768) for Fermi XK6 vs. CPU XK6; GPU advantage from 3.8x at 32 nodes to 2.7x at 768 nodes; concatenation of 100 Satellite Tobacco Mosaic Virus]

Running NAMD version 2.9.
Each blue XE6 CPU node contains 1x AMD Opteron 1600 (16 cores per CPU).
Each green XK6 CPU+GPU node contains 1x AMD Opteron 1600 (16 cores per CPU) and an additional 1x NVIDIA X2090 GPU.

Accelerate your science by 2.7-3.8x when compared to CPU-based supercomputers.
Replace 3 Nodes with 1 M2090 GPU

[Chart: F1-ATPase; 4 CPU nodes: 0.63 ns/day at $8,000; 1 CPU node + 1x M2090 GPU: 0.74 ns/day at $4,000]

Running NAMD version 2.9.
Each blue node contains 2x Intel Xeon X5550 CPUs (4 cores per CPU).
The green node contains 2x Intel Xeon X5550 CPUs (4 cores per CPU) and 1x NVIDIA M2090 GPU.
Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.

Speedup of 1.2x for 50% of the cost.
K20 - Greener: Twice the Science per Watt

[Chart: Energy used in simulating 1 nanosecond of ApoA1, kJ (lower is better): 1 node vs. 1 node + 2x K20]

Running NAMD version 2.9.
Each blue node contains dual E5-2687W CPUs (95W, 4 cores per CPU).
Each green node contains 2x Intel Xeon X5550 CPUs (95W, 4 cores per CPU) and 2x NVIDIA K20 GPUs (225W per GPU).

Energy Expended = Power x Time

Cut down energy usage by ½ with GPUs.
Kepler - Greener: Twice the Science per Joule

[Chart: Energy used in simulating 1 ns of STMV (Satellite Tobacco Mosaic Virus), kJ (lower is better): CPU Only, CPU + 2 K10s, CPU + 2 K20s, CPU + 2 K20Xs]

Running NAMD version 2.9.
The blue node contains dual E5-2687W CPUs (150W each, 8 cores per CPU).
The green nodes contain dual E5-2687W CPUs (8 cores per CPU) and 2x NVIDIA K10, K20, or K20X GPUs (235W each).

Energy Expended = Power x Time

Cut down energy usage by ½ with GPUs.
Recommended GPU Node Configuration for NAMD Computational Chemistry
Workstation or Single Node Configuration

  # of CPU sockets               2
  Cores per CPU socket           6+
  CPU speed (GHz)                2.66+
  System memory per socket (GB)  32
  GPUs                           Kepler K10, K20, K20X; Fermi M2090, M2075, C2075
  # of GPUs per CPU socket       1-2
  GPU memory preference (GB)     6
  GPU to CPU connection          PCIe 2.0 or higher
  Server storage                 500 GB or higher
  Network configuration          Gemini, InfiniBand

Scale to multiple nodes with the same single-node configuration.
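A procurement checklist like the table above is easy to mechanize. This is a minimal sketch, not an official tool: the threshold values mirror the recommended NAMD configuration, while the function name and the sample `node` dict are made up for illustration.

```python
# Hedged sketch: validate a node spec against the recommended NAMD minimums.
# Thresholds mirror the table above; the sample `node` dict is hypothetical.
RECOMMENDED = {
    "cpu_sockets": 2,
    "cores_per_socket": 6,
    "cpu_ghz": 2.66,
    "memory_per_socket_gb": 32,
    "gpu_memory_gb": 6,
}

def meets_recommendation(node):
    """Return the spec names that fall below the recommended minimum."""
    return [k for k, minimum in RECOMMENDED.items() if node.get(k, 0) < minimum]

node = {"cpu_sockets": 2, "cores_per_socket": 8, "cpu_ghz": 2.9,
        "memory_per_socket_gb": 32, "gpu_memory_gb": 6}
print(meets_recommendation(node) or "node meets the recommended configuration")
```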
Summary/Conclusions
Benefits of GPU-Accelerated Computing
Faster than CPU-only systems in all tests
Large performance boost with small marginal price increase
Energy usage cut in half
GPUs scale very well within a node and over multiple nodes
Tesla K20 GPU is our fastest and lowest-power high-performance GPU to date

Try GPU accelerated NAMD for free – www.nvidia.com/GPUTestDrive
LAMMPS, Jan. 2013 or later
More Science for Your Money

[Chart: Embedded Atom Model, speedup compared to CPU only: 1x K10 1.7x; 1x K20 2.47x; 1x K20X 2.92x; 2x K10 3.3x; 2x K20 4.5x; 2x K20X 5.5x]

The blue node uses 2x E5-2687W (8 cores and 150W per CPU).
The green nodes have 2x E5-2687W and 1 or 2 NVIDIA K10, K20, or K20X GPUs (235W).

Experience performance increases of up to 5.5x with Kepler GPU nodes.
K20X, the Fastest GPU Yet

[Chart: Speedup relative to CPU alone: CPU Only, CPU + 2x M2090, CPU + K20X, CPU + 2x K20X]

The blue node uses 2x E5-2687W (8 cores and 150W per CPU).
The green nodes have 2x E5-2687W and either 2x NVIDIA M2090 or 1-2 K20X GPUs (235W).

Experience performance increases of up to 6.2x with Kepler GPU nodes.
One K20X performs as well as two M2090s.
Get a CPU Rebate to Fund Part of Your GPU Budget

[Chart: Acceleration in loop-time computation by additional GPUs, normalized to CPU only: 1 node + 1x M2090 5.31x; + 2x M2090 9.88x; + 3x M2090 12.9x; + 4x M2090 18.2x]

Running NAMD version 2.9.
The blue node contains dual X5670 CPUs (6 cores per CPU).
The green nodes contain dual X5570 CPUs (4 cores per CPU) and 1-4 NVIDIA M2090 GPUs.

Increase performance 18x when compared to CPU-only nodes.
Cheaper CPUs used with GPUs still deliver faster overall performance than more expensive CPUs alone!
Excellent Strong Scaling on Large Clusters

[Chart: LAMMPS Gay-Berne, 134M atoms; loop time (seconds) vs. node count (300-900) for GPU-accelerated XK6 vs. CPU-only XE6; speedups of 3.55x, 3.48x, and 3.45x across the range]

From 300-900 nodes, the NVIDIA GPU-powered XK6 maintained 3.5x performance compared to XE6 CPU nodes.
Each blue Cray XE6 node has 2x AMD Opteron CPUs (16 cores per CPU).
Each green Cray XK6 node has 1x AMD Opteron 1600 CPU (16 cores per CPU) and 1x NVIDIA X2090.
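The per-node-count speedups on this slide are simply ratios of CPU to GPU loop times at the same node count. This sketch shows the derivation; the loop times themselves are assumed placeholders chosen to reproduce speedups near the slide's 3.45x-3.55x figures, since the chart's exact values are not recoverable.

```python
# Hedged sketch: deriving strong-scaling speedups from loop times.
# The (node_count, XE6 seconds, XK6+GPU seconds) samples are assumed values.
def gpu_speedup(cpu_loop_time_s, gpu_loop_time_s):
    """Speedup at a fixed problem size and node count: CPU time / GPU time."""
    return cpu_loop_time_s / gpu_loop_time_s

samples = [(300, 560.0, 158.0), (600, 290.0, 83.0), (900, 200.0, 58.0)]
for nodes, cpu_t, gpu_t in samples:
    print(f"{nodes} nodes: {gpu_speedup(cpu_t, gpu_t):.2f}x")
```

Because both configurations scale at a similar rate here, the ratio stays nearly constant from 300 to 900 nodes, which is what "maintained 3.5x" means in strong-scaling terms.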
GPUs Sustain 5x Performance for Weak Scaling

[Chart: Weak scaling with 32K atoms per node; loop time (seconds) vs. node count (1-729); GPU speedup from 6.7x at small node counts to 4.8x at large node counts]

Performance of 4.8x-6.7x with GPU-accelerated nodes when compared to CPUs alone.
Each blue Cray XE6 node has 2x AMD Opteron CPUs (16 cores per CPU).
Each green Cray XK6 node has 1x AMD Opteron 1600 CPU (16 cores per CPU) and 1x NVIDIA X2090.
Faster, Greener — Worth It!

[Chart: Energy consumed in one loop of EAM, kJ (lower is better): 1 node, 1 node + 1x K20X, 1 node + 2x K20X]

GPU-accelerated computing uses 53% less energy than CPU only.

Energy Expended = Power x Time. Power calculated by combining the components' TDPs.

The blue node uses 2x E5-2687W (8 cores and 150W per CPU) and CUDA 4.2.9.
The green nodes have 2x E5-2687W and 1 or 2 NVIDIA K20X GPUs (235W) running CUDA 5.0.36.

Try GPU accelerated LAMMPS for free – www.nvidia.com/GPUTestDrive
Molecular Dynamics with LAMMPS
 on a Hybrid Cray Supercomputer
                    W. Michael Brown
        National Center for Computational Sciences
              Oak Ridge National Laboratory

      NVIDIA Technology Theater, Supercomputing 2012
                     November 14, 2012
Early Kepler Benchmarks on Titan

[Chart: Atomic Fluid; loop time (s, log scale) vs. node count for XK7+GPU, XK6+GPU, and XK6]

[Chart: Bulk Copper; loop time (s, log scale) vs. node count for XK7+GPU, XK6+GPU, and XK6]
                                                                                                  25




                                                                                                 38
                                   1   2   4   8   16 32 64 128




                                                                                                40
                                                                                                10


                                                                                               16
Early Kepler Benchmarks on Titan
[Chart: Protein, wall-clock time (s, log scale) vs. nodes (1-128); series: XK6, XK6+GPU, XK7+GPU]
[Chart: Liquid Crystal, wall-clock time (s, log scale) vs. nodes (1-128); series: XK6, XK6+GPU, XK7+GPU]
Early Titan XK6/XK7 Benchmarks
Speedup with Acceleration on XK6/XK7 Nodes
(1 Node = 32K Particles; 900 Nodes = 29M Particles)

                  Atomic Fluid     Atomic Fluid     Bulk Copper   Protein   Liquid Crystal
                  (cutoff = 2.5σ)  (cutoff = 5.0σ)
XK6 (1 Node)           1.92             4.33            2.12        2.60         5.82
XK7 (1 Node)           2.90             8.38            3.66        3.36        15.70
XK6 (900 Nodes)        1.68             3.96            2.15        1.56         5.60
XK7 (900 Nodes)        2.75             7.48            2.86        1.95        10.14
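The speedups above are simply ratios of wall-clock times between CPU-only and accelerated runs of the same problem. A minimal sketch of how such figures are derived (the helper names are ours, not part of LAMMPS):

```python
def speedup(t_cpu_only: float, t_accelerated: float) -> float:
    """Speedup of an accelerated run over a CPU-only run of the same job."""
    return t_cpu_only / t_accelerated

def scaling_retention(speedup_1_node: float, speedup_n_nodes: float) -> float:
    """Fraction of the single-node speedup retained at scale."""
    return speedup_n_nodes / speedup_1_node

# Liquid Crystal row from the table: 15.70x on 1 XK7 node, 10.14x on 900 nodes.
print(round(scaling_retention(15.70, 10.14), 2))  # 0.65: ~65% of the speedup survives at 900 nodes
```

Comparing the 1-node and 900-node rows this way shows which benchmarks keep their acceleration as the run is scaled out.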
Recommended GPU Node Configuration for LAMMPS Computational Chemistry

Workstation or Single Node Configuration
  # of CPU sockets                   2
  Cores per CPU socket               6+
  CPU speed (GHz)                    2.66+
  System memory per socket (GB)      32
  GPUs                               Kepler K10, K20, K20X; Fermi M2090, M2075, C2075
  # of GPUs per CPU socket           1-2
  GPU memory preference (GB)         6
  GPU to CPU connection              PCIe 2.0 or higher
  Server storage                     500 GB or higher
  Network configuration              Gemini, InfiniBand

Scale to multiple nodes with the same single-node configuration.
GROMACS 4.6 Final, Pre-Beta, and Beta
GPU Accelerated Computational Chemistry Applications
Great Scaling in Small Systems

[Chart: Nanoseconds/Day vs. number of nodes (1-3), CPU-only vs. with GPU; GPU-accelerated nodes reach 8.36, 13.01, and 21.68 ns/day on 1, 2, and 3 nodes, a 3.2-3.7x speedup over CPU-only nodes]

Running GROMACS 4.6 pre-beta with CUDA 4.1.
Each blue node contains 1x Intel X5550 CPU (95W TDP, 4 cores per CPU).
Each green node contains 1x Intel X5550 CPU (95W TDP, 4 cores per CPU) and 1x NVIDIA M2090 (225W TDP per GPU).
Benchmark system: RNase in water with 16,816 atoms in a truncated dodecahedron box.

Get up to 3.7x the performance of CPU-only nodes.
Additional Strong Scaling on a Larger System

128K Water Molecules

[Chart: Nanoseconds/Day vs. number of nodes (8-128), CPU-only vs. with GPU; GPU speedups range from 3.1x at 8 nodes and 2.8x at 32 nodes to 2x at 128 nodes]

Running GROMACS 4.6 pre-beta with CUDA 4.1.
Each blue node contains 1x Intel X5670 (95W TDP, 6 cores per CPU).
Each green node contains 1x Intel X5670 (95W TDP, 6 cores per CPU) and 1x NVIDIA M2070 (225W TDP per GPU).

Up to 128 nodes, NVIDIA GPU-accelerated nodes deliver 2-3x the performance of CPU-only nodes.
Replace 3 Nodes with 2 GPUs

ADH in Water (134K Atoms)

[Chart: 4 CPU nodes deliver 6.7 ns/day at $8,000; 1 node + 2x M2090 delivers 8.36 ns/day at $6,500]

Running GROMACS 4.6 pre-beta with CUDA 4.1.
The blue node contains 2x Intel X5550 CPUs (95W TDP, 4 cores per CPU).
The green node contains 2x Intel X5550 CPUs (95W TDP, 4 cores per CPU) and 2x NVIDIA M2090 GPUs (225W TDP per GPU).
Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.

Save thousands of dollars and perform 25% faster.
Greener Science

ADH in Water (134K Atoms)

[Chart: Energy expended (kilojoules) per simulated nanosecond, lower is better: 4 nodes (760 watts) vs. 1 node + 2x M2090 (640 watts)]

Running GROMACS 4.6 with CUDA 4.1.
The blue nodes contain 2x Intel X5550 CPUs (95W TDP, 4 cores per CPU).
The green node contains 2x Intel X5550 CPUs (95W TDP, 4 cores per CPU) and 2x NVIDIA M2090 GPUs (225W TDP per GPU).

Energy Expended = Power x Time

In simulating each nanosecond, the GPU-accelerated system uses 33% less energy.
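The Energy Expended = Power x Time relation can be checked directly against this benchmark. A small sketch (the function name is ours; wattages come from this slide, and the ns/day throughputs from the preceding ADH slide):

```python
SECONDS_PER_DAY = 86_400

def energy_per_ns_kj(power_watts: float, ns_per_day: float) -> float:
    """Kilojoules consumed to simulate one nanosecond at a given throughput."""
    seconds_per_ns = SECONDS_PER_DAY / ns_per_day
    return power_watts * seconds_per_ns / 1000.0  # W * s = J; / 1000 -> kJ

cpu_only = energy_per_ns_kj(760, 6.7)    # 4 CPU nodes: ~9,800 kJ per ns
gpu_node = energy_per_ns_kj(640, 8.36)   # 1 node + 2x M2090: ~6,600 kJ per ns
print(f"GPU system uses {1 - gpu_node / cpu_only:.0%} less energy")  # ~33%
```

The GPU node draws less power and finishes each nanosecond sooner, so both factors of the product shrink.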
The Power of Kepler

RNase Solvated Protein, 24K Atoms (Ribonuclease)

[Chart: Nanoseconds/Day for 1 or 2 CPUs paired with 1 or 2 GPUs, M2090 vs. K20X]

Running GROMACS version 4.6 beta.
The grey nodes contain 1 or 2 E5-2687W CPUs (150W each, 8 cores per CPU) and 1 or 2 NVIDIA M2090s.
The green nodes contain 1 or 2 E5-2687W CPUs (8 cores per CPU) and 1 or 2 NVIDIA K20X GPUs (235W each).

Upgrading an M2090 to a K20X increases performance 10-45%.
K20X – Fast

RNase Solvated Protein, 24K Atoms (Ribonuclease)

[Chart: Nanoseconds/Day for 1 or 2 CPUs, CPU-only vs. with 1 K20X]

Running GROMACS version 4.6 beta.
The blue nodes contain 1 or 2 E5-2687W CPUs (150W each, 8 cores per CPU).
The green nodes contain 1 or 2 E5-2687W CPUs (8 cores per CPU) and 1 or 2 NVIDIA K20X GPUs (235W each).

Adding a K20X increases performance by up to 3x.
K20X, the Fastest Yet

192K Water Molecules

[Chart: Nanoseconds/Day for CPU-only, CPU + K20X, and CPU + 2x K20X]

Running GROMACS version 4.6-beta2 and CUDA 5.0.35.
The blue node contains 2 E5-2687W CPUs (150W each, 8 cores per CPU).
The green nodes contain 2 E5-2687W CPUs (8 cores per CPU) and 1 or 2 NVIDIA K20X GPUs (235W each).

Using K20X nodes increases performance by 2.5x.

Try GPU-accelerated GROMACS 4.6 for free: www.nvidia.com/GPUTestDrive
Recommended GPU Node Configuration for GROMACS Computational Chemistry

Workstation or Single Node Configuration
  # of CPU sockets                   2
  Cores per CPU socket               6+
  CPU speed (GHz)                    2.66+
  System memory per socket (GB)      32
  GPUs                               Kepler K10, K20, K20X; Fermi M2090, M2075, C2075
  # of GPUs per CPU socket           1x (Kepler-based GPUs need a fast Sandy Bridge, the very fastest Westmeres, or high-end AMD Opterons)
  GPU memory preference (GB)         6
  GPU to CPU connection              PCIe 2.0 or higher
  Server storage                     500 GB or higher

Scale to multiple nodes with the same single-node configuration.
CHARMM Release C37b1

GPUs Outperform CPUs

Daresbury Crambin, 19.6K Atoms

[Chart: Nanoseconds/Day for 44x X5667 ($44,000) vs. 2x X5667 + 1x C2070 ($3,000) vs. 2x X5667 + 2x C2070 ($4,000)]

Running CHARMM release C37b1.
The blue bar represents 44 X5667 CPUs (95W, 4 cores per CPU).
The green bars represent 2 X5667 CPUs with 1 or 2 NVIDIA C2070 GPUs (238W each).
Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.

1 GPU = 15 CPUs
More Bang for Your Buck

Daresbury Crambin, 19.6K Atoms

[Chart: Scaled performance/price for 44x X5667, 2x X5667 + 1x C2070, and 2x X5667 + 2x C2070]

Running CHARMM release C37b1.
The blue bar represents 44 X5667 CPUs (95W, 4 cores per CPU).
The green bars represent 2 X5667 CPUs with 1 or 2 NVIDIA C2070 GPUs (238W each).
Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.

Using GPUs delivers 10.6x the performance for the same cost.
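The scaled performance/price metric charted above divides each configuration's throughput-per-dollar by that of the all-CPU baseline. A sketch of the arithmetic (the throughput numbers below are hypothetical placeholders; the prices are from the preceding slide):

```python
def scaled_perf_per_price(ns_per_day: float, price_usd: float,
                          base_ns_per_day: float, base_price_usd: float) -> float:
    """Throughput per dollar, normalized so the baseline scores 1.0."""
    return (ns_per_day / price_usd) / (base_ns_per_day / base_price_usd)

# Baseline: 44x X5667 at $44,000. Hypothetical: a GPU node delivering 3/4 of
# the baseline throughput at $3,000 would still score 11x on this metric.
print(scaled_perf_per_price(60, 44_000, 60, 44_000))           # 1.0
print(round(scaled_perf_per_price(45, 3_000, 60, 44_000), 6))  # 11.0
```

The normalization makes the chart easy to read: any bar above 1.0 beats the CPU cluster per dollar spent.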
Greener Science with NVIDIA

Energy Used in Simulating 1 ns of Daresbury G1nBP, 61.2K Atoms

[Chart: Energy expended (kJ), lower is better, for 64x X5667, 2x X5667 + 1x C2070, and 2x X5667 + 2x C2070]

Running CHARMM release C37b1.
The blue bar represents 64 X5667 CPUs (95W, 4 cores per CPU).
The green bars represent 2 X5667 CPUs with 1 or 2 NVIDIA C2070 GPUs (238W each).
Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.

Energy Expended = Power x Time

Using GPUs decreases energy use by 75%.
ACEMD (www.acellera.com)

470 ns/day on 1 GPU for L-Iduronic acid (1,362 atoms)
116 ns/day on 1 GPU for DHFR (23K atoms)

M. Harvey, G. Giupponi and G. De Fabritiis, "ACEMD: Accelerated molecular dynamics simulations in the microseconds timescale," J. Chem. Theory Comput. 5, 1632 (2009)

Features: NVT, NPT, PME, TCL, PLUMED, CAMSHIFT¹

1. M. J. Harvey and G. De Fabritiis, "An implementation of the smooth particle-mesh Ewald (PME) method on GPU hardware," J. Chem. Theory Comput. 5, 2371-2377 (2009)
2. For a list of selected references see http://www.acellera.com/acemd/publications
Quantum Chemistry Applications

Abinit
  Features: Local Hamiltonian, non-local Hamiltonian, LOBPCG algorithm, diagonalization/orthogonalization
  GPU Perf: 1.3-2.7X
  Status: Released since version 6.12; multi-GPU support
  Notes: www.abinit.org

ACES III
  Features: Integrating scheduling GPU into SIAL programming language and SIP runtime environment
  GPU Perf: 10X on kernels
  Status: Under development; multi-GPU support
  Notes: http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/deumens_ESaccel_2012.pdf

ADF
  Features: Fock matrix, Hessians
  GPU Perf: TBD
  Status: Pilot project completed, under development; multi-GPU support
  Notes: www.scm.com

BigDFT
  Features: DFT; Daubechies wavelets; part of Abinit
  GPU Perf: 5-25X (1 CPU core to GPU kernel)
  Status: Released June 2009, current release 1.6; multi-GPU support
  Notes: http://inac.cea.fr/L_Sim/BigDFT/news.html, http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/BigDFT-Formalism.pdf and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/BigDFT-HPC-tues.pdf

Casino
  Features: TBD
  GPU Perf: TBD
  Status: Under development, Spring 2013 release; multi-GPU support
  Notes: http://www.tcm.phy.cam.ac.uk/~mdt26/casino.html

CP2K
  Features: DBCSR (sparse matrix multiply library)
  GPU Perf: 2-7X
  Status: Under development; multi-GPU support
  Notes: http://www.olcf.ornl.gov/wp-content/training/ascc_2012/friday/ACSS_2012_VandeVondele_s.pdf

GPU Perf compared against a multi-core x86 CPU socket. GPU Perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.
Quantum Chemistry Applications
 Application | Features Supported | GPU Perf | Release Status | Notes

 GAMESS-UK | (ss|ss) type integrals within calculations using Hartree-Fock ab initio methods and density functional theory. Supports organics & inorganics. | 8x | Released Summer 2012, Multi-GPU support | http://www.ncbi.nlm.nih.gov/pubmed/21541963

 Gaussian | Joint PGI, NVIDIA & Gaussian collaboration | TBD | Under development, Multi-GPU support | Announced Aug. 29, 2011: http://www.gaussian.com/g_press/nvidia_press.htm

 GPAW | Electrostatic Poisson equation, orthonormalizing of vectors, residual minimization method (RMM-DIIS) | 8x | Released, Multi-GPU support | https://wiki.fysik.dtu.dk/gpaw/devel/projects/gpu.html, Samuli Hakala (CSC Finland) & Chris O'Grady (SLAC)

 Jaguar | Investigating GPU acceleration | TBD | Under development, Multi-GPU support | Schrodinger, Inc. http://www.schrodinger.com/kb/278

 LSMS | Generalized Wang-Landau method | 3x with 32 GPUs vs. 32 (16-core) CPUs | Under development, Multi-GPU support | NICS Electronic Structure Determination Workshop 2012: http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Eisenbach_OakRidge_February.pdf

 MOLCAS | CUBLAS support | 1.1x | Released, Version 7.8; single GPU, additional GPU support coming in Version 8 | www.molcas.org

 MOLPRO | Density-fitted MP2 (DF-MP2), density-fitted local correlation | 1.7-2.3X | Under development | www.molpro.net
 MOPAC2009 | Pseudodiagonalization, full diagonalization, and density matrix assembling | 3.8-14X | Under development, Single GPU | Academic port. http://openmopac.net

 NWChem | Triples part of Reg-CCSD(T), CCSD & EOMCCSD task schedulers | 3-10X projected | Release targeting end of 2012, Multiple GPUs | Development GPGPU benchmarks: www.nwchem-sw.org and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Krishnamoorthy-ESCMA12.pdf

 Octopus | DFT and TDDFT | TBD | Released | http://www.tddft.org/programs/octopus/

 PEtot | Density functional theory (DFT) plane-wave pseudopotential calculations | 6-10X | Released, Multi-GPU | First-principles materials code that computes the behavior of the electron structures of materials

 Q-CHEM | RI-MP2 | 8x-14x | Released, Version 4.0 | http://www.q-chem.com/doc_for_web/qchem_manual_4.0.pdf
 QMCPACK | Main features | 3-4x | Released, Multiple GPUs | NCSA, University of Illinois at Urbana-Champaign: http://cms.mcc.uiuc.edu/qmcpack/index.php/GPU_version_of_QMCPACK

 Quantum Espresso/PWscf | PWscf package: linear algebra (matrix multiply), explicit computational kernels, 3D FFTs | 2.5-3.5x | Released, Version 5.0, Multiple GPUs | Created by Irish Centre for High-End Computing: http://www.quantum-espresso.org/index.php and http://www.quantum-espresso.org/

 TeraChem | "Full GPU-based solution" | 44-650X vs. GAMESS CPU version | Released, Version 1.5, Multi-GPU/single node | Completely redesigned to exploit GPU parallelism. YouTube: http://youtu.be/EJODzk6RFxE?hd=1 and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Luehr-ESCMA.pdf

 VASP | Hybrid Hartree-Fock DFT functionals including exact exchange | 2x (2 GPUs comparable to 128 CPU cores) | Available on request, Multiple GPUs | By Carnegie Mellon University: http://arxiv.org/pdf/1111.0716.pdf
BigDFT
GPU Accelerated Computational Chemistry Applications
CP2K
Kepler, it's faster

[Chart: CP2K performance relative to CPU only for CPU + K10, K20, K20X and CPU + 2x K10, 2x K20, 2x K20X nodes]

Running CP2K version 12413-trunk on CUDA 5.0.36.
The blue node contains 2 E5-2687W CPUs (150W, 8 cores per CPU).
The green nodes contain 2 E5-2687W CPUs and 1 or 2 NVIDIA K10, K20, or K20X GPUs (235W each).

Using GPUs delivers up to 12.6x the performance per node.
Strong Scaling

[Chart: speedup relative to 256 non-GPU cores, XK6 with GPUs vs. XK6 without GPUs: 2.3x at 256 cores, 2.9x at 512 cores, 3x at 768 cores]

Conducted on a Cray XK6 using matrix-matrix multiplication, NREP=6 and N=159,000 with 50% occupation.

Speedups increase as more cores are added, up to 3x at 768 cores.
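One way to read the scaling data is as parallel efficiency relative to the 256-core baseline. A minimal sketch using the approximate speedups read off the chart (treat the figures as illustrative):

```python
# Speedup of the GPU-accelerated XK6 run relative to 256 non-GPU cores,
# read approximately from the chart above.
gpu_speedup = {256: 2.3, 512: 2.9, 768: 3.0}

def scaling_efficiency(cores, base_cores=256):
    """Fraction of ideal linear scaling retained when growing from base_cores."""
    ideal = cores / base_cores                          # perfect scaling factor
    actual = gpu_speedup[cores] / gpu_speedup[base_cores]
    return actual / ideal

for n in (256, 512, 768):
    print(n, round(scaling_efficiency(n), 2))
```

The absolute speedup keeps growing with core count, which is the slide's point, even though efficiency per added core declines as it does in most strong-scaling runs.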
Kepler, keeping the planet Green

[Chart: energy expended (kJ), lower is better, for CPU only, CPU + K20, and CPU + 2x K20 nodes]

Running CP2K version 12413-trunk on CUDA 5.0.36.
The blue node contains 2 E5-2687W CPUs (150W, 8 cores per CPU).
The green nodes contain 2 E5-2687W CPUs and 1 or 2 NVIDIA K20 GPUs (235W each).
Energy expended = power x time.

Using K20s will lower energy use by over 75% for the same simulation.
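The relation energy = power x time is why a higher-wattage GPU node can still expend less energy: runtime drops faster than power rises. A minimal sketch with assumed, illustrative wattages and runtimes (not the measured CP2K figures):

```python
def energy_kj(power_watts, runtime_s):
    # Energy (kJ) = power (kW) x time (s)
    return power_watts / 1000.0 * runtime_s

# Assumed illustrative figures: a 300 W CPU-only node running for 1000 s
# vs. a 300 W + 2x 235 W GPU node finishing the same job 5x faster.
cpu_only = energy_kj(300, 1000)           # 300 kJ
cpu_gpu = energy_kj(300 + 2 * 235, 200)   # 154 kJ
print(cpu_only, cpu_gpu)
```

Here the GPU node draws more than twice the power yet expends roughly half the energy, the same shape of result the chart reports.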
GAUSSIAN
Gaussian
    Key quantum chemistry code
    ACS Fall 2011 press release
        Joint collaboration between Gaussian, NVIDIA and PGI for GPU
        acceleration: http://www.gaussian.com/g_press/nvidia_press.htm
        No such release exists for Intel MIC or AMD GPUs
    Mike Frisch quote:
        "Calculations using Gaussian are limited primarily by the available computing
        resources," said Dr. Michael Frisch, president of Gaussian, Inc. "By coordinating the
        development of hardware, compiler technology and application software among the
        three companies, the new application will bring the speed and cost-effectiveness of
        GPUs to the challenging problems and applications that Gaussian's customers need to
        address."

NVIDIA Confidential
GAMESS
GAMESS Partnership Overview
 Mark Gordon and Andrey Asadchev, key developers of GAMESS,
 in collaboration with NVIDIA. Mark Gordon is a recipient of an
 NVIDIA Professor Partnership Award.

 Quantum chemistry is one of the major consumers of CPU cycles at
 national supercomputer centers.

 NVIDIA developer resources fully allocated to GAMESS code.
        "We like to push the envelope as much as we can in the direction of highly scalable efficient
        codes. GPU technology seems like a good way to achieve this goal. Also, since we are
        associated with a DOE Laboratory, energy efficiency is important, and this is another reason
        to explore quantum chemistry on GPUs."
                                                                                    Prof. Mark Gordon
                                  Distinguished Professor, Department of Chemistry, Iowa State University and
                                          Director, Applied Mathematical Sciences Program, AMES Laboratory
GAMESS August 2011 GPU Performance
First GPU-supported GAMESS release via "libqc", a library for fast quantum
chemistry on multiple NVIDIA GPUs in multiple nodes, implemented with CUDA:
2e- AO integrals and their assembly into a closed-shell Fock matrix.

[Chart: performance of the GAMESS Aug. 2011 release for two small molecules,
Ginkgolide (53 atoms) and Vancomycin (176 atoms), comparing 4x E5640 CPUs
against 4x E5640 CPUs + 4x Tesla C2070s]
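For context on what libqc computes: a closed-shell Fock matrix is assembled from the one-electron integrals plus Coulomb and exchange contractions of the two-electron AO integrals, F = H + 2J - K. A dense NumPy sketch of that contraction (illustrative only; the actual library evaluates screened integral batches on the GPU):

```python
import numpy as np

def closed_shell_fock(hcore, eri, density):
    """F = Hcore + 2J - K for a closed-shell system.

    hcore:   (n, n) one-electron integrals
    eri:     (n, n, n, n) two-electron AO integrals, chemists' notation (mu nu|lam sig)
    density: (n, n) one-particle density matrix
    """
    coulomb = np.einsum("mnls,ls->mn", eri, density)   # J_mn = (mn|ls) P_ls
    exchange = np.einsum("mlns,ls->mn", eri, density)  # K_mn = (ml|ns) P_ls
    return hcore + 2.0 * coulomb - exchange

# Tiny random symmetric example just to exercise the contraction shapes.
n = 4
rng = np.random.default_rng(0)
h = rng.standard_normal((n, n)); h = h + h.T
eri = rng.standard_normal((n, n, n, n))
p = rng.standard_normal((n, n)); p = p + p.T
f = closed_shell_fock(h, eri, p)
print(f.shape)
```

The two einsum contractions over the four-index integral tensor are the O(n^4) work per Fock build that makes GPU acceleration of this step worthwhile.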
Upcoming GAMESS Q4 2012 Release
 Multi-node with multi-GPU supported
 Rys Quadrature
 Hartree-Fock
    8 CPU cores + M2070 yields a 2.3-2.9x speedup over 8 CPU cores alone.
    See 2012 publication
 Møller-Plesset perturbation theory (MP2):
    Preliminary code completed
    Paper in development
 Coupled Cluster SD(T):
    CCSD code completed, (T) in progress
GAMESS - New Multithreaded Hybrid CPU/GPU Approach to H-F

[Chart: Hartree-Fock GPU speedups* from adding one 2070 GPU: Taxol 6-31G 2.3x,
Taxol 6-31G(d) 2.5x, Taxol 6-31G(2d,2p) 2.5x, Taxol 6-31++G(d,p) 2.3x,
Valinomycin 6-31G 2.4x, Valinomycin 6-31G(d) 2.3x, Valinomycin 6-31G(2d,2p) 2.9x]

Adding 1x 2070 GPU speeds up computations by 2.3x to 2.9x.

* A. Asadchev, M.S. Gordon, "New Multithreaded Hybrid CPU/GPU Approach to
Hartree-Fock," Journal of Chemical Theory and Computation (2012)
GPAW
Used with permission from Samuli Hakala.
NWChem
NWChem - Speedup of the non-iterative calculation for various configurations/tile sizes

System: cluster consisting of dual-socket nodes constructed from:
• 8-core AMD Interlagos processors
• 64 GB of memory
• Tesla M2090 (Fermi) GPUs

The nodes are connected using a high-performance QDR InfiniBand interconnect.

Courtesy of Kowalski, K., Bhaskaran-Nair, et al. @ PNNL, JCTC (submitted)
Quantum Espresso/PWscf
Kepler, fast science

[Chart: AUsurf performance relative to CPU only for CPU + M2090, CPU + K10, CPU + 2x M2090, and CPU + 2x K10 nodes]

Running Quantum Espresso version 5.0-build7 on CUDA 5.0.36.
The blue node contains 2 E5-2687W CPUs (150W, 8 cores per CPU).
The green nodes contain 2 E5-2687W CPUs and 1 or 2 NVIDIA M2090 or K10 GPUs (225W and 235W respectively).

Using K10s delivers up to 11.7x the performance per node over CPUs,
and 1.7x the performance when compared to M2090s.
Extreme Performance/Price from 1 GPU

[Chart: price and performance scaled to the CPU-only system for the Shilu-3 and Water-on-Calcite benchmarks; calcite structure pictured]

Simulations run on FERMI @ ICHEC.
A 6-core 2.66 GHz Intel X5650 was used for the CPU.
An NVIDIA C2050 was used for the GPU.

Adding a GPU can improve performance by 3.7x while only increasing price by 25%.
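The price/performance claim reduces to simple arithmetic; a sketch using the slide's approximate figures (3.7x the speed for 1.25x the price):

```python
def perf_per_price(speedup, price_factor):
    # Performance delivered per unit of system cost, both relative to CPU only.
    return speedup / price_factor

cpu_only = perf_per_price(1.0, 1.0)
cpu_gpu = perf_per_price(3.7, 1.25)
print(round(cpu_gpu / cpu_only, 2))  # ~2.96x better performance per dollar
```

The same calculation applies to the other benchmark slides: divide the speedup by the price multiplier to get the performance-per-dollar gain.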
Extreme Performance/Price from 1 GPU

[Chart: price and performance scaled to the CPU-only system for AUSURF112 k-point and AUSURF112 gamma-point calculations]

Simulations run on FERMI @ ICHEC.
A 6-core 2.66 GHz Intel X5650 was used for the CPU.
An NVIDIA C2050 was used for the GPU.

Calculation done for a gold surface of 112 atoms.

Adding a GPU can improve performance by 3.5x while only increasing price by 25%.
Replace 72 CPUs with 8 GPUs

[Chart: LSMO-BFO (120 atoms), 8 k-points; elapsed time of 223 minutes on 120 CPUs ($42,000) vs. 219 minutes on 48 CPUs + 8 GPUs ($32,800)]

Simulations run on PLX @ CINECA.
Intel 6-core 2.66 GHz X5550s were used for the CPUs.
NVIDIA M2070s were used for the GPUs.

The GPU-accelerated setup performs faster and costs $9,200 (about 22%) less.
QE/PWscf - Green Science

[Chart: power consumption (watts), lower is better, LSMO-BFO (120 atoms) 8 k-points: 120 CPUs ($42,000) vs. 48 CPUs + 8 GPUs ($32,800)]

Simulations run on PLX @ CINECA.
Intel 6-core 2.66 GHz X5550s were used for the CPUs.
NVIDIA M2070s were used for the GPUs.

Over a year, the lower power consumption would save $4300 on energy bills.
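The annual figure follows from the power gap between the two configurations; a sketch assuming a roughly 5 kW difference (read approximately off the chart) and a hypothetical $0.10/kWh electricity rate, neither of which is stated on the slide:

```python
def annual_savings(delta_kw, usd_per_kwh=0.10, hours_per_year=8760):
    # Savings = power difference (kW) x hours of continuous operation x rate.
    # Both delta_kw and the rate are illustrative assumptions here.
    return delta_kw * hours_per_year * usd_per_kwh

print(round(annual_savings(5.0)))  # ~$4380/year, consistent with the ~$4300 claim
```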
NVIDIA GPUs Use Less Energy

[Chart: energy consumption (kWh) on different tests, lower is better, CPU only vs. CPU+GPU: -57% for Shilu-3, -58% for AUSURF112, -54% for Water-on-Calcite]

Simulations run on FERMI @ ICHEC.
A 6-core 2.66 GHz Intel X5650 was used for the CPU.
An NVIDIA C2050 was used for the GPU.

In all tests, the GPU-accelerated system consumed less than half the energy of the CPU-only system.
QE/PWscf - Great Strong Scaling in Parallel

[Chart: CdSe-159 walltime of one full SCF, lower is better, CPU vs. CPU+GPU from 2 nodes (16 cores) to 14 nodes (112 cores); speedups of 2.1-2.5x throughout, with 2.5x at 2 nodes]

Simulations run on STONEY @ ICHEC.
Two quad-core 2.87 GHz Intel X5560s were used in each node.
Two NVIDIA M2090s were used in each node for the CPU+GPU test.
159-atom cadmium selenide nanodots.

Speedups up to 2.5x with GPU acceleration.
QE/PWscf - More Powerful Strong Scaling

[Chart: GeSnTe134 walltime of 1 full SCF (lower is better), CPU vs. CPU+GPU, from 4 nodes (48 cores) to 44 nodes (528 cores); GPU speedups of 1.6x, 2.3x, 2.4x, and 2.1x are labeled]

Simulations run on PLX @ CINECA.
Two 6-core 2.4 GHz Intel E5645s were used in each node.
Two NVIDIA M2070s were used in each node for the CPU+GPU test.

Accelerate your cluster by up to 2.1x with NVIDIA GPUs

Try GPU accelerated Quantum Espresso for free – www.nvidia.com/GPUTestDrive
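The speedup labels on these charts are simply the ratio of CPU-only to CPU+GPU walltime at each node count. A minimal sketch of that arithmetic (the walltimes below are illustrative placeholders, not the measured ICHEC/CINECA data):

```python
# GPU speedup at each node count: speedup = t_cpu / t_cpu_gpu.
# Walltimes here are illustrative placeholders, not the measured benchmark data.
def speedups(t_cpu, t_gpu):
    """Map node count -> speedup of the CPU+GPU run over the CPU-only run."""
    return {n: t_cpu[n] / t_gpu[n] for n in t_cpu}

t_cpu = {2: 30000, 8: 11000, 14: 6600}   # seconds, CPU-only
t_gpu = {2: 12000, 8: 5000, 14: 3000}    # seconds, CPU+GPU

for nodes, s in speedups(t_cpu, t_gpu).items():
    print(f"{nodes} nodes: {s:.1f}x")
```

The same ratio applied to the measured walltimes yields the 2.5x and 2.1x figures quoted on the slides.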
TeraChem
Supercomputer Speeds on GPUs

[Chart: time for one SCF step in seconds (lower is better); 4096 quad-core CPUs ($19,000,000) vs. 8 C2050s ($31,000)]

TeraChem running on 8 C2050s in 1 node.
NWChem running on 4096 quad-core CPUs in the Chinook supercomputer.
Benchmark: giant fullerene C240 molecule.

Similar performance from just a handful of GPUs
TeraChem
Bang for the Buck

[Chart: performance/price relative to the supercomputer; 4096 quad-core CPUs ($19,000,000) = 1 vs. 8 C2050s ($31,000) = 493]

TeraChem running on 8 C2050s in 1 node.
NWChem running on 4096 quad-core CPUs in the Chinook supercomputer.
Benchmark: giant fullerene C240 molecule.

Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.

Dollars spent on GPUs do 500x more science than those spent on CPUs
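The 493x figure is a performance-per-dollar ratio. A minimal sketch of the arithmetic (only the prices come from the slide; the GPU node's relative performance of ~0.8x is an assumed stand-in for "similar performance"):

```python
# Performance per dollar, relative to the CPU supercomputer baseline.
# Prices are from the slide; perf_gpu=0.8 is an assumption standing in
# for "similar performance", not a measured number.
def perf_per_dollar_ratio(perf_gpu, price_gpu, perf_cpu, price_cpu):
    return (perf_gpu / price_gpu) / (perf_cpu / price_cpu)

ratio = perf_per_dollar_ratio(perf_gpu=0.8, price_gpu=31_000,
                              perf_cpu=1.0, price_cpu=19_000_000)
print(f"{ratio:.0f}x more performance per dollar")  # ~490x
```

With roughly 80% of the supercomputer's speed at 0.16% of its price, the ratio lands near the slide's 493x.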
Kepler’s Even Better

[Charts: TeraChem SCF walltime in seconds on C2050 vs. K20C; left: Olestra, BLYP/6-31G(d), 453 atoms; right: B3LYP/6-31G(d)]

TeraChem running on C2050 and K20C.
First graph is of BLYP/6-31G(d); second is B3LYP/6-31G(d).

Kepler performs 2x faster than Fermi
Viz, "Docking" and Related Applications Growing

Application — Features Supported | GPU Perf | Release Status | Notes

Amira 5® — 3D visualization of volumetric data and surfaces | 70x | Released, Version 5.3.3; single GPU | Visualization from Visage Imaging. Next release, 5.4, will use GPU for general-purpose processing in some functions. http://www.visageimaging.com/overview.html

BINDSURF — Allows fast processing of large ligand databases | 100X | Available upon request to authors; single GPU | High-throughput parallel blind virtual screening. http://www.biomedcentral.com/1471-2105/13/S14/S13

BUDE — Empirical free-energy forcefield | 6.5-13.4X | Released; single GPU | University of Bristol. http://www.bris.ac.uk/biochemistry/cpfg/bude/bude.htm

Core Hopping — GPU-accelerated application | 3.75-5000X | Released, Suite 2011; single and multi-GPU | Schrodinger, Inc. http://www.schrodinger.com/products/14/32/

FastROCS — Real-time shape similarity searching/comparison | 800-3000X | Released; single and multi-GPU | OpenEye Scientific Software. http://www.eyesopen.com/fastrocs

PyMol — Lines: 460% increase; Cartoons: 1246% increase; Surface: 1746% increase; Spheres: 753% increase; Ribbon: 426% increase | 1700x | Released, Version 1.5; single GPU | http://pymol.org/

VMD — High-quality rendering, large structures (100 million atoms), analysis and visualization tasks, multiple GPU support for display of molecular ... | 100-125X or greater on kernels | Released, Version 1.9 | Visualization from University of Illinois at Urbana-Champaign. http://www.ks.uiuc.edu/Research/vmd/

GPU Perf compared against multi-core x86 CPU socket. GPU Perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.
FastROCS
OpenEye Japan
Hideyuki Sato, Ph.D.

© 2012 OpenEye Scientific Software

ROCS on the GPU: FastROCS

[Chart: shape overlays per second, CPU vs. GPU (0-400,000 scale)]
Riding Moore’s Law

[Chart: shape overlays per second across GPU generations — C1060, C2050, C2075, C2090, K10, K20 (0-2,000,000 scale)]
FastROCS scaling across 4x K10s (2 physical GPUs per K10)

[Chart: conformers per second vs. number of individual K10 GPUs, 1 through 8 (0-9,000,000 scale); note, each K10 has 2 physical GPUs on the board]

Dataset: 53 million conformers (10.9 million compounds of PubChem at 5 conformers per molecule)
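Throughput numbers like these translate directly into screening time: time = conformers / (conformers per second). A minimal sketch (the ~8M conformers/s figure is a rough read of the 8-GPU point on the chart, not an exact measurement):

```python
# Estimate the wall time to screen a conformer database at a given throughput.
# The throughput value is a rough read of the chart's 8-GPU point, not a
# published measurement.
def screening_time_s(n_conformers, conformers_per_s):
    return n_conformers / conformers_per_s

t = screening_time_s(53_000_000, 8_000_000)
print(f"~{t:.1f} s to screen the full set")  # ~6.6 s
```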
Benefits of GPU Accelerated Computing

Faster than CPU-only systems in all tests

Large performance boost with marginal price increase

Energy usage cut by more than half

GPUs scale well within a node and across multiple nodes

The K20 is our fastest and lowest-power high-performance GPU yet

Try GPU accelerated TeraChem for free – www.nvidia.com/GPUTestDrive
GPU Test Drive

Experience GPU Acceleration

For Computational Chemistry Researchers, Biophysicists

Preconfigured with Molecular Dynamics Apps

Remotely Hosted GPU Servers

Free & Easy – Sign up, Log in and See Results

www.nvidia.com/gputestdrive



GPU Accelerated Computational Chemistry Applications

  • 1. update Updated: February 4, 2013
  • 2. Molecular Dynamics (MD) Applications Features Application GPU Perf Release Status Notes/Benchmarks Supported > 100 ns/day AMBER 12, GPU Revision Support 12.2 PMEMD Explicit Solvent & GB Released AMBER Implicit Solvent JAC NVE on 2X Multi-GPU, multi-node http://ambermd.org/gpus/benchmarks. K20s htm#Benchmarks 2x C2070 equals Release C37b1; Implicit (5x), Explicit (2x) Released CHARMM Solvent via OpenMM 32-35x X5667 Single & multi-GPU in single node http://www.charmm.org/news/c37b1.html#po CPUs stjump Two-body Forces, Link-cell Source only, Results Published Release V 4.03 DL_POLY Pairs, Ewald SPME forces, 4x Multi-GPU, multi-node http://www.stfc.ac.uk/CSE/randd/ccg/softwa Shake VV re/DL_POLY/25526.aspx 165 ns/Day Released GROMACS Implicit (5x), Explicit (2x) DHFR on Multi-GPU, multi-node Release 4.6; 1st Multi-GPU support 4X C2075s http://lammps.sandia.gov/bench.html#deskto Lennard-Jones, Gay-Berne, Released. LAMMPS Tersoff & many more potentials 3.5-18x on Titan Multi-GPU, multi-node p and http://lammps.sandia.gov/bench.html#titan 4.0 ns/days Released Full electrostatics with PME and NAMD most simulation features F1-ATPase on 100M atom capable NAMD 2.9 1x K20X Multi-GPU, multi-node GPU Perf compared against Multi-core x86 CPU socket. GPU Perf benchmarked on GPU supported features and may be a kernel to kernel perf comparison
  • 3. New/Additional MD Applications Ramping Features Application GPU Perf Release Status Notes Supported 4-29X Released, Version 1.8.51 Abalone Simulations (on 1060 GPU) (on 1060 GPU) Single GPU Agile Molecule, Inc. Computation of non-valent 4-29X Released, Version 1.1.4 Ascalaph interactions (on 1060 GPU) Single GPU Agile Molecule, Inc. 150 ns/day DHFR on Released Production bio-molecular dynamics (MD) ACEMD Written for use only on GPUs 1x K20 Single and multi-GPUs software specially optimized to run on GPUs Powerful distributed computing Depends upon Released; http://folding.stanford.edu Folding@Home molecular dynamics system; number of GPUs GPUs and CPUs GPUs get 4X the points of CPUs implicit solvent and folding High-performance all-atom Depends upon Released; http://www.gpugrid.net/ GPUGrid.net biomolecular simulations; number of GPUs NVIDIA GPUs only explicit solvent and binding Simple fluids and binary mixtures (pair potentials, high- Up to 66x on 2090 Released, Version 0.2.0 http://halmd.org/benchmarks.html#supercool HALMD precision NVE and NVT, dynamic vs. 1 CPU core Single GPU ed-binary-mixture-kob-andersen correlations) Kepler 2X faster Released, Version 0.11.2 http://codeblue.umich.edu/hoomd-blue/ HOOMD-Blue Written for use only on GPUs than Fermi Single and multi-GPU on 1 node Multi-GPU w/ MPI in March 2013 Implicit: 127-213 Implicit and explicit solvent, Released Version 4.1.1 Library and application for molecular dynamics OpenMM custom forces ns/day Explicit: 18- Multi-GPU on high-performance 55 ns/day DHFR GPU Perf compared against Multi-core x86 CPU socket. GPU Perf benchmarked on GPU supported features and may be a kernel to kernel perf comparison
  • 4. Quantum Chemistry Applications Application Features Supported GPU Perf Release Status Notes Local Hamiltonian, non-local Hamiltonian, LOBPCG algorithm, Released; Version 7.0.5 www.abinit.org Abinit diagonalization / 1.3-2.7X Multi-GPU support orthogonalization Integrating scheduling GPU into http://www.olcf.ornl.gov/wp- Under development ACES III SIAL programming language and 10X on kernels Multi-GPU support content/training/electronic-structure- SIP runtime environment 2012/deumens_ESaccel_2012.pdf Pilot project completed, ADF Fock Matrix, Hessians TBD Under development www.scm.com Multi-GPU support http://inac.cea.fr/L_Sim/BigDFT/news.html, http://www.olcf.ornl.gov/wp- 5-25X Released June 2009, content/training/electronic-structure- DFT; Daubechies wavelets, BigDFT part of Abinit (1 CPU core to current release 1.6.0 2012/BigDFT-Formalism.pdf and GPU kernel) Multi-GPU support http://www.olcf.ornl.gov/wp- content/training/electronic-structure- 2012/BigDFT-HPC-tues.pdf Under development, http://www.tcm.phy.cam.ac.uk/~mdt26/casino. Casino TBD TBD Spring 2013 release html Multi-GPU support http://www.olcf.ornl.gov/wp- DBCSR (spare matrix multiply Under development CP2K library) 2-7X Multi-GPU support content/training/ascc_2012/friday/ACSS_2012_V andeVondele_s.pdf Libqc with Rys Quadrature 1.3-1.6X, Released Next release Q4 2012. GAMESS-US Algorithm, Hartree-Fock, MP2 2.3-2.9x HF Multi-GPU support http://www.msg.ameslab.gov/gamess/index.html and CCSD in Q4 2012 GPU Perf compared against Multi-core x86 CPU socket. GPU Perf benchmarked on GPU supported features and may be a kernel to kernel perf comparison
  • 5. Quantum Chemistry Applications Application Features Supported GPU Perf Release Status Notes (ss|ss) type integrals within calculations using Hartree Fock ab Release in 2012 http://www.ncbi.nlm.nih.gov/pubmed/215419 GAMESS-UK initio methods and density 8x Multi-GPU support 63 functional theory. Supports organics & inorganics. Under development Joint PGI, NVIDIA & Gaussian Announced Aug. 29, 2011 Gaussian Collaboration TBD Multi-GPU support http://www.gaussian.com/g_press/nvidia_press.htm Electrostatic poisson equation, Released orthonormalizing of vectors, https://wiki.fysik.dtu.dk/gpaw/devel/projects/gpu.html, GPAW residual minimization method 8x Multi-GPU support Samuli Hakala (CSC Finland) & Chris O’Grady (SLAC) (rmm-diis) Under development Schrodinger, Inc. Jaguar Investigating GPU acceleration TBD Multi-GPU support http://www.schrodinger.com/kb/278 Released, Version 7.8 MOLCAS CU_BLAS support 1.1x Single GPU. Additional GPU www.molcas.org support coming in Version 8 Density-fitted MP2 (DF-MP2), 1.7-2.3X Under development www.molpro.net MOLPRO density fitted local correlation projected Multiple GPU Hans-Joachim Werner methods (DF-RHF, DF-KS), DFT GPU Perf compared against Multi-core x86 CPU socket. GPU Perf benchmarked on GPU supported features and may be a kernel to kernel perf comparison
  • 6. Quantum Chemistry Applications Features Application GPU Perf Release Status Notes Supported pseudodiagonalization, full Under Development Academic port. MOPAC2009 diagonalization, and density 3.8-14X Single GPU http://openmopac.net matrix assembling Development GPGPU benchmarks: Triples part of Reg-CCSD(T), www.nwchem-sw.org Release targeting March 2013 NWChem CCSD & EOMCCSD task 3-10X projected Multiple GPUs And http://www.olcf.ornl.gov/wp- schedulers content/training/electronic-structure- 2012/Krishnamoorthy-ESCMA12.pdf Octopus DFT and TDDFT TBD Released http://www.tddft.org/programs/octopus/ Density functional theory (DFT) First principles materials code that computes Released PEtot plane wave pseudopotential 6-10X Multi-GPU the behavior of the electron structures of calculations materials http://www.q- Q-CHEM RI-MP2 8x-14x Released, Version 4.0 chem.com/doc_for_web/qchem_manual_4.0.pdf GPU Perf compared against Multi-core x86 CPU socket. GPU Perf benchmarked on GPU supported features and may be a kernel to kernel perf comparison
  • 7. Quantum Chemistry Applications Features Application GPU Perf Release Status Notes Supported NCSA Released University of Illinois at Urbana-Champaign QMCPACK Main features 3-4x Multiple GPUs http://cms.mcc.uiuc.edu/qmcpack/index.php /GPU_version_of_QMCPACK Created by Irish Centre for Quantum PWscf package: linear algebra (matrix multiply), explicit 2.5-3.5x Released Version 5.0 High-End Computing http://www.quantum-espresso.org/index.php Espresso/PWscf computational kernels, 3D FFTs Multiple GPUs and http://www.quantum-espresso.org/ Completely redesigned to exploit GPU parallelism. YouTube: 44-650X vs. Released http://youtu.be/EJODzk6RFxE?hd=1 and TeraChem “Full GPU-based solution” GAMESS CPU Version 1.5 http://www.olcf.ornl.gov/wp- version Multi-GPU/single node content/training/electronic-structure- 2012/Luehr-ESCMA.pdf 2x Hybrid Hartree-Fock DFT 2 GPUs Available on request By Carnegie Mellon University VASP functionals including exact comparable to Multiple GPUs http://arxiv.org/pdf/1111.0716.pdf exchange 128 CPU cores Generalized Wang-Landau 3x Under development GPU Perf Electronic Structure Determination Workshop 2012: NICS compared against Multi-core x86 CPU socket. http://www.olcf.ornl.gov/wp- WL-LSMS method with 32 GPUs vs. Multi-GPU support GPU Perf benchmarked on GPU supported features content/training/electronic-structure- 32 (16-core) CPUs and2012/Eisenbach_OakRidge_February.pdfcomparison may be a kernel to kernel perf
  • 8. Viz, ―Docking‖ and Related Applications Growing Related Features GPU Perf Release Status Notes Applications Supported Visualization from Visage Imaging. Next release, 5.4, will use 3D visualization of volumetric Released, Version 5.3.3 Amira 5® data and surfaces 70x Single GPU GPU for general purpose processing in some functions http://www.visageimaging.com/overview.html High-Throughput parallel blind Virtual Screening, Allows fast processing of large Available upon request to BINDSURF ligand databases 100X authors; single GPU http://www.biomedcentral.com/1471-2105/13/S14/S13 Empirical Free Released University of Bristol BUDE Energy Forcefield 6.5-13.4X Single GPU http://www.bris.ac.uk/biochemistry/cpfg/bude/bude.htm Released, Suite 2011 Schrodinger, Inc. Core Hopping GPU accelerated application 3.75-5000X Single and multi-GPUs. http://www.schrodinger.com/products/14/32/ Real-time shape similarity Released Open Eyes Scientific Software FastROCS searching/comparison 800-3000X Single and multi-GPUs. http://www.eyesopen.com/fastrocs Lines: 460% increase Cartoons: 1246% increase Released, Version 1.5 PyMol Surface: 1746% increase 1700x Single GPUs http://pymol.org/ Spheres: 753% increase Ribbon: 426% increase High quality rendering, GPU Perf compared against Multi-core x86 CPU socket. large structures (100 million atoms), 100-125X or greater GPU Perf benchmarked on GPU supported features Visualization from University of Illinois at Urbana-Champaign VMD analysis and visualization tasks, multiple on kernels Released, Version 1.9 and mayhttp://www.ks.uiuc.edu/Research/vmd/ be a kernel to kernel perf comparison GPU support for display of molecular
  • 9. Bioinformatics Applications Features GPU Application Release Status Website Supported Speedup Alignment of short sequencing Version 0.6.2 – 3/2012 BarraCUDA reads 6-10x Multi-GPU, multi-node http://seqbarracuda.sourceforge.net/ Parallel search of Smith- Version 2.0.8 – Q1/2012 CUDASW++ Waterman database 10-50x Multi-GPU, multi-node http://sourceforge.net/projects/cudasw/ Parallel, accurate long read Version 1.0.40 – 6/2012 CUSHAW aligner for large genomes 10x Multiple-GPU http://cushaw.sourceforge.net/ Protein alignment according to Version 2.2.26 – 3/2012 http://eudoxus.cheme.cmu.edu/gpublast/gpu GPU-BLAST BLASTP 3-4x Single GPU blast.html Parallel local and global Version 2.3.2 – Q1/2012 http://www.mpihmmer.org/installguideGPUH GPU-HMMER search of Hidden Markov 60-100x Multi-GPU, multi-node MMER.htm Models Scalable motif discovery Version 3.0.12 https://sites.google.com/site/yongchaosoftwa mCUDA-MEME algorithm based on MEME 4-10x Multi-GPU, multi-node re/mcuda-meme Hardware and software for Released. SeqNFind reference assembly, blast, SW, 400x Multi-GPU, multi-node http://www.seqnfind.com/ HMM, de novo assembly Version 1.11 – 5/2012 UGENE Fast short read alignment 6-8x Multi-GPU, multi-node http://ugene.unipro.ru/ GPU Perf compared against same or similar code running on single CPU machine Parallel linear regression on Performance measured internally or independently
  • 11. MD Average Speedups The blue node contains Dual E5-2687W CPUs 10 (8 Cores per CPU). The green nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and 1 or 2 NVIDIA K10, K20, or Performance Relative to CPU Only 8 K20X GPUs. 6 4 2 0 CPU CPU + K10 CPU + K20 CPU + K20X CPU + 2x K10 CPU + 2x K20 CPU + 2x K20X Average speedup calculated from 4 AMBER, 3 NAMD, 3 LAMMPS, and 1 GROMACS test cases. Error bars show the maximum and minimum speedup for each hardware configuration.
  • 12. Molecular Dynamics (MD) Applications Features Application GPU Perf Release Status Notes/Benchmarks Supported > 100 ns/day AMBER 12, GPU Revision Support 12.2 PMEMD Explicit Solvent & GB Released AMBER Implicit Solvent JAC NVE on 2X Multi-GPU, multi-node http://ambermd.org/gpus/benchmarks. K20s htm#Benchmarks 2x C2070 equals Release C37b1; Implicit (5x), Explicit (2x) Released CHARMM Solvent via OpenMM 32-35x X5667 Single & multi-GPU in single node http://www.charmm.org/news/c37b1.html#po CPUs stjump Two-body Forces, Link-cell Source only, Results Published Release V 4.03 DL_POLY Pairs, Ewald SPME forces, 4x Multi-GPU, multi-node http://www.stfc.ac.uk/CSE/randd/ccg/softwa Shake VV re/DL_POLY/25526.aspx 165 ns/Day Released GROMACS Implicit (5x), Explicit (2x) DHFR on Multi-GPU, multi-node Release 4.6; 1st Multi-GPU support 4X C2075s http://lammps.sandia.gov/bench.html#deskto Lennard-Jones, Gay-Berne, Released. LAMMPS Tersoff & many more potentials 3.5-18x on Titan Multi-GPU, multi-node p and http://lammps.sandia.gov/bench.html#titan 4.0 ns/days Released Full electrostatics with PME and NAMD most simulation features F1-ATPase on 100M atom capable NAMD 2.9 1x K20X Multi-GPU, multi-node GPU Perf compared against Multi-core x86 CPU socket. GPU Perf benchmarked on GPU supported features and may be a kernel to kernel perf comparison
  • 13. New/Additional MD Applications Ramping Features Application GPU Perf Release Status Notes Supported 4-29X Released, Version 1.8.51 Abalone Simulations (on 1060 GPU) (on 1060 GPU) Single GPU Agile Molecule, Inc. Computation of non-valent 4-29X Released, Version 1.1.4 Ascalaph interactions (on 1060 GPU) Single GPU Agile Molecule, Inc. 150 ns/day DHFR on Released Production bio-molecular dynamics (MD) ACEMD Written for use only on GPUs 1x K20 Single and multi-GPUs software specially optimized to run on GPUs Powerful distributed computing Depends upon Released; http://folding.stanford.edu Folding@Home molecular dynamics system; number of GPUs GPUs and CPUs GPUs get 4X the points of CPUs implicit solvent and folding High-performance all-atom Depends upon Released; http://www.gpugrid.net/ GPUGrid.net biomolecular simulations; number of GPUs NVIDIA GPUs only explicit solvent and binding Simple fluids and binary mixtures (pair potentials, high- Up to 66x on 2090 Released, Version 0.2.0 http://halmd.org/benchmarks.html#supercool HALMD precision NVE and NVT, dynamic vs. 1 CPU core Single GPU ed-binary-mixture-kob-andersen correlations) Kepler 2X faster Released, Version 0.11.2 http://codeblue.umich.edu/hoomd-blue/ HOOMD-Blue Written for use only on GPUs than Fermi Single and multi-GPU on 1 node Multi-GPU w/ MPI in March 2013 Implicit: 127-213 Implicit and explicit solvent, Released Version 4.1.1 Library and application for molecular dynamics OpenMM custom forces ns/day Explicit: 18- Multi-GPU on high-performance 55 ns/day DHFR GPU Perf compared against Multi-core x86 CPU socket. GPU Perf benchmarked on GPU supported features and may be a kernel to kernel perf comparison
• 14. Computational Chemistry — Built from Ground Up for GPUs
What: Study disease & discover drugs; predict drug and protein interactions.
Why: Speed of simulations is critical; enables study of longer timeframes, larger systems, and more simulations.
How: GPUs increase throughput & accelerate simulations.
GPU-ready applications: Abalone, ACEMD, AMBER, DL_POLY, GAMESS, GROMACS, LAMMPS, NAMD, NWChem, Q-CHEM, Quantum Espresso, TeraChem
AMBER 11 application example: 4.6x performance increase with 2 GPUs at only a 54% added cost*
• AMBER 11 Cellulose NPT on 2x E5670 CPUs + 2x Tesla C2090s (per node) vs. 2x E5670 CPUs (per node)
• Cost of CPU node assumed to be $9,333; cost of adding two (2) C2090s to a single node is assumed to be $5,333
  • 15. AMBER 12 GPU Support Revision 12.2 1/22/2013 15
• 16. Kepler — Our Fastest Family of GPUs Yet
Factor IX running AMBER 12 GPU Support Revision 12.1. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green nodes contain dual E5-2687W CPUs plus 1x NVIDIA M2090, K10, K20, or K20X.
Nanoseconds/day: 1 CPU node 3.42; + M2090 11.85 (3.5x); + K10 18.90 (5.6x); + K20 22.44 (6.6x); + K20X 25.39 (7.4x).
GPU speedup/throughput increased from 3.5x (with M2090) to 7.4x (with K20X) when compared to a CPU-only node.
• 17. K10 Accelerates Simulations of All Sizes
Running AMBER 12 GPU Support Revision 12.1. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green nodes contain the same CPUs plus 1x NVIDIA K10 GPU.
Speedup vs. the CPU-only node across TRPcage (GB), JAC NVE (PME), Factor IX NVE (PME), Cellulose NVE (PME), Myoglobin (GB), and Nucleosome (GB) ranges from 2.0x to 24x.
Gain 24x performance on Nucleosome by adding just 1 GPU, compared to dual-CPU performance.
• 18. K20 Accelerates Simulations of All Sizes
Running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off. The blue node contains 2x Intel E5-2687W CPUs (8 cores per CPU); each green node adds 1x NVIDIA K20 GPU.
Speedup vs. the CPU-only node across TRPcage (GB), JAC NVE (PME), Factor IX NVE (PME), Cellulose NVE (PME), Myoglobin (GB), and Nucleosome (GB) ranges from 2.66x to 28x.
Gain 28x throughput on Nucleosome by adding just one K20 GPU, compared to dual-CPU performance. (AMBER Benchmark Report, Revision 2.0, dated Nov. 5, 2012)
• 19. K20X Accelerates Simulations of All Sizes
Running AMBER 12 GPU Support Revision 12.1. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green nodes add 1x NVIDIA K20X GPU.
Speedup vs. the CPU-only node across TRPcage (GB), JAC NVE (PME), Factor IX NVE (PME), Cellulose NVE (PME), Myoglobin (GB), and Nucleosome (GB) ranges from 2.79x to 31.3x.
Gain 31x performance on Nucleosome by adding just one K20X GPU, compared to dual-CPU performance.
• 20. K10 Strong Scaling over Nodes
Cellulose, 408K atoms (NPT), running AMBER 12 with CUDA 4.2, ECC off. The blue nodes contain 2x Intel X5670 CPUs (6 cores per CPU); the green nodes add 2x NVIDIA K10 GPUs each.
GPU-over-CPU speedups of 2.4x-5.1x across 1, 2, and 4 nodes, largest on a single node, with nanoseconds/day rising with node count in both configurations.
GPUs significantly outperform CPUs while scaling over multiple nodes.
• 21. Kepler — Universally Faster
Running AMBER 12 GPU Support Revision 12.1. The CPU-only node contains dual E5-2687W CPUs (8 cores per CPU); the Kepler nodes contain the same CPUs plus 1x NVIDIA K10, K20, or K20X GPU.
Speedups vs. CPU-only are plotted for JAC, Factor IX, and Cellulose on CPU + K10, CPU + K20, and CPU + K20X. The Kepler GPUs accelerated all simulations, up to 8x.
• 22. K10 Extreme Performance
DHFR (JAC), 23K atoms (NVE), running AMBER 12 GPU Support Revision 12.1. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green node contains the same CPUs plus 2x NVIDIA K10 GPUs.
Nanoseconds/day: 12.47 (CPU-only node) vs. 97.99 (node with 2x K10).
Gain 7.8x performance by adding just 2 GPUs, compared to dual-CPU performance.
• 23. K20 Extreme Performance
DHFR (JAC), 23K atoms (NVE), running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off. The blue node contains 2x Intel E5-2687W CPUs (8 cores per CPU); the green node adds 2x NVIDIA K20 GPUs.
Nanoseconds/day: 12.47 (CPU-only node) vs. 95.59 (node with 2x K20).
Gain >7.5x throughput by adding just 2 K20 GPUs, compared to dual-CPU performance.
• 24. Replace 8 Nodes with 1 K20 GPU
Running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off. The eight blue nodes each contain 2x Intel E5-2687W CPUs (8 cores per CPU); the green node contains the same CPUs plus 1x NVIDIA K20 GPU.
DHFR nanoseconds/day: 65.00 for the 8-node CPU cluster ($32,000) vs. 81.09 for the single GPU node ($6,500).
Note: typical CPU and GPU node pricing used; pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.
Cut simulation costs to 1/4 and gain higher performance.
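The cost claims on these slides reduce to simple throughput-per-dollar arithmetic. A minimal sketch using the DHFR figures quoted above (65.00 ns/day for eight CPU nodes at $32,000 vs. 81.09 ns/day for one K20 node at $6,500); the function name is illustrative, not from any benchmark tool:

```python
def perf_per_dollar(ns_per_day, cost_usd):
    """Throughput bought per hardware dollar (ns/day per $)."""
    return ns_per_day / cost_usd

cpu_cluster = perf_per_dollar(65.00, 32_000)  # eight dual-socket CPU nodes
gpu_node = perf_per_dollar(81.09, 6_500)      # one CPU node + 1x K20
advantage = gpu_node / cpu_cluster            # ~6x better price-performance
```

The GPU node wins on both axes at once: higher absolute throughput and roughly a sixth of the hardware cost.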
• 25. Replace 7 Nodes with 1 K10 GPU
Performance on JAC (DHFR) NVE vs. cost, running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off. The eight blue nodes each contain 2x Intel E5-2687W CPUs (8 cores per CPU); the green node contains the same CPUs plus 1x NVIDIA K10 GPU.
Cost: $32,000 for the CPU-only cluster vs. $7,000 for the GPU-enabled node. Note: typical CPU and GPU node pricing used; pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.
Cut simulation costs to about 1/4 and increase performance by 70%.
• 26. Extra CPUs Decrease Performance
Cellulose NVE, running AMBER 12 GPU Support Revision 12.1. The orange bars use one E5-2687W CPU (8 cores); the blue bars use dual E5-2687W CPUs. Configurations compared: 1 CPU, 2 CPUs, 1 CPU + 2x K20, and 2 CPUs + 2x K20.
When used with GPUs, dual CPU sockets perform worse than single CPU sockets.
• 27. Kepler — Greener Science
Energy used in simulating 1 ns of DHFR (JAC), running AMBER 12 GPU Support Revision 12.1. The blue node contains dual E5-2687W CPUs (150W each, 8 cores per CPU); the green nodes add 1x NVIDIA K10, K20, or K20X GPU (235W each). Lower is better; energy expended = power x time.
The GPU-accelerated systems use 65-75% less energy.
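The "energy expended = power x time" relation behind these charts is easy to reproduce. A sketch assuming the TDP and throughput numbers quoted in this slide series (dual 150 W CPUs, a 235 W K20X, and the Factor IX throughputs of 3.42 and 25.39 ns/day from the earlier Kepler chart); the function is illustrative:

```python
def energy_kj(node_power_w, ns_per_day, sim_ns=1.0):
    """Energy (kJ) consumed to simulate `sim_ns` nanoseconds at a given throughput."""
    wall_seconds = sim_ns / ns_per_day * 86_400.0  # fraction of a day -> seconds
    return node_power_w * wall_seconds / 1_000.0   # W * s = J; /1000 -> kJ

cpu_only = energy_kj(2 * 150, 3.42)        # CPU-only node on Factor IX
cpu_gpu = energy_kj(2 * 150 + 235, 25.39)  # same node + 1x K20X
savings = 1.0 - cpu_gpu / cpu_only         # fraction of energy saved
```

With these inputs `savings` comes out around three quarters: the GPU node draws more power but finishes so much sooner that total energy drops, consistent with the 65-75% figure reported here.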
• 28. Recommended GPU Node Configuration for AMBER
Computational chemistry workstation or single-node configuration:
- # of CPU sockets: 2
- Cores per CPU socket: 4+ (1 CPU core drives 1 GPU)
- CPU speed (GHz): 2.66+
- System memory per node (GB): 16
- GPUs: Kepler K10, K20, K20X; Fermi M2090, M2075, C2075
- # of GPUs per CPU socket: 1-2 (4 GPUs on 1 socket is good for 4 fast serial GPU runs)
- GPU memory preference (GB): 6
- GPU-to-CPU connection: PCIe 2.0 16x or higher
- Server storage: 2 TB
Scale to multiple nodes with the same single-node configuration.
• 29. Benefits of GPU-Accelerated AMBER Computing
- Faster than CPU-only systems in all tests
- Most major compute-intensive aspects of classical MD ported
- Large performance boost with a marginal price increase
- Energy usage cut by more than half
- GPUs scale well within a node and over multiple nodes
- The K20 GPU is our fastest and lowest-power high-performance GPU yet
Try GPU-accelerated AMBER for free: www.nvidia.com/GPUTestDrive
• 31. Kepler — Our Fastest Family of GPUs Yet
ApoA1 (Apolipoprotein A1) running NAMD version 2.9. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green nodes contain the same CPUs plus 1x NVIDIA M2090, K10, K20, or K20X.
Nanoseconds/day: CPU node 1.37; + M2090 2.63 (1.9x); + K10 3.45 (2.5x); + K20 3.57 (2.6x); + K20X 4.00 (2.9x).
GPU speedup/throughput increased from 1.9x (with M2090) to 2.9x (with K20X) when compared to a CPU-only node. (NAMD Benchmark Report, Revision 2.0, dated Nov. 5, 2012)
• 32. Accelerates Simulations of All Sizes
Running NAMD 2.9 with CUDA 4.0, ECC off. The blue node contains 2x Intel E5-2687W CPUs (8 cores per CPU); each green node adds 1x NVIDIA K20 GPU.
Speedups vs. CPU-only of 2.4x-2.7x across ApoA1, F1-ATPase, and STMV.
Gain 2.5x throughput/performance by adding just 1 GPU, compared to dual-CPU performance.
• 33. Kepler — Universally Faster
Running NAMD version 2.9 on F1-ATPase, ApoA1, and STMV. The CPU-only node contains dual E5-2687W CPUs (8 cores per CPU); the Kepler nodes add 1 or 2 NVIDIA K10, K20, or K20X GPUs.
Average speedups (printed in bars): 1x K10 2.4x, 1x K20 2.6x, 1x K20X 2.9x, 2x K10 4.3x, 2x K20 4.7x, 2x K20X 5.1x.
The Kepler GPUs accelerate all simulations, up to 5x.
• 34. Outstanding Strong Scaling with Multi-STMV
100-STMV (a concatenation of 100 Satellite Tobacco Mosaic Virus systems) on hundreds of nodes, running NAMD version 2.9. Each blue XE6 CPU node contains 1x AMD Opteron 1600 (16 cores per CPU); each green XK6 node adds 1x NVIDIA X2090 GPU.
Speedups of 2.7x-3.8x across 32-768 nodes.
Accelerate your science by 2.7-3.8x compared to CPU-based supercomputers.
• 35. Replace 3 Nodes with 1 M2090 GPU
F1-ATPase running NAMD version 2.9. Each blue node contains 2x Intel Xeon X5550 CPUs (4 cores per CPU); the green node contains the same CPUs plus 1x NVIDIA M2090 GPU.
Nanoseconds/day: 0.63 for 4 CPU nodes ($8,000) vs. 0.74 for 1 CPU node + 1x M2090 ($4,000). Note: typical CPU and GPU node pricing used; pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.
Speedup of 1.2x for 50% of the cost.
• 36. K20 — Greener: Twice the Science per Watt
Energy used in simulating 1 ns of ApoA1, running NAMD version 2.9. Each blue node contains dual E5-2687W CPUs (95W, 4 cores per CPU); each green node contains 2x Intel Xeon X5550 CPUs (95W, 4 cores per CPU) and 2x NVIDIA K20 GPUs (225W per GPU). Lower is better; energy expended = power x time.
Cut energy usage in half with GPUs.
• 37. Kepler — Greener: Twice the Science per Joule
Energy used in simulating 1 ns of STMV (Satellite Tobacco Mosaic Virus), running NAMD version 2.9. The blue node contains dual E5-2687W CPUs (150W each, 8 cores per CPU); the green nodes add 2x NVIDIA K10, K20, or K20X GPUs (235W each). Lower is better; energy expended = power x time.
Cut energy usage in half with GPUs.
• 38. Recommended GPU Node Configuration for NAMD
Computational chemistry workstation or single-node configuration:
- # of CPU sockets: 2
- Cores per CPU socket: 6+
- CPU speed (GHz): 2.66+
- System memory per socket (GB): 32
- GPUs: Kepler K10, K20, K20X; Fermi M2090, M2075, C2075
- # of GPUs per CPU socket: 1-2
- GPU memory preference (GB): 6
- GPU-to-CPU connection: PCIe 2.0 or higher
- Server storage: 500 GB or higher
- Network configuration: Gemini, InfiniBand
Scale to multiple nodes with the same single-node configuration.
• 39. Summary/Conclusions — Benefits of GPU-Accelerated Computing
- Faster than CPU-only systems in all tests
- Large performance boost with a small marginal price increase
- Energy usage cut in half
- GPUs scale very well within a node and over multiple nodes
- The Tesla K20 GPU is our fastest and lowest-power high-performance GPU to date
Try GPU-accelerated NAMD for free: www.nvidia.com/GPUTestDrive
  • 40. LAMMPS, Jan. 2013 or later
• 41. More Science for Your Money
Embedded Atom Model (EAM). The blue node uses 2x E5-2687W CPUs (8 cores and 150W per CPU); the green nodes add 1 or 2 NVIDIA K10, K20, or K20X GPUs (235W).
Speedups vs. CPU-only range from 1.7x (single K10) up to 5.5x (dual K20X).
Experience performance increases of up to 5.5x with Kepler GPU nodes.
• 42. K20X, the Fastest GPU Yet
The blue node uses 2x E5-2687W CPUs (8 cores and 150W per CPU); the green nodes add 2x NVIDIA M2090s, 1x K20X, or 2x K20X GPUs (235W).
Experience performance increases of up to 6.2x with Kepler GPU nodes. One K20X performs as well as two M2090s.
• 43. Get a CPU Rebate to Fund Part of Your GPU Budget
Acceleration in loop-time computation by additional GPUs, running NAMD version 2.9. The blue node contains dual X5670 CPUs (6 cores per CPU); the green nodes contain dual X5570 CPUs (4 cores per CPU) and 1-4 NVIDIA M2090 GPUs.
Speedup normalized to CPU-only: 5.31 (1x M2090), 9.88 (2x), 12.9 (3x), 18.2 (4x).
Increase performance 18x compared to CPU-only nodes. Cheaper CPUs used with GPUs still deliver faster overall performance than more expensive CPUs alone.
• 44. Excellent Strong Scaling on Large Clusters
LAMMPS Gay-Berne, 134M atoms. Each blue Cray XE6 node has 2x AMD Opteron CPUs (16 cores per CPU); each green Cray XK6 node has 1x AMD Opteron 1600 CPU (16 cores per CPU) and 1x NVIDIA X2090.
From 300 to 900 nodes, the NVIDIA GPU-powered XK6 maintained roughly 3.5x (3.45-3.55x) the performance of the XE6 CPU nodes.
• 45. GPUs Sustain 5x Performance for Weak Scaling
Weak scaling with 32K atoms per node, from 1 to 729 nodes. Each blue Cray XE6 node has 2x AMD Opteron CPUs (16 cores per CPU); each green Cray XK6 node has 1x AMD Opteron 1600 CPU (16 cores per CPU) and 1x NVIDIA X2090.
GPU-accelerated nodes delivered 4.8x-6.7x the performance of CPU-only nodes.
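"Weak scaling" here means the problem grows with the machine (32K atoms per node), so the ideal loop time stays flat and the relevant metric is the ratio of CPU-only to GPU-accelerated loop time at the same node count. A sketch of that computation; the loop times below are illustrative placeholders, not the chart's actual values:

```python
def gpu_speedup(cpu_loop_s, gpu_loop_s):
    """LAMMPS reports wall-clock 'loop time'; lower is better,
    so speedup is CPU time divided by GPU time at equal node count."""
    return cpu_loop_s / gpu_loop_s

# Illustrative loop times only (node count -> (CPU-only s, with-GPU s)):
pairs = {1: (30.0, 4.5), 216: (34.0, 6.0), 729: (38.0, 7.9)}
speedups = {n: gpu_speedup(c, g) for n, (c, g) in pairs.items()}
```

If the per-node speedup stays in a narrow band as nodes are added, as it does in the 4.8x-6.7x range above, the GPU advantage is sustained rather than eroded by communication overhead.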
• 46. Faster, Greener — Worth It!
Energy consumed in one loop of EAM; lower is better. GPU-accelerated computing uses 53% less energy than CPU-only. Energy expended = power x time; power calculated by combining the components' TDPs.
The blue node uses 2x E5-2687W CPUs (8 cores and 150W per CPU) and CUDA 4.2.9; the green nodes add 1 or 2 NVIDIA K20X GPUs (235W) running CUDA 5.0.36.
Try GPU-accelerated LAMMPS for free: www.nvidia.com/GPUTestDrive
  • 47. Molecular Dynamics with LAMMPS on a Hybrid Cray Supercomputer W. Michael Brown National Center for Computational Sciences Oak Ridge National Laboratory NVIDIA Technology Theater, Supercomputing 2012 November 14, 2012
• 48. Early Kepler Benchmarks on Titan
Log-scale loop-time charts for the Atomic Fluid and Bulk Copper benchmarks, comparing XK6 (CPU-only), XK6+GPU, and XK7+GPU configurations from 1 up to 16,384 nodes. (Chart data not reproduced here.)
• 49. Early Kepler Benchmarks on Titan
Corresponding loop-time charts for the Protein and Liquid Crystal benchmarks, again comparing XK6, XK6+GPU, and XK7+GPU configurations over the same node range. (Chart data not reproduced here.)
• 50. Early Titan XK6/XK7 Benchmarks
Speedup with acceleration on XK6/XK7 nodes (1 node = 32K particles; 900 nodes = 29M particles):
                   Atomic Fluid    Atomic Fluid    Bulk Copper   Protein   Liquid Crystal
                   (cutoff 2.5σ)   (cutoff 5.0σ)
XK6 (1 node)       1.92            4.33            2.12          2.60      5.82
XK7 (1 node)       2.90            8.38            3.66          3.36      15.70
XK6 (900 nodes)    1.68            3.96            2.15          1.56      5.60
XK7 (900 nodes)    2.75            7.48            2.86          1.95      10.14
• 51. Recommended GPU Node Configuration for LAMMPS
Computational chemistry workstation or single-node configuration:
- # of CPU sockets: 2
- Cores per CPU socket: 6+
- CPU speed (GHz): 2.66+
- System memory per socket (GB): 32
- GPUs: Kepler K10, K20, K20X; Fermi M2090, M2075, C2075
- # of GPUs per CPU socket: 1-2
- GPU memory preference (GB): 6
- GPU-to-CPU connection: PCIe 2.0 or higher
- Server storage: 500 GB or higher
- Network configuration: Gemini, InfiniBand
Scale to multiple nodes with the same single-node configuration.
  • 52. GROMACS 4.6 Final, Pre-Beta and 4.6 Beta
• 54. Great Scaling in Small Systems
Running GROMACS 4.6 pre-beta with CUDA 4.1. Each blue node contains 1x Intel X5550 CPU (95W TDP, 4 cores per CPU); each green node adds 1x NVIDIA M2090 (225W TDP per GPU).
Benchmark system: RNase in water, 16,816 atoms in a truncated dodecahedron box. GPU speedups of 3.2x-3.7x across 1-3 nodes, reaching 21.68 ns/day with GPUs at 3 nodes.
Get up to 3.7x performance compared to CPU-only nodes.
• 55. Additional Strong Scaling on a Larger System
128K water molecules, running GROMACS 4.6 pre-beta with CUDA 4.1. Each blue node contains 1x Intel X5670 (95W TDP, 6 cores per CPU); each green node adds 1x NVIDIA M2070 (225W TDP per GPU).
Speedups range from about 3.1x at small node counts to about 2x at 128 nodes.
Up to 128 nodes, NVIDIA GPU-accelerated nodes deliver 2-3x the performance of CPU-only nodes.
• 56. Replace 3 Nodes with 2 GPUs
ADH in water (134K atoms), running GROMACS 4.6 pre-beta with CUDA 4.1. The blue configuration is 4 CPU nodes, each with 2x Intel X5550 CPUs (95W TDP, 4 cores per CPU); the green node contains 2x Intel X5550 CPUs plus 2x NVIDIA M2090s (225W TDP per GPU).
Nanoseconds/day: 6.7 for the 4 CPU nodes ($8,000) vs. 8.36 for 1 node + 2x M2090 ($6,500). Note: typical CPU and GPU node pricing used; pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.
Save thousands of dollars and perform 25% faster.
• 57. Greener Science
ADH in water (134K atoms), running GROMACS 4.6 with CUDA 4.1. The blue configuration is 4 nodes, each with 2x Intel X5550 CPUs (95W TDP, 4 cores per CPU; 760 watts total); the green configuration is 1 node with 2x X5550 CPUs and 2x NVIDIA M2090 GPUs (225W TDP per GPU; 640 watts). Lower is better; energy expended = power x time.
In simulating each nanosecond, the GPU-accelerated system uses 33% less energy.
• 58. The Power of Kepler
RNase solvated protein, 24K atoms, running GROMACS version 4.6 beta. The grey nodes contain 1 or 2 E5-2687W CPUs (150W each, 8 cores per CPU) and 1 or 2 NVIDIA M2090s; the green nodes use the same CPU configurations with 1 or 2 K20X GPUs (235W each) instead.
Upgrading from an M2090 to a K20X increases performance by 10-45%.
• 59. K20X — Fast
RNase solvated protein, 24K atoms, running GROMACS version 4.6 beta. The blue nodes contain 1 or 2 E5-2687W CPUs (150W each, 8 cores per CPU); the green nodes add 1x NVIDIA K20X GPU (235W).
Adding a K20X increases performance by up to 3x.
• 60. K20X, the Fastest Yet
192K water molecules, running GROMACS version 4.6-beta2 and CUDA 5.0.35. The blue node contains 2x E5-2687W CPUs (150W each, 8 cores per CPU); the green nodes add 1 or 2 NVIDIA K20X GPUs (235W each).
Using K20X nodes increases performance by 2.5x.
Try GPU-accelerated GROMACS 4.6 for free: www.nvidia.com/GPUTestDrive
• 61. Recommended GPU Node Configuration for GROMACS
Computational chemistry workstation or single-node configuration:
- # of CPU sockets: 2
- Cores per CPU socket: 6+
- CPU speed (GHz): 2.66+
- System memory per socket (GB): 32
- GPUs: Kepler K10, K20, K20X; Fermi M2090, M2075, C2075
- # of GPUs per CPU socket: 1x Kepler-based GPU (K20X, K20, or K10); needs a fast Sandy Bridge, the very fastest Westmeres, or high-end AMD Opterons
- GPU memory preference (GB): 6
- GPU-to-CPU connection: PCIe 2.0 or higher
- Server storage: 500 GB or higher
Scale to multiple nodes with the same single-node configuration.
• 63. GPUs Outperform CPUs
Daresbury Crambin, 19.6K atoms, running CHARMM release C37b1. The blue configuration uses 44 X5667 CPUs (95W, 4 cores per CPU); the green nodes contain 2 X5667 CPUs and 1 or 2 NVIDIA C2070 GPUs (238W each).
Comparable nanoseconds/day at very different cost: 44x X5667 ($44,000) vs. 2x X5667 + 1x C2070 ($3,000) vs. 2x X5667 + 2x C2070 ($4,000). Note: typical CPU and GPU node pricing used; pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.
1 GPU = 15 CPUs.
• 64. More Bang for Your Buck
Daresbury Crambin, 19.6K atoms, running CHARMM release C37b1. The blue configuration uses 44 X5667 CPUs (95W, 4 cores per CPU); the green nodes contain 2 X5667 CPUs and 1 or 2 NVIDIA C2070 GPUs (238W). Note: typical CPU and GPU node pricing used; pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.
Using GPUs delivers 10.6x the performance for the same cost.
• 65. Greener Science with NVIDIA
Energy used in simulating 1 ns of Daresbury G1nBP, 61.2K atoms, running CHARMM release C37b1. The blue configuration uses 64 X5667 CPUs (95W, 4 cores per CPU); the green nodes contain 2 X5667 CPUs and 1 or 2 NVIDIA C2070 GPUs (238W each). Lower is better; energy expended = power x time.
Using GPUs decreases energy use by 75%.
• 66. ACEMD (www.acellera.com)
470 ns/day on 1 GPU for L-Iduronic acid (1,362 atoms); 116 ns/day on 1 GPU for DHFR (23K atoms).
M. Harvey, G. Giupponi and G. De Fabritiis, "ACEMD: Accelerated molecular dynamics simulations in the microseconds timescale," J. Chem. Theory Comput. 5, 1632 (2009)
• 67. ACEMD (www.acellera.com)
Features: NVT, NPT, PME, TCL, PLUMED, CAMSHIFT[1]
1. M. J. Harvey and G. De Fabritiis, "An implementation of the smooth particle-mesh Ewald (PME) method on GPU hardware," J. Chem. Theory Comput., 5, 2371-2377 (2009)
2. For a list of selected references see http://www.acellera.com/acemd/publications
• 69. Quantum Chemistry Applications
- Abinit: local Hamiltonian, non-local Hamiltonian, LOBPCG algorithm, diagonalization/orthogonalization; 1.3-2.7x; released since Version 6.12, multi-GPU support; www.abinit.org
- ACES III: integrating GPU scheduling into the SIAL programming language and SIP runtime environment; 10x on kernels; under development, multi-GPU support; http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/deumens_ESaccel_2012.pdf
- ADF: Fock matrix, Hessians; TBD; pilot project completed, under development, multi-GPU support; www.scm.com
- BigDFT: DFT, Daubechies wavelets, part of Abinit; 5-25x (1 CPU core to GPU kernel); released June 2009, current release 1.6, multi-GPU support; http://inac.cea.fr/L_Sim/BigDFT/news.html, http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/BigDFT-Formalism.pdf and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/BigDFT-HPC-tues.pdf
- Casino: TBD; under development, Spring 2013 release, multi-GPU support; http://www.tcm.phy.cam.ac.uk/~mdt26/casino.html
- CP2K: DBCSR (sparse matrix multiply library); 2-7x; under development, multi-GPU support; http://www.olcf.ornl.gov/wp-content/training/ascc_2012/friday/ACSS_2012_VandeVondele_s.pdf
GPU perf compared against a multi-core x86 CPU socket. GPU perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.
• 70. Quantum Chemistry Applications
- GAMESS-UK: (ss|ss)-type integrals within calculations using Hartree-Fock ab initio methods and density functional theory; supports organics & inorganics; 8x; release in Summer 2012, multi-GPU support; http://www.ncbi.nlm.nih.gov/pubmed/21541963
- Gaussian: joint PGI, NVIDIA & Gaussian collaboration; TBD; under development, multi-GPU support; announced Aug. 29, 2011, http://www.gaussian.com/g_press/nvidia_press.htm
- GPAW: electrostatic Poisson equation, orthonormalizing of vectors, residual minimization method (RMM-DIIS); 8x; released, multi-GPU support; https://wiki.fysik.dtu.dk/gpaw/devel/projects/gpu.html; Samuli Hakala (CSC Finland) & Chris O'Grady (SLAC)
- Jaguar: investigating GPU acceleration; TBD; under development, multi-GPU support; Schrodinger, Inc., http://www.schrodinger.com/kb/278
- LSMS: generalized Wang-Landau method; 3x with 32 GPUs vs. 32 (16-core) CPUs; under development, multi-GPU support; NICS Electronic Structure Determination Workshop 2012, http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Eisenbach_OakRidge_February.pdf
- MOLCAS: CUBLAS support; 1.1x; released, Version 7.8, single GPU; additional GPU support coming in Version 8; www.molcas.org
- Molpro: density-fitted MP2 (DF-MP2), density-fitted local correlation; 1.7-2.3x; under development; www.molpro.net
• 71. Quantum Chemistry Applications
- MOPAC2009: pseudodiagonalization, full diagonalization, and density matrix assembling; 3.8-14x; under development, single GPU; academic port, http://openmopac.net
- NWChem: triples part of Reg-CCSD(T), CCSD & EOMCCSD task schedulers; 3-10x projected; development GPGPU benchmarks, release targeting end of 2012, multiple GPUs; www.nwchem-sw.org and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Krishnamoorthy-ESCMA12.pdf
- Octopus: DFT and TDDFT; TBD; released; http://www.tddft.org/programs/octopus/
- PEtot: density functional theory (DFT) plane-wave pseudopotential calculations; 6-10x; released, multi-GPU; first-principles materials code that computes the behavior of the electron structures of materials
- Q-CHEM: RI-MP2; 8x-14x; released, Version 4.0; http://www.q-chem.com/doc_for_web/qchem_manual_4.0.pdf
• 72. Quantum Chemistry Applications
- QMCPACK: main features; 3-4x; released, multiple GPUs; NCSA, University of Illinois at Urbana-Champaign, http://cms.mcc.uiuc.edu/qmcpack/index.php/GPU_version_of_QMCPACK
- Quantum Espresso/PWscf: PWscf package — linear algebra (matrix multiply), explicit computational kernels, 3D FFTs; 2.5-3.5x; released, Version 5.0, multiple GPUs; created by the Irish Centre for High-End Computing; http://www.quantum-espresso.org/index.php and http://www.quantum-espresso.org/
- TeraChem: "full GPU-based solution", completely redesigned to exploit GPU parallelism; 44-650x vs. GAMESS CPU version; released, Version 1.5, multi-GPU/single node; YouTube: http://youtu.be/EJODzk6RFxE?hd=1 and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Luehr-ESCMA.pdf
- VASP: hybrid Hartree-Fock DFT functionals including exact exchange; 2 GPUs comparable to 128 CPU cores; available on request, multiple GPUs; by Carnegie Mellon University, http://arxiv.org/pdf/1111.0716.pdf
  • 77. CP2K
• 78. Kepler, It's Faster
Running CP2K version 12413-trunk on CUDA 5.0.36. The blue node contains 2 E5-2687W CPUs (150W, 8 cores per CPU); the green nodes add 1 or 2 NVIDIA K10, K20, or K20X GPUs (235W each).
Using GPUs delivers up to 12.6x the performance per node.
• 79. Strong Scaling
Conducted on a Cray XK6 using matrix-matrix multiplication, NREP=6 and N=159,000 with 50% occupation; speedup measured relative to 256 non-GPU cores.
GPU speedups increase as more nodes are added: 2.3x at 256 cores, 2.9x at 512 cores, and up to 3x at 768 cores.
• 80. Kepler, Keeping the Planet Green
Running CP2K version 12413-trunk on CUDA 5.0.36. The blue node contains 2 E5-2687W CPUs (150W, 8 cores per CPU); the green nodes add 1 or 2 NVIDIA K20 GPUs (235W each). Lower is better; energy expended = power x time.
Using K20s lowers energy use by over 75% for the same simulation.
• 82. Gaussian
Key quantum chemistry code. ACS Fall 2011 press release: joint collaboration between Gaussian, NVIDIA, and PGI for GPU acceleration: http://www.gaussian.com/g_press/nvidia_press.htm. No such release exists for Intel MIC or AMD GPUs.
Mike Frisch quote: "Calculations using Gaussian are limited primarily by the available computing resources," said Dr. Michael Frisch, president of Gaussian, Inc. "By coordinating the development of hardware, compiler technology and application software among the three companies, the new application will bring the speed and cost-effectiveness of GPUs to the challenging problems and applications that Gaussian's customers need to address."
  • 84. GAMESS Partnership Overview Mark Gordon and Andrey Asadchev, key developers of GAMESS, in collaboration with NVIDIA. Mark Gordon is a recipient of a NVIDIA Professor Partnership Award. Quantum Chemistry one of major consumers of CPU cycles at national supercomputer centers NVIDIA developer resources fully allocated to GAMESS code “ We like to push the envelope as much as we can in the direction of highly scalable efficient codes. GPU technology seems like a good way to achieve this goal. Also, since we are associated with a DOE Laboratory, energy efficiency is important, and this is another reason to explore quantum chemistry on GPUs. ” Prof. Mark Gordon Distinguished Professor, Department of Chemistry, Iowa State University and Director, Applied Mathematical Sciences Program, AMES Laboratory 84
• 85. GAMESS August 2011 GPU Performance: first GPU-supported GAMESS release via “libqc”, a library for fast quantum chemistry on multiple NVIDIA GPUs in multiple nodes, implemented in CUDA. Accelerates 2e- AO integrals and their assembly into a closed-shell Fock matrix. Performance shown for two small molecules, Ginkgolide (53 atoms) and Vancomycin (176 atoms), comparing 4x E5640 CPUs against 4x E5640 CPUs + 4x Tesla C2070s.
• 86. Upcoming GAMESS Q4 2012 Release: multiple nodes with multiple GPUs supported. Rys Quadrature Hartree-Fock: 8 CPU cores + M2070 yields a 2.3-2.9x speedup over 8 CPU cores (see the 2012 publication). Møller–Plesset perturbation theory (MP2): preliminary code completed, paper in development. Coupled Cluster SD(T): CCSD code completed, (T) in progress.
• 87. GAMESS - New Multithreaded Hybrid CPU/GPU Approach to Hartree-Fock. Adding a single C2070 GPU speeds up computations by 2.3x to 2.9x across seven test cases: Taxol with the 6-31G, 6-31G(d), 6-31G(2d,2p) and 6-31++G(d,p) basis sets, and Valinomycin with 6-31G, 6-31G(d) and 6-31G(2d,2p). * A. Asadchev, M.S. Gordon, “New Multithreaded Hybrid CPU/GPU Approach to Hartree-Fock,” Journal of Chemical Theory and Computation (2012)
  • 88. GPAW
• 99. NWChem - speedup of the non-iterative calculation for various configurations/tile sizes. System: a cluster of dual-socket nodes built from 8-core AMD Interlagos processors, 64 GB of memory and Tesla M2090 (Fermi) GPUs; the nodes are connected by a high-performance QDR InfiniBand interconnect. Courtesy of Kowalski, K., Bhaskaran-Nair, et al. @ PNNL, JCTC (submitted)
• 101. Kepler, Fast Science: AUSURF performance relative to CPU only, running Quantum Espresso version 5.0-build7 on CUDA 5.0.36. The blue node contains 2 E5-2687W CPUs (150W, 8 cores per CPU); the green nodes contain 2 E5-2687W CPUs and 1 or 2 NVIDIA M2090 or K10 GPUs (225W and 235W respectively). Using K10s delivers up to 11.7x the performance per node over CPUs, and 1.7x the performance compared to M2090s.
• 102. Extreme Performance/Price from 1 GPU: performance scaled to the CPU-only system. Simulations run on FERMI @ ICHEC; a 6-core 2.66 GHz Intel X5650 was used for the CPU and an NVIDIA C2050 for the GPU. Test cases: Shilu-3 and Water-on-Calcite (calcite structure). Adding a GPU can improve performance by 3.7x while only increasing price by 25%.
• 103. Extreme Performance/Price from 1 GPU: price and performance scaled to the CPU-only system. Simulations run on FERMI @ ICHEC; a 6-core 2.66 GHz Intel X5650 was used for the CPU and an NVIDIA C2050 for the GPU. Test cases: AUSURF112 at the k-point and at the gamma point (a calculation for a gold surface of 112 atoms). Adding a GPU can improve performance by 3.5x while only increasing price by 25%.
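The performance/price claim on these two slides is simple arithmetic; a sketch using the approximate workstation price and Shilu-3 walltimes given in the speaker notes (both figures come from those notes, not from list prices):

```python
# Assumed from the speaker notes: FERMI workstation ~$4000, one C2050 ~$1000;
# Shilu-3 walltimes 1025 s (CPU only) vs 275 s (CPU + 1 GPU).
base_price, gpu_price = 4000.0, 1000.0
cpu_time, gpu_time = 1025.0, 275.0

speedup = cpu_time / gpu_time                         # ~3.7x faster
price_factor = (base_price + gpu_price) / base_price  # 1.25x the cost
value = speedup / price_factor                        # ~3x more work per dollar
```

The same arithmetic with the AUSURF112 walltimes gives the 3.5x figure on this slide.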
• 104. Replace 72 CPUs with 8 GPUs: elapsed time (minutes) for LSMO-BFO (120 atoms), 8 k-points. Simulations run on PLX @ CINECA; Intel 6-core 2.66 GHz X5550s were used for the CPUs and NVIDIA M2070s for the GPUs. 120 CPUs ($42,000): 223 minutes; 48 CPUs + 8 GPUs ($32,800): 219 minutes. The GPU-accelerated setup performs faster and costs 24% less.
• 105. QE/PWscf - Green Science: power consumption (watts, lower is better) for LSMO-BFO (120 atoms), 8 k-points. Simulations run on PLX @ CINECA; Intel 6-core 2.66 GHz X5550s were used for the CPUs and NVIDIA M2070s for the GPUs. Configurations: 120 CPUs ($42,000) vs 48 CPUs + 8 GPUs ($32,800). Over a year, the lower power consumption would save $4,300 on energy bills.
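The $4,300 figure can be reproduced from the electricity price and per-simulation energy quoted in the speaker notes; a sketch (all figures are taken from those notes and are approximate):

```python
# From the speaker notes: US national average electricity price 9.83 cents/kWh;
# the CPU-only system uses 42.37 kWh per simulation over ~2357 runs/year,
# the CPU+GPU system 23.214 kWh per simulation over ~2400 runs/year.
rate = 0.0983  # $/kWh

cpu_bill = 42.37 * 2357 * rate    # ~ $9,816 per year
gpu_bill = 23.214 * 2400 * rate   # ~ $5,476 per year
savings  = cpu_bill - gpu_bill    # ~ $4,300 per year
```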
• 106. NVIDIA GPUs Use Less Energy: energy consumption (kWh, lower is better) on different tests. Simulations run on FERMI @ ICHEC; a 6-core 2.66 GHz Intel X5650 was used for the CPU and an NVIDIA C2050 for the GPU. Energy reductions of 54% to 58% across the Shilu-3, AUSURF112 and Water-on-Calcite tests: in all tests, the GPU-accelerated system consumed less than half the energy of the CPU-only system.
• 107. QE/PWscf - Great Strong Scaling in Parallel: CdSe-159 (cadmium selenide nanodots, 159 atoms), walltime of 1 full SCF (lower is better). Simulations run on STONEY @ ICHEC; two quad-core 2.87 GHz Intel X5560s were used in each node, with two NVIDIA M2090s per node for the CPU+GPU test. Across 2 to 14 nodes (16 to 112 CPU cores), GPU acceleration delivers speedups of 2.1x to 2.5x.
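Each speedup label on the chart is just the ratio of CPU-only to hybrid walltime at that node count; a sketch using the walltimes from the speaker notes:

```python
# Walltimes (s) for one full SCF of CdSe-159, from the speaker notes,
# at 2, 4, 6, 8, 10, 12 and 14 nodes (16 to 112 CPU cores).
nodes  = [2, 4, 6, 8, 10, 12, 14]
cpu    = [31000, 16500, 11000, 9500, 7500, 6000, 5500]
hybrid = [12500, 7000, 5000, 4500, 3500, 3000, 2500]

# Speedup at each node count is the ratio of the two walltimes.
speedups = [c / g for c, g in zip(cpu, hybrid)]  # from ~2.5x down to ~2.2x
```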
• 108. QE/PWscf - More Powerful Strong Scaling: GeSnTe134, walltime of a full SCF (lower is better). Simulations run on PLX @ CINECA; two 6-core 2.4 GHz Intel E5645s were used in each node, with two NVIDIA M2070s per node for the CPU+GPU test. Across 4 to 44 nodes (48 to 528 CPU cores), speedups range from 1.6x to 2.4x. Accelerate your cluster with NVIDIA GPUs. Try GPU-accelerated Quantum Espresso for free: www.nvidia.com/GPUTestDrive
• 110. TeraChem - Supercomputer Speeds on GPUs: time for one SCF step on the giant fullerene C240 molecule. TeraChem running on 8 C2050s in 1 node vs NWChem running on 4096 quad-core CPUs in the Chinook supercomputer. 4096 quad-core CPUs ($19,000,000) vs 8 C2050s ($31,000): similar performance from just a handful of GPUs.
• 111. TeraChem - Bang for the Buck: performance/price relative to the supercomputer for the giant fullerene C240 molecule. TeraChem running on 8 C2050s in 1 node vs NWChem running on 4096 quad-core CPUs in the Chinook supercomputer: 1 for the 4096 quad-core CPUs ($19,000,000) vs 493 for the 8 C2050s ($31,000). Note: typical CPU and GPU node pricing used; pricing may vary depending on node configuration; contact your preferred HW vendor for actual pricing. Dollars spent on GPUs do roughly 500x more science than those spent on CPUs.
• 112. Kepler’s Even Better: Olestra (453 atoms), TeraChem running on a C2050 and a K20C. The first graph shows BLYP/6-31G(d) walltimes (seconds), the second B3LYP/6-31G(d). Kepler (K20C) performs 2x faster than the Fermi-based C2050.
• 113. Viz, ―Docking‖ and Related Applications (growing list; features supported, GPU perf, release status, notes):
Amira 5®: 3D visualization of volumetric data and surfaces; 70x; released, Version 5.3.3, single GPU. Visualization from Visage Imaging; the next release, 5.4, will use the GPU for general-purpose processing in some functions. http://www.visageimaging.com/overview.html
BINDSURF: high-throughput parallel blind virtual screening, allows fast processing of large ligand databases; 100x; available upon request to authors, single GPU. http://www.biomedcentral.com/1471-2105/13/S14/S13
BUDE: empirical free energy forcefield; 6.5-13.4x; released, single GPU. University of Bristol. http://www.bris.ac.uk/biochemistry/cpfg/bude/bude.htm
Core Hopping: GPU-accelerated application; 3.75-5000x; released, Suite 2011, single and multi-GPU. Schrodinger, Inc. http://www.schrodinger.com/products/14/32/
FastROCS: real-time shape similarity searching/comparison; 800-3000x; released, single and multi-GPU. OpenEye Scientific Software. http://www.eyesopen.com/fastrocs
PyMol: lines 460% increase, cartoons 1246%, surface 1746%, spheres 753%, ribbon 426%; 1700x; released, Version 1.5, single GPU. http://pymol.org/
VMD: high-quality rendering of large structures (100 million atoms), analysis and visualization tasks, multiple GPU-supported features, GPU support for molecular display; 100-125x or greater on kernels; released, Version 1.9. University of Illinois at Urbana-Champaign. http://www.ks.uiuc.edu/Research/vmd/
GPU perf compared against a multi-core x86 CPU socket; GPU perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.
  • 114. FastROCS OpenEye Japan Hideyuki Sato, Ph.D. © 2012 OpenEye Scientific Software
• 115. ROCS on the GPU: FastROCS. Shape overlays per second, CPU vs GPU.
• 116. Riding Moore’s Law: FastROCS shape overlays per second across GPU generations, from the C1060 through the C2050, C2075 and C2090 to the K10 and K20.
• 117. FastROCS scaling across 4x K10s (2 physical GPUs per K10): conformers per second against the number of individual K10 GPUs, from 1 to 8. Dataset: 53 million conformers (10.9 million compounds of PubChem at 5 conformers per molecule). Note: each K10 has 2 physical GPUs on the board.
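At these rates a full-database screen takes only seconds; a rough sketch, where the sustained throughput is an assumption read off the chart’s vertical scale rather than a published figure:

```python
# Assumed throughput: suppose 8 K10 GPUs sustain ~8 million conformer
# overlays per second (the order of magnitude shown on the chart).
conformers = 53_000_000           # the PubChem set from the slide
throughput = 8_000_000            # conformers/second, assumed
seconds = conformers / throughput # one full-database query in a few seconds
```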
• 118. Benefits of GPU-Accelerated Computing: faster than CPU-only systems in all tests; large performance boost with a marginal price increase; energy usage cut by more than half; GPUs scale well within a node and over multiple nodes; the K20 is our fastest and most power-efficient high-performance GPU yet. Try GPU-accelerated TeraChem for free: www.nvidia.com/GPUTestDrive
• 119. GPU Test Drive: experience GPU acceleration. For computational chemistry researchers and biophysicists; preconfigured with molecular dynamics apps; remotely hosted GPU servers. Free and easy: sign up, log in and see results. www.nvidia.com/gputestdrive

Editor's Notes

• #4: Note the rise of GPU-only applications and GPU-grid applications. This indicates that GPUs are a sweet spot for MD.
• #5: Due to the great interest in speeding up quantum chemistry applications, NVIDIA has begun reaching out to these developers to see how it may assist their development on GPUs. These developers either have active GPU development projects or have released applications.
• #12: Benchmark table; columns are CPU, +K10, +K20, +K20X, +2x K10, +2x K20, +2x K20X. AMBER (ns/day): Cellulose 0.74, 4.34, 5.39, 6.14, 6.4, 6.78, 7.5; Factor 9 NVE 3.42, 18.9, 22.4, 25.4, 29.2, 28.1, 31.4; JAC NVE 12.47, 68.6, 81.1, 89.1, 98, 95.6, 102.1; TRPcage 210, 420, 559, 585, 418, 451, 475. NAMD: ApoA1 1.37, 3.45, 3.57, 4, 6.25, 6.67, 7.14; ATPase 0.46, 0.96, 1.12, 1.25, 1.78, 2.04, 2.22; STMV 0.115, 0.29, 0.31, 0.35, 0.52, 0.56, 0.61. LAMMPS: Fluid LJ (5) 1, 1.95, 1.11, 2.82, 4.38, 2.22, 3.62; EAM 1, 1.7, 2.47, 2.92, 3.3, 4.5, 5.5; Rhodopsin 1, 1.33, 0.77, 1.6, 2.35, 1.48, 2.28. GROMACS: RNase 46.7, 109, 120.
• #17: ns/day: Dual E5-2687W CPUs 3.4; Dual E5-2687W CPUs + M2090 11.9; + K10 18.9; + K20 22.4; + K20X 25.39.
• #18: CPU ns/day / GPU ns/day: Trpcage 210 / 420; JAC NVE 12.47 / 68.6; Factor 9 3.42 / 18.9; Cellulose 0.74 / 3.73; Myoglobin 6.12 / 122.3; Nucleosome 0.1 / 2.4.
• #19: CPU ns/day / GPU ns/day: TRPcage GB 210.32 / 559.32; JAC NVE PME 12.47 / 81.09; Factor IX NVE PME 3.42 / 22.44; Cellulose NVE PME 0.74 / 5.39; Myoglobin GB 6.12 / 156.45; Nucleosome GB 0.10 / 2.80. SPFP, ECC off.
• #20: CPU ns/day / GPU ns/day: Trpcage 210 / 585; JAC NVE 12.47 / 89.13; Factor 9 3.42 / 25.4; Cellulose 0.74 / 6.14; Myoglobin 6.12 / 175.77; Nucleosome 0.1 / 3.13.
• #21: Nodes, CPU ns/day, GPU ns/day: 1, 0.65, 3.31; 2, 1.14, 4.13; 4, 2.01, 4.8.
• #22: Energy: TDP (W), sec/ns, energy (kJ): 2x E5-2687 300, 6928, 2078; 2x E5-2687 + K10 535, 1259, 673; 2x E5-2687 + K20 535, 1065, 569; 2x E5-2687 + K20X 535, 969, 518.
  • #24: 1 CPU node (dual CPUs) = 12.47 ns/day1 CPU+ GPU node (dual CPUs and GPUs) = 95.59 ns/day
• #27: Perf lab (columns: no GPU, K10, K20, K20X, 2x K10, 2x K20, 2x K20X). Cellulose, 1 CPU: 0.37, 4.44, 5.4, 6.16, 6.37, 6.93, 7.67. Cellulose, 2 CPU: 0.74, 4.34, 5.39, 6.14, 6.4, 6.78, 7.5.
• #32: ns/day: Dual E5-2687W CPUs 1.370; Dual E5-2687W CPUs + M2090 2.632; + K10 3.448; + K20 3.571; + K20X 4.000.
• #33: CPU ns/day / GPU ns/day: ApoA1 1.370 / 3.571; F1-ATPase 0.461 / 1.124; STMV 0.116 / 0.314. ECC off.
• #34: All numbers are days/ns (columns: ApoA1, ATPase, STMV). CPU only: 0.73, 2.17, 8.64. 1x K10: 0.29, 1.04, 3.5. 1x K20: 0.28, 0.89, 3.18. 1x K20X: 0.25, 0.8, 2.87. 2x K10: 0.16, 0.56, 1.93. 2x K20: 0.15, 0.49, 1.77. 2x K20X: 0.14, 0.45, 1.63.
• #35: Cores: 32, 64, 128, 256, 512, 640, 768. s/step GPU XK6: 1.2414, 0.660887, 0.342743, 0.199465, 0.10837, 0.089752, 0.0774948. s/step CPU XK6: 4.62633, 2.36707, 1.19722, 0.609124, 0.314745, 0.255016, 0.209511. ns/day Fermi XK6: 0.069599, 0.130733, 0.252084, 0.433159, 0.797269, 0.962655, 1.114914. ns/day CPU XK6: 0.018676, 0.036501, 0.072167, 0.141843, 0.274508, 0.338802, 0.412389.
  • #37: Config: TDP sec/ns energy 2x E5-2687W 150 63,072.0 9,460,800.0 2x E5-2687W+ 2x K20 600 24,192.0 14,515,200 TDP = Thermal Design Power
• #38: ns/day, TDP (W), energy: CPU 0.115, 300, 223k; K10s 0.518, 770, 128k; K20s 0.565, 770, 117k; K20Xs 0.613, 770, 108k.
• #42: Loop times: CPU only 382.1; CPU + K10 225; CPU + 2x K10 115.4; 1x K20 154.6; 2x K20 84.2; 1x K20X 130.5; 2x K20X 69.9.
• #44: Config, loop time: 2x X5670 (HP Z800) 2717.630; 1x M2090 (2x X5570) 511.750; 2x M2090 (2x X5570) 274.970; 3x M2090 (2x X5570) 210.430; 4x M2090 (2x X5570) 148.880.
• #45: Nodes: 300, 400, 500, 600, 700, 800, 900. CPU-only time: 563.96, 423.83, 339.62, 281.58, 260.98, 220.83, 203.13. CPU+GPU time: 159.06, 118.62, 96.44, 81.03, 71.57, 63.76, 58.96. GPU speedup ratio: 3.55, 3.57, 3.52, 3.48, 3.65, 3.46, 3.45.
• #46: Nodes, box size, atoms, CPU time, CPU+GPU time, GPU speedup: 1, 1x1x1, 32768, 42.2, 6.33, 6.67x; 8, 2x2x2, 262144, 41.8, 6.73, 6.21x; 27, 3x3x3, 884736, 41.5, 6.86, 6.05x; 64, 4x4x4, 2097152, 41.5, 7.18, 5.78x; 125, 5x5x5, 4096000, 41.4, 7.18, 5.77x; 216, 6x6x6, 7077888, 42, 7.66, 5.48x; 343, 7x7x7, 11239424, 41.9, 8.34, 5.02x; 512, 8x8x8, 16777216, 42.3, 8.41, 5.03x; 729, 9x9x9, 23887872, 42.5, 8.92, 4.76x.
• #47: Power (W), time (s), energy spent (kJ): CPU 300, 382, 114; CPU + 1x K20X 535, 130, 69; CPU + 2x K20X 770, 70, 54.
• #54: ns/day: Single E5-2687W CPU 4.35 (1.0x); Dual E5-2687W CPUs 7.32 (1.7x); Single + M2090 7.33 (1.7x); Dual + M2090 7.54 (1.7x); Single + K10 13.24 (3.0x); Dual + K10 13.24 (3.0x); Single + K20 11.6 (2.7x); Dual + K20 12.26 (2.8x); Single + K20X 11.99 (2.7x); Dual + K20X 12.27 (2.8x).
• #55: Nodes, CPU only, GPU: 1, 2.26, 8.36; 2, 3.58, 13.01; 4, 6.7, 21.68.
• #56: Nodes, CPU, GPU: 8, 6.613, 20.335; 16, 11.282, 37.016; 32, 23.067, 63.876; 64, 42.284, 96.628; 128, 72.694, 144.424.
• #57: nanoseconds/day: 8x X5550 6.72; M2090 + 2x X5550 8.36. CPU node: 4 x 2 x $1000 = $8000. CPU + GPU node: 1 x 2 x $1000 + 2 x $2000 = $6000.
  • #58: GPU: 640 (watts) * 10,334 (seconds/nanosecond) = 6.6 MegaJoulesCPU: 760 (watts) * 12,895 (seconds/nanosecond) = 9.8 MegaJoules
• #64: Configs: 44 CPUs; 2 CPU + 1 GPU; 2 CPU + 2 GPU. Ns/day: 60, 25.3, 42.4. Price: 60000, 3000, 4000.
• #65: Configs: 44 CPUs; 2 CPU + 1 GPU; 2 CPU + 2 GPU. Ns/day: 60, 25.3, 42.4. Price: 60000, 3000, 4000. Scaled price: 1, 0.05, 0.0667. Perf/price: 1, 8.43, 10.6.
• #66: Configs: 64 CPU; 2 CPU + 1 GPU; 2 CPU + 2 GPU. Ns/day: 31, 8.9, 15.1. TDP (W): 6080, 428, 666. sec/ns: 2787.0967, 9707.8651, 5721.8543. Energy/ns (kJ): 16945.548, 4154.966, 3810.755.
• #70: Due to the great interest in speeding up quantum chemistry applications, NVIDIA has begun reaching out to these developers to see how it may assist their development on GPUs. These developers either have active GPU development projects or have released applications.
  • #79: Test case not specified in perf lab run
  • #81: Test case not specified in perf lab run
• #103: I am here today to talk to you about the value of seamlessly adding GPUs to the computer on which you run Quantum Espresso/PWscf and achieving phenomenal performance improvements. This small incremental investment will yield a significant performance payback. What is Quantum Espresso/PWscf: a set of programs used to calculate the electron configuration of atoms or molecules; uses plane-wave basis sets and quantum mechanical principles; highly compute-intensive. Benefits of GPU-accelerated computing: faster than CPU-only systems in all tests; performance boost much larger than the marginal price increase; power consumption more than halved in all simulations; GPUs scale very well on clusters with dozens of nodes, and beyond. Price assumes a FERMI workstation at ~$4000 and a C2050 at $1000. Walltimes, Shilu-3 / Water-on-Calcite: 6 OpenMP CPU nodes 1025 / 1560; 6 OpenMP CPU nodes + 1 GPU 275 / 480. FERMI (ICHEC): assembled workstation; CPU: 2x Intel Xeon X5650 (6-core), 24 GByte RAM; GPU: 2x C2050, GTX480, C2075; SW: CUDA 4.1, Intel compilers.
• #104: Walltimes, AUSURF k-point / gamma point: 6 OpenMP CPU nodes 7100 s / 7000 s; 6 OpenMP CPU nodes + 1 GPU 2350 s / 2000 s. FERMI (ICHEC): assembled workstation; CPU: 2x Intel Xeon X5650 (6-core), 24 GByte RAM; GPU: 2x C2050, GTX480, C2075; SW: CUDA 4.1, Intel compilers.
• #105: CPU: Intel X5550, TDP of 95W, priced at $350. GPU: NVIDIA M2070, TDP of 225W, priced at $2000. PLX (CINECA): IBM iDataPlex DX360M3, 264 GPU nodes; CPU: 2x Intel Westmere X5550 (6-core), 48 GByte RAM; GPU: 2x M2070; SW: CUDA 4.0, Intel compilers, PGI (11.x).
• #106: National average 9.83 cents/kWh. kWh/sim, tests/year, $/test, yearly energy bill: CPU 42.37, 2357, 4.16, $9816; GPU/CPU 23.214, 2400, 2.28, $5476. CPU: Intel X5550, TDP of 95W, priced at $350. GPU: NVIDIA M2070, TDP of 225W, priced at $2000. PLX (CINECA): IBM iDataPlex DX360M3, 264 GPU nodes; CPU: 2x Intel Westmere X5550 (6-core), 48 GByte RAM; GPU: 2x M2070; SW: CUDA 4.0, Intel compilers, PGI (11.x).
• #107: FERMI (ICHEC): assembled workstation. CPU: 2x Intel Xeon X5650 (6-core), 24 GByte RAM. GPU: 2x C2050, GTX480, C2075. SW: CUDA 4.1, Intel compilers.
• #108: Nodes (total CPU cores): 2 (16), 4 (32), 6 (48), 8 (64), 10 (80), 12 (96), 14 (112). Time (s), CPU: 31000, 16500, 11000, 9500, 7500, 6000, 5500. Time (s), GPU+CPU: 12500, 7000, 5000, 4500, 3500, 3000, 2500. Speedup: 2.48, 2.36, 2.2, 2.11, 2.14, 2, 2.2. STONEY (ICHEC): Bull Novascale R422-E2, 24 GPU nodes; CPU: 2x Intel (Nehalem EP) Xeon X5560, 48 GByte RAM; GPU: 2x M2090; SW: CUDA 4.0, Intel compilers.
• #109: Nodes (total CPU cores): 4 (48), 8 (96), 12 (144), 16 (192), 24 (288), 32 (384), 44 (528). Time (s), CPU: 3925, 2650, 2525, 2450, 1740, 1290, 1337. Time (s), GPU+CPU: 2425, 1437, 1075, 900, 737, 675, 637. Speedup: 1.62, 1.84, 2.35, 2.72, 2.36, 1.91, 2.10. PLX (CINECA): IBM iDataPlex DX360M3, 264 GPU nodes; CPU: 2x Intel Westmere (6-core), 48 GByte RAM; GPU: 2x M2070; SW: CUDA 4.0, Intel compilers, PGI (11.x).
• #111: I am here today to talk to you about the value of seamlessly adding GPUs to the computer on which you run TeraChem and achieving phenomenal performance improvements. This small incremental investment will yield a significant performance payback. Benefits of GPU acceleration with TeraChem: compete with supercomputers; more powerful hardware; significantly lower energy usage.
• #120: Before we end this session I would like to tell you about GPU Test Drive. It is an excellent resource for computational chemistry researchers such as yourself to evaluate the benefits of GPU computing in speeding up your simulations. Most importantly, it is free. NVIDIA, along with its partners, is offering access to a remotely hosted GPU cluster. You can run applications such as AMBER and NAMD to find out how your models speed up. You can also try code that you have developed to run on GPUs and see how it scales on an 8-GPU cluster. All you need to do is sign up and log in; it is really that easy! We have several partners demonstrating the GPU Test Drive on the GTC show floor; please plan on visiting them. Sign-up forms have been given out. If you are interested, please fill them out and return them to me.