         Updated: February 4, 2013
Molecular Dynamics (MD) Applications

AMBER
  Features Supported: PMEMD explicit solvent & GB implicit solvent
  GPU Perf: > 100 ns/day (JAC NVE on 2x K20s)
  Release Status: Released; multi-GPU, multi-node
  Notes/Benchmarks: AMBER 12, GPU revision support 12.2; http://ambermd.org/gpus/benchmarks.htm#Benchmarks

CHARMM
  Features Supported: Implicit (5x) and explicit (2x) solvent via OpenMM
  GPU Perf: 2x C2070 equals 32-35x X5667 CPUs
  Release Status: Released; single & multi-GPU in a single node
  Notes/Benchmarks: Release C37b1; http://www.charmm.org/news/c37b1.html#postjump

DL_POLY
  Features Supported: Two-body forces, link-cell pairs, Ewald SPME forces, Shake VV
  GPU Perf: 4x
  Release Status: Release V 4.03; multi-GPU, multi-node
  Notes/Benchmarks: Source only, results published; http://www.stfc.ac.uk/CSE/randd/ccg/software/DL_POLY/25526.aspx

GROMACS
  Features Supported: Implicit (5x) and explicit (2x) solvent
  GPU Perf: 165 ns/day (DHFR on 4x C2075s)
  Release Status: Released; multi-GPU, multi-node
  Notes/Benchmarks: Release 4.6; first multi-GPU support

LAMMPS
  Features Supported: Lennard-Jones, Gay-Berne, Tersoff & many more potentials
  GPU Perf: 3.5-18x on Titan
  Release Status: Released; multi-GPU, multi-node
  Notes/Benchmarks: http://lammps.sandia.gov/bench.html#desktop and http://lammps.sandia.gov/bench.html#titan

NAMD
  Features Supported: Full electrostatics with PME and most simulation features
  GPU Perf: 4.0 ns/day (F1-ATPase on 1x K20X)
  Release Status: Released; 100M-atom capable; multi-GPU, multi-node
  Notes/Benchmarks: NAMD 2.9

GPU Perf compared against a multi-core x86 CPU socket. GPU Perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.
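The packages above differ widely in features, but the GPU-accelerated core is the same inner loop: evaluate pair forces, then integrate Newton's equations. As a purely illustrative sketch (not code from any listed package), here is a minimal Lennard-Jones velocity-Verlet step in Python; the O(N^2) pair loop is exactly the part production codes parallelize on the GPU:

```python
def lj_force(dx, dy, dz, eps=1.0, sigma=1.0):
    """Force components on particle i from particle j, 12-6 Lennard-Jones."""
    r2 = dx * dx + dy * dy + dz * dz
    inv_r2 = 1.0 / r2
    s6 = (sigma * sigma * inv_r2) ** 3
    # F = 24*eps*(2*s12 - s6)/r^2 * r_vec
    f_over_r = 24.0 * eps * (2.0 * s6 * s6 - s6) * inv_r2
    return f_over_r * dx, f_over_r * dy, f_over_r * dz

def velocity_verlet(pos, vel, dt, n_steps, mass=1.0):
    """Advance positions/velocities with the velocity-Verlet scheme."""
    n = len(pos)

    def forces():
        f = [[0.0, 0.0, 0.0] for _ in range(n)]
        for i in range(n):            # all-pairs loop: the GPU-parallel hot spot
            for j in range(i + 1, n):
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                dz = pos[i][2] - pos[j][2]
                fx, fy, fz = lj_force(dx, dy, dz)
                f[i][0] += fx; f[i][1] += fy; f[i][2] += fz
                f[j][0] -= fx; f[j][1] -= fy; f[j][2] -= fz
        return f

    f = forces()
    for _ in range(n_steps):
        for i in range(n):
            for k in range(3):
                vel[i][k] += 0.5 * dt * f[i][k] / mass   # half-kick
                pos[i][k] += dt * vel[i][k]              # drift
        f = forces()
        for i in range(n):
            for k in range(3):
                vel[i][k] += 0.5 * dt * f[i][k] / mass   # second half-kick
    return pos, vel

# Two particles beyond the potential minimum (r = 1.5 sigma) attract each other.
pos = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]]
vel = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
pos, vel = velocity_verlet(pos, vel, dt=0.001, n_steps=100)
```

Real codes replace the all-pairs loop with cell/neighbor lists (DL_POLY's "link-cell pairs" above) and add long-range electrostatics (PME), but the integration skeleton is the same.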
New/Additional MD Applications Ramping

Abalone
  Features Supported: Simulations
  GPU Perf: 4-29X (on 1060 GPU)
  Release Status: Released, Version 1.8.51; single GPU
  Notes: Agile Molecule, Inc.

Ascalaph
  Features Supported: Computation of non-valent interactions
  GPU Perf: 4-29X (on 1060 GPU)
  Release Status: Released, Version 1.1.4; single GPU
  Notes: Agile Molecule, Inc.

ACEMD
  Features Supported: Written for use only on GPUs
  GPU Perf: 150 ns/day (DHFR on 1x K20)
  Release Status: Released; single and multi-GPU
  Notes: Production bio-molecular dynamics (MD) software specially optimized to run on GPUs

Folding@Home
  Features Supported: Powerful distributed-computing molecular dynamics system; implicit solvent and folding
  GPU Perf: Depends upon number of GPUs
  Release Status: Released; GPUs and CPUs
  Notes: http://folding.stanford.edu; GPUs get 4X the points of CPUs

GPUGrid.net
  Features Supported: High-performance all-atom biomolecular simulations; explicit solvent and binding
  GPU Perf: Depends upon number of GPUs
  Release Status: Released; NVIDIA GPUs only
  Notes: http://www.gpugrid.net/

HALMD
  Features Supported: Simple fluids and binary mixtures (pair potentials, high-precision NVE and NVT, dynamic correlations)
  GPU Perf: Up to 66x on 2090 vs. 1 CPU core
  Release Status: Released, Version 0.2.0; single GPU
  Notes: http://halmd.org/benchmarks.html#supercooled-binary-mixture-kob-andersen

HOOMD-Blue
  Features Supported: Written for use only on GPUs
  GPU Perf: Kepler 2X faster than Fermi
  Release Status: Released, Version 0.11.2; single and multi-GPU on 1 node
  Notes: http://codeblue.umich.edu/hoomd-blue/; multi-GPU w/ MPI in March 2013

OpenMM
  Features Supported: Implicit and explicit solvent, custom forces
  GPU Perf: Implicit: 127-213 ns/day; explicit: 18-55 ns/day (DHFR)
  Release Status: Released, Version 4.1.1; multi-GPU
  Notes: Library and application for molecular dynamics on high-performance …
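The ns/day throughput figures quoted above (e.g. ACEMD's 150 ns/day on DHFR) translate directly into wall-clock cost for a target amount of simulated time. A small helper to make that arithmetic explicit (illustrative only; the function names are ours, not from any listed package):

```python
def wall_clock_days(target_ns, ns_per_day):
    """Wall-clock days needed to simulate target_ns at a sustained ns/day rate."""
    return target_ns / ns_per_day

def speedup(ns_per_day_a, ns_per_day_b):
    """Relative throughput of two configurations reporting ns/day."""
    return ns_per_day_a / ns_per_day_b

# A 1-microsecond (1000 ns) trajectory at 150 ns/day takes about a week:
days = wall_clock_days(1000, 150.0)   # ~6.7 days
```

This is why the tables report ns/day rather than raw FLOPS: it is the number a practitioner divides into the simulated time they need.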
Quantum Chemistry Applications

Abinit
  Features Supported: Local Hamiltonian, non-local Hamiltonian, LOBPCG algorithm, diagonalization/orthogonalization
  GPU Perf: 1.3-2.7X
  Release Status: Released, Version 7.0.5; multi-GPU support
  Notes: www.abinit.org

ACES III
  Features Supported: Integrating GPU scheduling into the SIAL programming language and SIP runtime environment
  GPU Perf: 10X on kernels
  Release Status: Under development; multi-GPU support
  Notes: http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/deumens_ESaccel_2012.pdf

ADF
  Features Supported: Fock matrix, Hessians
  GPU Perf: TBD
  Release Status: Pilot project completed, under development; multi-GPU support
  Notes: www.scm.com

BigDFT
  Features Supported: DFT; Daubechies wavelets; part of Abinit
  GPU Perf: 5-25X (1 CPU core to GPU kernel)
  Release Status: Released June 2009, current release 1.6.0; multi-GPU support
  Notes: http://inac.cea.fr/L_Sim/BigDFT/news.html, http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/BigDFT-Formalism.pdf and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/BigDFT-HPC-tues.pdf

Casino
  Features Supported: TBD
  GPU Perf: TBD
  Release Status: Under development, Spring 2013 release; multi-GPU support
  Notes: http://www.tcm.phy.cam.ac.uk/~mdt26/casino.html

CP2K
  Features Supported: DBCSR (sparse matrix multiply library)
  GPU Perf: 2-7X
  Release Status: Under development; multi-GPU support
  Notes: http://www.olcf.ornl.gov/wp-content/training/ascc_2012/friday/ACSS_2012_VandeVondele_s.pdf

GAMESS-US
  Features Supported: Libqc with Rys quadrature algorithm, Hartree-Fock; MP2 and CCSD in Q4 2012
  GPU Perf: 1.3-1.6X; 2.3-2.9x HF
  Release Status: Released; multi-GPU support
  Notes: Next release Q4 2012. http://www.msg.ameslab.gov/gamess/index.html
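CP2K's GPU effort centres on DBCSR, a blocked sparse matrix multiply library: the matrix is stored as small dense blocks, and each matching block pair becomes one small dense multiply, which is the unit of work a GPU can batch. A toy single-node sketch of that blocking idea in plain Python (our own illustration, not DBCSR's actual data structures or API):

```python
def dense_mm(A, B):
    """Small dense matrix product (matrices as lists of rows)."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

def block_add(C, D):
    """Accumulate dense block D into dense block C in place."""
    for i in range(len(C)):
        for j in range(len(C[0])):
            C[i][j] += D[i][j]

def blocked_sparse_mm(A_blocks, B_blocks):
    """Blocked sparse product: inputs map (block_row, block_col) -> dense block.
    Each matching (bi,bk) x (bk,bj) pair is one small dense multiply --
    the kind of batched kernel a GPU executes efficiently."""
    C_blocks = {}
    for (bi, bk), Ab in A_blocks.items():
        for (bk2, bj), Bb in B_blocks.items():
            if bk == bk2:
                prod = dense_mm(Ab, Bb)
                if (bi, bj) in C_blocks:
                    block_add(C_blocks[(bi, bj)], prod)
                else:
                    C_blocks[(bi, bj)] = prod
    return C_blocks

# Two 2x2-block matrices with 2x2 dense blocks, mostly empty.
I2 = [[1.0, 0.0], [0.0, 1.0]]
two = [[2.0, 0.0], [0.0, 2.0]]
A = {(0, 0): I2, (1, 1): two}
B = {(0, 0): two, (1, 0): I2}
C = blocked_sparse_mm(A, B)
```

The payoff of blocking is that sparsity is handled by the block index while the inner arithmetic stays dense and regular, which is what the 2-7X figure above is measuring.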
GAMESS-UK
  Features Supported: (ss|ss)-type integrals within calculations using Hartree-Fock ab initio methods and density functional theory; supports organics & inorganics
  GPU Perf: 8x
  Release Status: Release in 2012; multi-GPU support
  Notes: http://www.ncbi.nlm.nih.gov/pubmed/21541963

Gaussian
  Features Supported: Joint PGI, NVIDIA & Gaussian collaboration
  GPU Perf: TBD
  Release Status: Under development; multi-GPU support
  Notes: Announced Aug. 29, 2011; http://www.gaussian.com/g_press/nvidia_press.htm

GPAW
  Features Supported: Electrostatic Poisson equation, orthonormalizing of vectors, residual minimization method (RMM-DIIS)
  GPU Perf: 8x
  Release Status: Released; multi-GPU support
  Notes: https://wiki.fysik.dtu.dk/gpaw/devel/projects/gpu.html, Samuli Hakala (CSC Finland) & Chris O'Grady (SLAC)

Jaguar
  Features Supported: Investigating GPU acceleration
  GPU Perf: TBD
  Release Status: Under development; multi-GPU support
  Notes: Schrodinger, Inc.; http://www.schrodinger.com/kb/278

MOLCAS
  Features Supported: CUBLAS support
  GPU Perf: 1.1x
  Release Status: Released, Version 7.8; single GPU, additional GPU support coming in Version 8
  Notes: www.molcas.org

MOLPRO
  Features Supported: Density-fitted MP2 (DF-MP2), density-fitted local correlation methods (DF-RHF, DF-KS), DFT
  GPU Perf: 1.7-2.3X projected
  Release Status: Under development; multiple GPUs
  Notes: www.molpro.net; Hans-Joachim Werner
MOPAC2009
  Features Supported: Pseudodiagonalization, full diagonalization, and density matrix assembling
  GPU Perf: 3.8-14X
  Release Status: Under development; single GPU
  Notes: Academic port; http://openmopac.net

NWChem
  Features Supported: Triples part of Reg-CCSD(T), CCSD & EOMCCSD task schedulers
  GPU Perf: 3-10X projected
  Release Status: Release targeting March 2013; multiple GPUs
  Notes: Development GPGPU benchmarks: www.nwchem-sw.org and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Krishnamoorthy-ESCMA12.pdf

Octopus
  Features Supported: DFT and TDDFT
  GPU Perf: TBD
  Release Status: Released
  Notes: http://www.tddft.org/programs/octopus/

PEtot
  Features Supported: Density functional theory (DFT) plane-wave pseudopotential calculations
  GPU Perf: 6-10X
  Release Status: Released; multi-GPU
  Notes: First-principles materials code that computes the behavior of the electron structures of materials

Q-CHEM
  Features Supported: RI-MP2
  GPU Perf: 8x-14x
  Release Status: Released, Version 4.0
  Notes: http://www.q-chem.com/doc_for_web/qchem_manual_4.0.pdf
QMCPACK
  Features Supported: Main features
  GPU Perf: 3-4x
  Release Status: Released; multiple GPUs
  Notes: NCSA, University of Illinois at Urbana-Champaign; http://cms.mcc.uiuc.edu/qmcpack/index.php/GPU_version_of_QMCPACK

Quantum Espresso/PWscf
  Features Supported: PWscf package: linear algebra (matrix multiply), explicit computational kernels, 3D FFTs
  GPU Perf: 2.5-3.5x
  Release Status: Released, Version 5.0; multiple GPUs
  Notes: Created by Irish Centre for High-End Computing; http://www.quantum-espresso.org/index.php and http://www.quantum-espresso.org/

TeraChem
  Features Supported: "Full GPU-based solution"
  GPU Perf: 44-650X vs. GAMESS CPU version
  Release Status: Released, Version 1.5; multi-GPU/single node
  Notes: Completely redesigned to exploit GPU parallelism. YouTube: http://youtu.be/EJODzk6RFxE?hd=1 and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Luehr-ESCMA.pdf

VASP
  Features Supported: Hybrid Hartree-Fock DFT functionals including exact exchange
  GPU Perf: 2x (2 GPUs comparable to 128 CPU cores)
  Release Status: Available on request; multiple GPUs
  Notes: By Carnegie Mellon University; http://arxiv.org/pdf/1111.0716.pdf

WL-LSMS
  Features Supported: Generalized Wang-Landau method
  GPU Perf: 3x with 32 GPUs vs. 32 (16-core) CPUs
  Release Status: Under development; multi-GPU support
  Notes: Electronic Structure Determination Workshop 2012, NICS; http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Eisenbach_OakRidge_February.pdf
Viz, "Docking" and Related Applications Growing

Amira 5®
  Features Supported: 3D visualization of volumetric data and surfaces
  GPU Perf: 70x
  Release Status: Released, Version 5.3.3; single GPU
  Notes: Visualization from Visage Imaging. Next release, 5.4, will use the GPU for general-purpose processing in some functions. http://www.visageimaging.com/overview.html

BINDSURF
  Features Supported: Allows fast processing of large ligand databases
  GPU Perf: 100X
  Release Status: Available upon request to authors; single GPU
  Notes: High-throughput parallel blind virtual screening; http://www.biomedcentral.com/1471-2105/13/S14/S13

BUDE
  Features Supported: Empirical free-energy forcefield
  GPU Perf: 6.5-13.4X
  Release Status: Released; single GPU
  Notes: University of Bristol; http://www.bris.ac.uk/biochemistry/cpfg/bude/bude.htm

Core Hopping
  Features Supported: GPU-accelerated application
  GPU Perf: 3.75-5000X
  Release Status: Released, Suite 2011; single and multi-GPU
  Notes: Schrodinger, Inc.; http://www.schrodinger.com/products/14/32/

FastROCS
  Features Supported: Real-time shape similarity searching/comparison
  GPU Perf: 800-3000X
  Release Status: Released; single and multi-GPU
  Notes: OpenEye Scientific Software; http://www.eyesopen.com/fastrocs

PyMOL
  Features Supported: Lines: 460% increase; cartoons: 1246% increase; surface: 1746% increase; spheres: 753% increase; ribbon: 426% increase
  GPU Perf: 1700x
  Release Status: Released, Version 1.5; single GPU
  Notes: http://pymol.org/

VMD
  Features Supported: High-quality rendering, large structures (100 million atoms), analysis and visualization tasks, multiple GPU support for display of molecular …
  GPU Perf: 100-125X or greater on kernels
  Release Status: Released, Version 1.9
  Notes: Visualization from University of Illinois at Urbana-Champaign; http://www.ks.uiuc.edu/Research/vmd/
Bioinformatics Applications

BarraCUDA
  Features Supported: Alignment of short sequencing reads
  GPU Speedup: 6-10x
  Release Status: Version 0.6.2 – 3/2012; multi-GPU, multi-node
  Website: http://seqbarracuda.sourceforge.net/

CUDASW++
  Features Supported: Parallel Smith-Waterman database search
  GPU Speedup: 10-50x
  Release Status: Version 2.0.8 – Q1/2012; multi-GPU, multi-node
  Website: http://sourceforge.net/projects/cudasw/

CUSHAW
  Features Supported: Parallel, accurate long-read aligner for large genomes
  GPU Speedup: 10x
  Release Status: Version 1.0.40 – 6/2012; multiple GPUs
  Website: http://cushaw.sourceforge.net/

GPU-BLAST
  Features Supported: Protein alignment according to BLASTP
  GPU Speedup: 3-4x
  Release Status: Version 2.2.26 – 3/2012; single GPU
  Website: http://eudoxus.cheme.cmu.edu/gpublast/gpublast.html

GPU-HMMER
  Features Supported: Parallel local and global search of Hidden Markov Models
  GPU Speedup: 60-100x
  Release Status: Version 2.3.2 – Q1/2012; multi-GPU, multi-node
  Website: http://www.mpihmmer.org/installguideGPUHMMER.htm

mCUDA-MEME
  Features Supported: Scalable motif discovery algorithm based on MEME
  GPU Speedup: 4-10x
  Release Status: Version 3.0.12; multi-GPU, multi-node
  Website: https://sites.google.com/site/yongchaosoftware/mcuda-meme

SeqNFind
  Features Supported: Hardware and software for reference assembly, blast, SW, HMM, de novo assembly
  GPU Speedup: 400x
  Release Status: Released; multi-GPU, multi-node
  Website: http://www.seqnfind.com/

UGENE
  Features Supported: Fast short-read alignment
  GPU Speedup: 6-8x
  Release Status: Version 1.11 – 5/2012; multi-GPU, multi-node
  Website: http://ugene.unipro.ru/
                                                           GPU Perf compared against same or similar code running on single CPU machine
                Parallel linear regression on                                           Performance measured internally or independently
GPU Accelerated Computational Chemistry Applications

MD Average Speedups

[Bar chart: performance relative to CPU only (0-10x) for CPU, CPU + K10, CPU + K20, CPU + K20X, CPU + 2x K10, CPU + 2x K20, and CPU + 2x K20X]

The blue node contains Dual E5-2687W CPUs (8 cores per CPU). The green nodes contain Dual E5-2687W CPUs (8 cores per CPU) and 1 or 2 NVIDIA K10, K20, or K20X GPUs.

Average speedup calculated from 4 AMBER, 3 NAMD, 3 LAMMPS, and 1 GROMACS test cases. Error bars show the maximum and minimum speedup for each hardware configuration.
Molecular Dynamics (MD) Applications

Application | Features Supported | GPU Perf | Release Status | Notes/Benchmarks

AMBER | PMEMD explicit solvent & GB implicit solvent | > 100 ns/day (JAC NVE on 2x K20s) | Released, multi-GPU, multi-node | AMBER 12, GPU Support Revision 12.2; http://ambermd.org/gpus/benchmarks.htm#Benchmarks

CHARMM | Implicit (5x) and explicit (2x) solvent via OpenMM | 2x C2070 equals 32-35x X5667 CPUs | Released, single & multi-GPU in a single node | Release C37b1; http://www.charmm.org/news/c37b1.html#postjump

DL_POLY | Two-body forces, link-cell pairs, Ewald SPME forces, Shake VV | 4x | Release V4.03, multi-GPU, multi-node | Source only, results published; http://www.stfc.ac.uk/CSE/randd/ccg/software/DL_POLY/25526.aspx

GROMACS | Implicit (5x) and explicit (2x) solvent | 165 ns/day (DHFR on 4x C2075s) | Released, multi-GPU, multi-node | Release 4.6; first multi-GPU support

LAMMPS | Lennard-Jones, Gay-Berne, Tersoff & many more potentials | 3.5-18x on Titan | Released, multi-GPU, multi-node | http://lammps.sandia.gov/bench.html#desktop and http://lammps.sandia.gov/bench.html#titan

NAMD | Full electrostatics with PME and most simulation features | 4.0 ns/day (F1-ATPase on 1x K20X) | Released, 100M-atom capable, multi-GPU, multi-node | NAMD 2.9

GPU Perf compared against a multi-core x86 CPU socket. GPU Perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.
New/Additional MD Applications Ramping

Application | Features Supported | GPU Perf | Release Status | Notes

Abalone | Simulations | 4-29x (on 1060 GPU) | Released, Version 1.8.51, single GPU | Agile Molecule, Inc.

Ascalaph | Computation of non-valent interactions | 4-29x (on 1060 GPU) | Released, Version 1.1.4, single GPU | Agile Molecule, Inc.

ACEMD | Written for use only on GPUs | 150 ns/day (DHFR on 1x K20) | Released, single and multi-GPU | Production bio-molecular dynamics (MD) software specially optimized to run on GPUs

Folding@Home | Powerful distributed-computing molecular dynamics system; implicit solvent and folding | Depends upon number of GPUs | Released, GPUs and CPUs | http://folding.stanford.edu; GPUs get 4x the points of CPUs

GPUGrid.net | High-performance all-atom biomolecular simulations; explicit solvent and binding | Depends upon number of GPUs | Released, NVIDIA GPUs only | http://www.gpugrid.net/

HALMD | Simple fluids and binary mixtures (pair potentials, high-precision NVE and NVT, dynamic correlations) | Up to 66x on 2090 vs. 1 CPU core | Released, Version 0.2.0, single GPU | http://halmd.org/benchmarks.html#supercooled-binary-mixture-kob-andersen

HOOMD-Blue | Written for use only on GPUs | Kepler 2x faster than Fermi | Released, Version 0.11.2, single and multi-GPU on 1 node | http://codeblue.umich.edu/hoomd-blue/; multi-GPU w/ MPI in March 2013

OpenMM | Implicit and explicit solvent, custom forces | Implicit: 127-213 ns/day, explicit: 18-55 ns/day (DHFR) | Released, Version 4.1.1, multi-GPU | Library and application for molecular dynamics on high-performance

GPU Perf compared against a multi-core x86 CPU socket. GPU Perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.
Built from Ground Up for GPUs
Computational Chemistry

What: Study disease & discover drugs; predict drug and protein interactions.

Why: Speed of simulations is critical. Enables study of longer timeframes, larger systems, and more simulations.

How: GPUs increase throughput & accelerate simulations.

GPU READY APPLICATIONS: Abalone, ACEMD, AMBER, DL_POLY, GAMESS, GROMACS, LAMMPS, NAMD, NWChem, Q-CHEM, Quantum Espresso, TeraChem

AMBER 11 Application: 4.6x performance increase with 2 GPUs with only a 54% added cost*

•    AMBER 11 Cellulose NPT on 2x E5670 CPUs + 2x Tesla C2090s (per node) vs. 2x E5670 CPUs (per node)
•    Cost of CPU node assumed to be $9333. Cost of adding two (2) C2090s to a single node is assumed to be $5333.
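The price/performance claim above can be sanity-checked with a quick calculation. The figures below are the slide's own assumptions, not measurements; note that the footnoted prices work out to roughly 57% added cost, close to the slide's 54% figure (which may use a different price base):

```python
# Back-of-envelope check of the AMBER 11 price/performance claim.
cpu_node_cost = 9333.0    # assumed dual-CPU node cost (slide footnote)
gpu_addon_cost = 5333.0   # assumed cost of adding 2x Tesla C2090 (slide footnote)
speedup = 4.6             # reported speedup with 2 GPUs (Cellulose NPT)

added_cost = gpu_addon_cost / cpu_node_cost  # added cost as a fraction of base node price
perf_per_dollar = speedup * cpu_node_cost / (cpu_node_cost + gpu_addon_cost)

print(f"added cost: {added_cost:.0%}")                                  # 57%
print(f"performance per dollar: {perf_per_dollar:.1f}x CPU-only node")  # 2.9x
```

In other words, even after paying for the GPUs, each dollar spent on the GPU-equipped node buys roughly 2.9x the simulation throughput of the CPU-only node.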
AMBER 12
GPU Support Revision 12.2 (1/22/2013)
Kepler - Our Fastest Family of GPUs Yet

[Bar chart: Factor IX throughput, nanoseconds/day]
1 CPU Node: 3.42
1 CPU Node + M2090: 11.85 (3.5x)
1 CPU Node + K10: 18.90 (5.6x)
1 CPU Node + K20: 22.44 (6.6x)
1 CPU Node + K20X: 25.39 (7.4x)

Running AMBER 12 GPU Support Revision 12.1. The blue node contains Dual E5-2687W CPUs (8 cores per CPU). The green nodes contain Dual E5-2687W CPUs (8 cores per CPU) and either 1x NVIDIA M2090, 1x K10, 1x K20, or 1x K20X for the GPU.

GPU speedup/throughput increased from 3.5x (with M2090) to 7.4x (with K20X) when compared to a CPU-only node.
K10 Accelerates Simulations of All Sizes

[Bar chart: speedup compared to CPU only]
CPU, all molecules: 1x (baseline)
TRPcage (GB): 2.00
JAC NVE (PME): 5.50
Factor IX NVE (PME): 5.53
Cellulose NVE (PME): 5.04
Myoglobin (GB): 19.98
Nucleosome (GB): 24.00

Running AMBER 12 GPU Support Revision 12.1. The blue node contains Dual E5-2687W CPUs (8 cores per CPU). The green nodes contain Dual E5-2687W CPUs (8 cores per CPU) and 1x NVIDIA K10 GPU.

Gain 24x performance by adding just 1 GPU when compared to dual-CPU performance.
K20 Accelerates Simulations of All Sizes

[Bar chart: speedup compared to CPU only]
CPU, all molecules: 1.00 (baseline)
TRPcage (GB): 2.66
JAC NVE (PME): 6.50
Factor IX NVE (PME): 6.56
Cellulose NVE (PME): 7.28
Myoglobin (GB): 25.56
Nucleosome (GB): 28.00

Running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off. The blue node contains 2x Intel E5-2687W CPUs (8 cores per CPU). Each green node contains 2x Intel E5-2687W CPUs (8 cores per CPU) plus 1x NVIDIA K20 GPU.

Gain 28x throughput/performance by adding just one K20 GPU when compared to dual-CPU performance.

AMBER Benchmark Report, Revision 2.0, dated Nov. 5, 2012
K20X Accelerates Simulations of All Sizes

[Bar chart: speedup compared to CPU only]
CPU, all molecules: 1x (baseline)
TRPcage (GB): 2.79
JAC NVE (PME): 7.15
Factor IX NVE (PME): 7.43
Cellulose NVE (PME): 8.30
Myoglobin (GB): 28.59
Nucleosome (GB): 31.30

Running AMBER 12 GPU Support Revision 12.1. The blue node contains Dual E5-2687W CPUs (8 cores per CPU). The green nodes contain Dual E5-2687W CPUs (8 cores per CPU) and 1x NVIDIA K20X GPU.

Gain 31x performance by adding just one K20X GPU when compared to dual-CPU performance.

AMBER Benchmark Report, Revision 2.0, dated Nov. 5, 2012
K10 Strong Scaling over Nodes

[Line chart: Cellulose 408K atoms (NPT), nanoseconds/day vs. number of nodes (1, 2, 4), CPU only vs. with GPU]
1 node: GPU nodes 5.1x faster than CPU only
2 nodes: 3.6x faster
4 nodes: 2.4x faster

Running AMBER 12 with CUDA 4.2, ECC off. The blue nodes contain 2x Intel X5670 CPUs (6 cores per CPU). The green nodes contain 2x Intel X5670 CPUs (6 cores per CPU) plus 2x NVIDIA K10 GPUs.

GPUs significantly outperform CPUs while scaling over multiple nodes.
Kepler – Universally Faster

[Bar chart: speedups compared to CPU only (0-8x) for JAC, Factor IX, and Cellulose on CPU + K10, CPU + K20, and CPU + K20X]

Running AMBER 12 GPU Support Revision 12.1. The CPU-only node contains Dual E5-2687W CPUs (8 cores per CPU). The Kepler nodes contain Dual E5-2687W CPUs (8 cores per CPU) and 1x NVIDIA K10, K20, or K20X GPU.

The Kepler GPUs accelerated all simulations, up to 8x.
K10 Extreme Performance

[Bar chart: JAC 23K atoms (NVE), nanoseconds/day]
1 CPU node: 12.47
1 node + 2x K10: 97.99

Running AMBER 12 GPU Support Revision 12.1. The blue node contains Dual E5-2687W CPUs (8 cores per CPU). The green node contains Dual E5-2687W CPUs (8 cores per CPU) and 2x NVIDIA K10 GPUs.

Gain 7.8x performance (DHFR) by adding just 2 GPUs when compared to dual-CPU performance.
K20 Extreme Performance

[Bar chart: DHFR JAC 23K atoms (NVE), nanoseconds/day]
1 CPU node: 12.47
1 node + 2x K20: 95.59

Running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off. The blue node contains 2x Intel E5-2687W CPUs (8 cores per CPU). Each green node contains 2x Intel E5-2687W CPUs (8 cores per CPU) plus 2x NVIDIA K20 GPUs.

Gain > 7.5x throughput/performance (DHFR) by adding just 2 K20 GPUs when compared to dual-CPU performance.

AMBER Benchmark Report, Revision 2.0, dated Nov. 5, 2012
Replace 8 Nodes with 1 K20 GPU

[Bar chart: DHFR throughput and cost]
8 CPU nodes: 65.00 ns/day at $32,000
1 node + K20: 81.09 ns/day at $6,500

Running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off. The eight (8) blue nodes each contain 2x Intel E5-2687W CPUs (8 cores per CPU). Each green node contains 2x Intel E5-2687W CPUs (8 cores per CPU) plus 1x NVIDIA K20 GPU.

Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.

Cut down simulation costs to ¼ and gain higher performance.

AMBER Benchmark Report, Revision 2.0, dated Nov. 5, 2012
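The figures above can be folded into a single throughput-per-dollar number. The prices are the slide's illustrative list prices (actual pricing varies by vendor), so this is a sketch, not a procurement calculation:

```python
# Throughput-per-dollar comparison using the slide's DHFR figures.
cpu_cluster = {"ns_per_day": 65.00, "cost": 32_000}  # 8 dual-CPU nodes (slide pricing)
gpu_node    = {"ns_per_day": 81.09, "cost": 6_500}   # 1 node + 1x K20 (slide pricing)

def throughput_per_dollar(cfg):
    """Nanoseconds of simulation per day, per dollar of hardware."""
    return cfg["ns_per_day"] / cfg["cost"]

ratio = throughput_per_dollar(gpu_node) / throughput_per_dollar(cpu_cluster)
print(f"GPU node delivers {ratio:.1f}x the throughput per dollar")  # 6.1x
```

By this measure the single K20-equipped node delivers about 6x the simulation throughput per hardware dollar of the eight-node CPU cluster, while also running the benchmark faster in absolute terms.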
Replace 7 Nodes with 1 K10 GPU

[Chart: Performance on JAC NVE (nanoseconds/day) and cost: CPU-only cluster $32,000 vs. GPU-enabled node $7,000; DHFR]

Running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off.
The eight blue nodes each contain 2x Intel E5-2687W CPUs (8 cores per CPU).
The green node contains 2x Intel E5-2687W CPUs (8 cores per CPU) plus 1x NVIDIA K10 GPU.
Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.

Cut down simulation costs to ¼ and increase performance by 70%.
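The cost claim above can be restated as dollars per unit of throughput. This is a hedged sketch: the $32,000 and $7,000 node prices come from the slide, but the ns/day figures are illustrative placeholders (the slide reports "70% more performance" rather than exact throughput numbers).

```python
# Hedged sketch: cost-per-throughput comparison for the JAC NVE slide.
# Dollar figures are from the slide; ns/day values are assumed placeholders.
def cost_per_ns_day(node_cost_usd, ns_per_day):
    """Dollars spent per nanosecond/day of sustained throughput."""
    return node_cost_usd / ns_per_day

cpu_cluster = cost_per_ns_day(32_000, 40.0)       # 8 CPU-only nodes (assumed rate)
gpu_node    = cost_per_ns_day(7_000, 40.0 * 1.7)  # 1 node + K10, ~70% faster

print(f"CPU cluster: ${cpu_cluster:.0f} per ns/day")
print(f"GPU node:    ${gpu_node:.0f} per ns/day")
print(f"Cost-efficiency gain: {cpu_cluster / gpu_node:.1f}x")
```

Under these assumptions the GPU node delivers each ns/day of throughput at roughly an eighth of the cost, which is how "¼ the cost" and "70% more performance" compound.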
Extra CPUs decrease Performance

[Chart: Cellulose NVE, nanoseconds/day; 1 CPU + 2 GPUs vs. 2 CPUs + 2 GPUs; CPU only vs. CPU with dual K20s]

Running AMBER 12 GPU Support Revision 12.1.
The orange bars contain one E5-2687W CPU (8 cores per CPU).
The blue bars contain dual E5-2687W CPUs (8 cores per CPU).

When used with GPUs, dual CPU sockets perform worse than single CPU sockets.
Kepler - Greener Science

[Chart: Energy used in simulating 1 ns of DHFR (JAC), kJ (lower is better): CPU Only, CPU + K10, CPU + K20, CPU + K20X]

Running AMBER 12 GPU Support Revision 12.1.
The blue node contains dual E5-2687W CPUs (150W each, 8 cores per CPU).
The green nodes contain dual E5-2687W CPUs (8 cores per CPU) and 1x NVIDIA K10, K20, or K20X GPU (235W each).

Energy Expended = Power x Time

The GPU-accelerated systems use 65-75% less energy.
Recommended GPU Node Configuration for AMBER Computational Chemistry
Workstation or Single Node Configuration

  # of CPU sockets              2
  Cores per CPU socket          4+ (1 CPU core drives 1 GPU)
  CPU speed (GHz)               2.66+
  System memory per node (GB)   16
  GPUs                          Kepler K10, K20, K20X; Fermi M2090, M2075, C2075
  # of GPUs per CPU socket      1-2 (4 GPUs on 1 socket is good for 4 fast serial GPU runs)
  GPU memory preference (GB)    6
  GPU to CPU connection         PCIe 2.0 x16 or higher
  Server storage                2 TB

Scale to multiple nodes with the same single-node configuration.
Benefits of GPU-Accelerated AMBER Computing
Faster than CPU-only systems in all tests
Most major compute-intensive aspects of classical MD ported
Large performance boost with marginal price increase
Energy usage cut by more than half
GPUs scale well within a node and over multiple nodes
K20 GPU is our fastest and lowest-power high-performance GPU yet

Try GPU accelerated AMBER for free – www.nvidia.com/GPUTestDrive
NAMD 2.9
Kepler - Our Fastest Family of GPUs Yet

[Chart: ApoA1 (Apolipoprotein A1), nanoseconds/day: 1 CPU node 1.37; + M2090 2.63 (1.9x); + K10 3.45 (2.5x); + K20 3.57 (2.6x); + K20X 4.00 (2.9x)]

Running NAMD version 2.9.
The blue node contains dual E5-2687W CPUs (8 cores per CPU).
The green nodes contain dual E5-2687W CPUs (8 cores per CPU) and either 1x NVIDIA M2090, 1x K10, 1x K20, or 1x K20X for the GPU.

GPU speedup/throughput increased from 1.9x (with M2090) to 2.9x (with K20X) when compared to a CPU-only node.

NAMD Benchmark Report, Revision 2.0, dated Nov. 5, 2012
Accelerates Simulations of All Sizes

[Chart: Speedup compared to CPU only for ApoA1, F1-ATPase, and STMV: 2.4x-2.7x]

Running NAMD 2.9 with CUDA 4.0, ECC off.
The blue node contains 2x Intel E5-2687W CPUs (8 cores per CPU).
Each green node contains 2x Intel E5-2687W CPUs (8 cores per CPU) plus 1x NVIDIA K20 GPU.

Gain 2.5x throughput/performance by adding just 1 GPU when compared to dual-CPU performance.
Kepler – Universally Faster

[Chart: Speedup compared to CPU only for F1-ATPase, ApoA1, and STMV; average acceleration: 1x K10 2.4x; 1x K20 2.6x; 1x K20X 2.9x; 2x K10 4.3x; 2x K20 4.7x; 2x K20X 5.1x]

Running NAMD version 2.9.
The CPU-only node contains dual E5-2687W CPUs (8 cores per CPU).
The Kepler nodes contain dual E5-2687W CPUs (8 cores per CPU) and 1 or 2 NVIDIA K10, K20, or K20X GPUs.

The Kepler GPUs accelerate all simulations, up to 5x (average acceleration shown in bars).
Outstanding Strong Scaling with Multi-STMV

[Chart: 100 STMV on hundreds of nodes; nanoseconds/day vs. node count (32-768) for Fermi XK6 vs. CPU XK6; GPU advantage from 3.8x at 32 nodes to 2.7x at 768 nodes; concatenation of 100 Satellite Tobacco Mosaic Virus]

Running NAMD version 2.9.
Each blue XE6 CPU node contains 1x AMD Opteron 1600 (16 cores per CPU).
Each green XK6 CPU+GPU node contains 1x AMD Opteron 1600 (16 cores per CPU) and an additional 1x NVIDIA X2090 GPU.

Accelerate your science by 2.7-3.8x when compared to CPU-based supercomputers.
Replace 3 Nodes with 1 M2090 GPU

[Chart: F1-ATPase; 4 CPU nodes: 0.63 ns/day at $8,000; 1 CPU node + 1x M2090 GPU: 0.74 ns/day at $4,000]

Running NAMD version 2.9.
Each blue node contains 2x Intel Xeon X5550 CPUs (4 cores per CPU).
The green node contains 2x Intel Xeon X5550 CPUs (4 cores per CPU) and 1x NVIDIA M2090 GPU.
Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.

Speedup of 1.2x for 50% of the cost.
K20 - Greener: Twice the Science per Watt

[Chart: Energy used in simulating 1 nanosecond of ApoA1, kJ (lower is better): 1 node vs. 1 node + 2x K20]

Running NAMD version 2.9.
Each blue node contains dual E5-2687W CPUs (95W, 4 cores per CPU).
Each green node contains 2x Intel Xeon X5550 CPUs (95W, 4 cores per CPU) and 2x NVIDIA K20 GPUs (225W per GPU).

Energy Expended = Power x Time

Cut down energy usage by ½ with GPUs.
Kepler - Greener: Twice the Science per Joule

[Chart: Energy used in simulating 1 ns of STMV (Satellite Tobacco Mosaic Virus), kJ (lower is better): CPU Only, CPU + 2 K10s, CPU + 2 K20s, CPU + 2 K20Xs]

Running NAMD version 2.9.
The blue node contains dual E5-2687W CPUs (150W each, 8 cores per CPU).
The green nodes contain dual E5-2687W CPUs (8 cores per CPU) and 2x NVIDIA K10, K20, or K20X GPUs (235W each).

Energy Expended = Power x Time

Cut down energy usage by ½ with GPUs.
Recommended GPU Node Configuration for NAMD Computational Chemistry
Workstation or Single Node Configuration

  # of CPU sockets               2
  Cores per CPU socket           6+
  CPU speed (GHz)                2.66+
  System memory per socket (GB)  32
  GPUs                           Kepler K10, K20, K20X; Fermi M2090, M2075, C2075
  # of GPUs per CPU socket       1-2
  GPU memory preference (GB)     6
  GPU to CPU connection          PCIe 2.0 or higher
  Server storage                 500 GB or higher
  Network configuration          Gemini, InfiniBand

Scale to multiple nodes with the same single-node configuration.
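A procurement checklist like the table above is easy to mechanize. This is a minimal sketch, not an official tool: the threshold values mirror the recommended NAMD configuration, while the function name and the sample `node` dict are made up for illustration.

```python
# Hedged sketch: validate a node spec against the recommended NAMD minimums.
# Thresholds mirror the table above; the sample `node` dict is hypothetical.
RECOMMENDED = {
    "cpu_sockets": 2,
    "cores_per_socket": 6,
    "cpu_ghz": 2.66,
    "memory_per_socket_gb": 32,
    "gpu_memory_gb": 6,
}

def meets_recommendation(node):
    """Return the spec names that fall below the recommended minimum."""
    return [k for k, minimum in RECOMMENDED.items() if node.get(k, 0) < minimum]

node = {"cpu_sockets": 2, "cores_per_socket": 8, "cpu_ghz": 2.9,
        "memory_per_socket_gb": 32, "gpu_memory_gb": 6}
print(meets_recommendation(node) or "node meets the recommended configuration")
```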
Summary/Conclusions
Benefits of GPU-Accelerated Computing
Faster than CPU-only systems in all tests
Large performance boost with small marginal price increase
Energy usage cut in half
GPUs scale very well within a node and over multiple nodes
Tesla K20 GPU is our fastest and lowest-power high-performance GPU to date

Try GPU accelerated NAMD for free – www.nvidia.com/GPUTestDrive
LAMMPS, Jan. 2013 or later
More Science for Your Money

[Chart: Embedded Atom Model, speedup compared to CPU only: 1x K10 1.7x; 1x K20 2.47x; 1x K20X 2.92x; 2x K10 3.3x; 2x K20 4.5x; 2x K20X 5.5x]

The blue node uses 2x E5-2687W (8 cores and 150W per CPU).
The green nodes have 2x E5-2687W and 1 or 2 NVIDIA K10, K20, or K20X GPUs (235W).

Experience performance increases of up to 5.5x with Kepler GPU nodes.
K20X, the Fastest GPU Yet

[Chart: Speedup relative to CPU alone: CPU Only, CPU + 2x M2090, CPU + K20X, CPU + 2x K20X]

The blue node uses 2x E5-2687W (8 cores and 150W per CPU).
The green nodes have 2x E5-2687W and either 2x NVIDIA M2090 or 1-2 K20X GPUs (235W).

Experience performance increases of up to 6.2x with Kepler GPU nodes.
One K20X performs as well as two M2090s.
Get a CPU Rebate to Fund Part of Your GPU Budget

[Chart: Acceleration in loop-time computation by additional GPUs, normalized to CPU only: 1 node + 1x M2090 5.31x; + 2x M2090 9.88x; + 3x M2090 12.9x; + 4x M2090 18.2x]

Running NAMD version 2.9.
The blue node contains dual X5670 CPUs (6 cores per CPU).
The green nodes contain dual X5570 CPUs (4 cores per CPU) and 1-4 NVIDIA M2090 GPUs.

Increase performance 18x when compared to CPU-only nodes.
Cheaper CPUs used with GPUs still deliver faster overall performance than more expensive CPUs alone!
Excellent Strong Scaling on Large Clusters

[Chart: LAMMPS Gay-Berne, 134M atoms; loop time (seconds) vs. node count (300-900) for GPU-accelerated XK6 vs. CPU-only XE6; speedups of 3.55x, 3.48x, and 3.45x across the range]

From 300-900 nodes, the NVIDIA GPU-powered XK6 maintained 3.5x performance compared to XE6 CPU nodes.
Each blue Cray XE6 node has 2x AMD Opteron CPUs (16 cores per CPU).
Each green Cray XK6 node has 1x AMD Opteron 1600 CPU (16 cores per CPU) and 1x NVIDIA X2090.
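The per-node-count speedups on this slide are simply ratios of CPU to GPU loop times at the same node count. This sketch shows the derivation; the loop times themselves are assumed placeholders chosen to reproduce speedups near the slide's 3.45x-3.55x figures, since the chart's exact values are not recoverable.

```python
# Hedged sketch: deriving strong-scaling speedups from loop times.
# The (node_count, XE6 seconds, XK6+GPU seconds) samples are assumed values.
def gpu_speedup(cpu_loop_time_s, gpu_loop_time_s):
    """Speedup at a fixed problem size and node count: CPU time / GPU time."""
    return cpu_loop_time_s / gpu_loop_time_s

samples = [(300, 560.0, 158.0), (600, 290.0, 83.0), (900, 200.0, 58.0)]
for nodes, cpu_t, gpu_t in samples:
    print(f"{nodes} nodes: {gpu_speedup(cpu_t, gpu_t):.2f}x")
```

Because both configurations scale at a similar rate here, the ratio stays nearly constant from 300 to 900 nodes, which is what "maintained 3.5x" means in strong-scaling terms.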
GPUs Sustain 5x Performance for Weak Scaling

[Chart: Weak scaling with 32K atoms per node; loop time (seconds) vs. node count (1-729); GPU speedup from 6.7x at small node counts to 4.8x at large node counts]

Performance of 4.8x-6.7x with GPU-accelerated nodes when compared to CPUs alone.
Each blue Cray XE6 node has 2x AMD Opteron CPUs (16 cores per CPU).
Each green Cray XK6 node has 1x AMD Opteron 1600 CPU (16 cores per CPU) and 1x NVIDIA X2090.
Faster, Greener — Worth It!

[Chart: Energy consumed in one loop of EAM, kJ (lower is better): 1 node, 1 node + 1x K20X, 1 node + 2x K20X]

GPU-accelerated computing uses 53% less energy than CPU only.

Energy Expended = Power x Time. Power calculated by combining the components' TDPs.

The blue node uses 2x E5-2687W (8 cores and 150W per CPU) and CUDA 4.2.9.
The green nodes have 2x E5-2687W and 1 or 2 NVIDIA K20X GPUs (235W) running CUDA 5.0.36.

Try GPU accelerated LAMMPS for free – www.nvidia.com/GPUTestDrive
Molecular Dynamics with LAMMPS
 on a Hybrid Cray Supercomputer
                    W. Michael Brown
        National Center for Computational Sciences
              Oak Ridge National Laboratory

      NVIDIA Technology Theater, Supercomputing 2012
                     November 14, 2012
Early Kepler Benchmarks on Titan

[Chart: Atomic Fluid; loop time (s, log scale) vs. node count for XK7+GPU, XK6+GPU, and XK6]

[Chart: Bulk Copper; loop time (s, log scale) vs. node count for XK7+GPU, XK6+GPU, and XK6]
                                                                                                  25




                                                                                                 38
                                   1   2   4   8   16 32 64 128




                                                                                                40
                                                                                                10


                                                                                               16
Early Kepler Benchmarks on Titan
[Chart: Protein, wall-clock time (s, log scale) vs. nodes (1-128); series: XK6, XK6+GPU, XK7+GPU]
[Chart: Liquid Crystal, wall-clock time (s, log scale) vs. nodes (1-128); series: XK6, XK6+GPU, XK7+GPU]
Early Titan XK6/XK7 Benchmarks
Speedup with Acceleration on XK6/XK7 Nodes
(1 Node = 32K Particles; 900 Nodes = 29M Particles)

                  Atomic Fluid     Atomic Fluid     Bulk Copper   Protein   Liquid Crystal
                  (cutoff = 2.5σ)  (cutoff = 5.0σ)
XK6 (1 Node)           1.92             4.33            2.12        2.60         5.82
XK7 (1 Node)           2.90             8.38            3.66        3.36        15.70
XK6 (900 Nodes)        1.68             3.96            2.15        1.56         5.60
XK7 (900 Nodes)        2.75             7.48            2.86        1.95        10.14
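The speedups above are simply ratios of wall-clock times between CPU-only and accelerated runs of the same problem. A minimal sketch of how such figures are derived (the helper names are ours, not part of LAMMPS):

```python
def speedup(t_cpu_only: float, t_accelerated: float) -> float:
    """Speedup of an accelerated run over a CPU-only run of the same job."""
    return t_cpu_only / t_accelerated

def scaling_retention(speedup_1_node: float, speedup_n_nodes: float) -> float:
    """Fraction of the single-node speedup retained at scale."""
    return speedup_n_nodes / speedup_1_node

# Liquid Crystal row from the table: 15.70x on 1 XK7 node, 10.14x on 900 nodes.
print(round(scaling_retention(15.70, 10.14), 2))  # 0.65: ~65% of the speedup survives at 900 nodes
```

Comparing the 1-node and 900-node rows this way shows which benchmarks keep their acceleration as the run is scaled out.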
Recommended GPU Node Configuration for LAMMPS Computational Chemistry

Workstation or Single Node Configuration
  # of CPU sockets                   2
  Cores per CPU socket               6+
  CPU speed (GHz)                    2.66+
  System memory per socket (GB)      32
  GPUs                               Kepler K10, K20, K20X; Fermi M2090, M2075, C2075
  # of GPUs per CPU socket           1-2
  GPU memory preference (GB)         6
  GPU to CPU connection              PCIe 2.0 or higher
  Server storage                     500 GB or higher
  Network configuration              Gemini, InfiniBand

Scale to multiple nodes with the same single-node configuration.
GROMACS 4.6 Final, Pre-Beta, and Beta
GPU Accelerated Computational Chemistry Applications
Great Scaling in Small Systems

[Chart: Nanoseconds/Day vs. number of nodes (1-3), CPU-only vs. with GPU; GPU-accelerated nodes reach 8.36, 13.01, and 21.68 ns/day on 1, 2, and 3 nodes, a 3.2-3.7x speedup over CPU-only nodes]

Running GROMACS 4.6 pre-beta with CUDA 4.1.
Each blue node contains 1x Intel X5550 CPU (95W TDP, 4 cores per CPU).
Each green node contains 1x Intel X5550 CPU (95W TDP, 4 cores per CPU) and 1x NVIDIA M2090 (225W TDP per GPU).
Benchmark system: RNase in water with 16,816 atoms in a truncated dodecahedron box.

Get up to 3.7x the performance of CPU-only nodes.
Additional Strong Scaling on a Larger System

128K Water Molecules

[Chart: Nanoseconds/Day vs. number of nodes (8-128), CPU-only vs. with GPU; GPU speedups range from 3.1x at 8 nodes and 2.8x at 32 nodes to 2x at 128 nodes]

Running GROMACS 4.6 pre-beta with CUDA 4.1.
Each blue node contains 1x Intel X5670 (95W TDP, 6 cores per CPU).
Each green node contains 1x Intel X5670 (95W TDP, 6 cores per CPU) and 1x NVIDIA M2070 (225W TDP per GPU).

Up to 128 nodes, NVIDIA GPU-accelerated nodes deliver 2-3x the performance of CPU-only nodes.
Replace 3 Nodes with 2 GPUs

ADH in Water (134K Atoms)

[Chart: 4 CPU nodes deliver 6.7 ns/day at $8,000; 1 node + 2x M2090 delivers 8.36 ns/day at $6,500]

Running GROMACS 4.6 pre-beta with CUDA 4.1.
The blue node contains 2x Intel X5550 CPUs (95W TDP, 4 cores per CPU).
The green node contains 2x Intel X5550 CPUs (95W TDP, 4 cores per CPU) and 2x NVIDIA M2090 GPUs (225W TDP per GPU).
Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.

Save thousands of dollars and perform 25% faster.
Greener Science

ADH in Water (134K Atoms)

[Chart: Energy expended (kilojoules) per simulated nanosecond, lower is better: 4 nodes (760 watts) vs. 1 node + 2x M2090 (640 watts)]

Running GROMACS 4.6 with CUDA 4.1.
The blue nodes contain 2x Intel X5550 CPUs (95W TDP, 4 cores per CPU).
The green node contains 2x Intel X5550 CPUs (95W TDP, 4 cores per CPU) and 2x NVIDIA M2090 GPUs (225W TDP per GPU).

Energy Expended = Power x Time

In simulating each nanosecond, the GPU-accelerated system uses 33% less energy.
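The Energy Expended = Power x Time relation can be checked directly against this benchmark. A small sketch (the function name is ours; wattages come from this slide, and the ns/day throughputs from the preceding ADH slide):

```python
SECONDS_PER_DAY = 86_400

def energy_per_ns_kj(power_watts: float, ns_per_day: float) -> float:
    """Kilojoules consumed to simulate one nanosecond at a given throughput."""
    seconds_per_ns = SECONDS_PER_DAY / ns_per_day
    return power_watts * seconds_per_ns / 1000.0  # W * s = J; / 1000 -> kJ

cpu_only = energy_per_ns_kj(760, 6.7)    # 4 CPU nodes: ~9,800 kJ per ns
gpu_node = energy_per_ns_kj(640, 8.36)   # 1 node + 2x M2090: ~6,600 kJ per ns
print(f"GPU system uses {1 - gpu_node / cpu_only:.0%} less energy")  # ~33%
```

The GPU node draws less power and finishes each nanosecond sooner, so both factors of the product shrink.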
The Power of Kepler

RNase Solvated Protein, 24K Atoms (Ribonuclease)

[Chart: Nanoseconds/Day for 1 or 2 CPUs paired with 1 or 2 GPUs, M2090 vs. K20X]

Running GROMACS version 4.6 beta.
The grey nodes contain 1 or 2 E5-2687W CPUs (150W each, 8 cores per CPU) and 1 or 2 NVIDIA M2090s.
The green nodes contain 1 or 2 E5-2687W CPUs (8 cores per CPU) and 1 or 2 NVIDIA K20X GPUs (235W each).

Upgrading an M2090 to a K20X increases performance 10-45%.
K20X – Fast

RNase Solvated Protein, 24K Atoms (Ribonuclease)

[Chart: Nanoseconds/Day for 1 or 2 CPUs, CPU-only vs. with 1 K20X]

Running GROMACS version 4.6 beta.
The blue nodes contain 1 or 2 E5-2687W CPUs (150W each, 8 cores per CPU).
The green nodes contain 1 or 2 E5-2687W CPUs (8 cores per CPU) and 1 or 2 NVIDIA K20X GPUs (235W each).

Adding a K20X increases performance by up to 3x.
K20X, the Fastest Yet

192K Water Molecules

[Chart: Nanoseconds/Day for CPU-only, CPU + K20X, and CPU + 2x K20X]

Running GROMACS version 4.6-beta2 and CUDA 5.0.35.
The blue node contains 2 E5-2687W CPUs (150W each, 8 cores per CPU).
The green nodes contain 2 E5-2687W CPUs (8 cores per CPU) and 1 or 2 NVIDIA K20X GPUs (235W each).

Using K20X nodes increases performance by 2.5x.

Try GPU-accelerated GROMACS 4.6 for free: www.nvidia.com/GPUTestDrive
Recommended GPU Node Configuration for GROMACS Computational Chemistry

Workstation or Single Node Configuration
  # of CPU sockets                   2
  Cores per CPU socket               6+
  CPU speed (GHz)                    2.66+
  System memory per socket (GB)      32
  GPUs                               Kepler K10, K20, K20X; Fermi M2090, M2075, C2075
  # of GPUs per CPU socket           1x (Kepler-based GPUs need a fast Sandy Bridge, the very fastest Westmeres, or high-end AMD Opterons)
  GPU memory preference (GB)         6
  GPU to CPU connection              PCIe 2.0 or higher
  Server storage                     500 GB or higher

Scale to multiple nodes with the same single-node configuration.
CHARMM Release C37b1

GPUs Outperform CPUs

Daresbury Crambin, 19.6K Atoms

[Chart: Nanoseconds/Day for 44x X5667 ($44,000) vs. 2x X5667 + 1x C2070 ($3,000) vs. 2x X5667 + 2x C2070 ($4,000)]

Running CHARMM release C37b1.
The blue bar represents 44 X5667 CPUs (95W, 4 cores per CPU).
The green bars represent 2 X5667 CPUs with 1 or 2 NVIDIA C2070 GPUs (238W each).
Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.

1 GPU = 15 CPUs
More Bang for Your Buck

Daresbury Crambin, 19.6K Atoms

[Chart: Scaled performance/price for 44x X5667, 2x X5667 + 1x C2070, and 2x X5667 + 2x C2070]

Running CHARMM release C37b1.
The blue bar represents 44 X5667 CPUs (95W, 4 cores per CPU).
The green bars represent 2 X5667 CPUs with 1 or 2 NVIDIA C2070 GPUs (238W each).
Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.

Using GPUs delivers 10.6x the performance for the same cost.
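The scaled performance/price metric charted above divides each configuration's throughput-per-dollar by that of the all-CPU baseline. A sketch of the arithmetic (the throughput numbers below are hypothetical placeholders; the prices are from the preceding slide):

```python
def scaled_perf_per_price(ns_per_day: float, price_usd: float,
                          base_ns_per_day: float, base_price_usd: float) -> float:
    """Throughput per dollar, normalized so the baseline scores 1.0."""
    return (ns_per_day / price_usd) / (base_ns_per_day / base_price_usd)

# Baseline: 44x X5667 at $44,000. Hypothetical: a GPU node delivering 3/4 of
# the baseline throughput at $3,000 would still score 11x on this metric.
print(scaled_perf_per_price(60, 44_000, 60, 44_000))           # 1.0
print(round(scaled_perf_per_price(45, 3_000, 60, 44_000), 6))  # 11.0
```

The normalization makes the chart easy to read: any bar above 1.0 beats the CPU cluster per dollar spent.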
Greener Science with NVIDIA

Energy Used in Simulating 1 ns of Daresbury G1nBP, 61.2K Atoms

[Chart: Energy expended (kJ), lower is better, for 64x X5667, 2x X5667 + 1x C2070, and 2x X5667 + 2x C2070]

Running CHARMM release C37b1.
The blue bar represents 64 X5667 CPUs (95W, 4 cores per CPU).
The green bars represent 2 X5667 CPUs with 1 or 2 NVIDIA C2070 GPUs (238W each).
Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.

Energy Expended = Power x Time

Using GPUs decreases energy use by 75%.
ACEMD (www.acellera.com)

470 ns/day on 1 GPU for L-Iduronic acid (1,362 atoms)
116 ns/day on 1 GPU for DHFR (23K atoms)

M. Harvey, G. Giupponi and G. De Fabritiis, "ACEMD: Accelerated molecular dynamics simulations in the microseconds timescale," J. Chem. Theory Comput. 5, 1632 (2009)

Features: NVT, NPT, PME, TCL, PLUMED, CAMSHIFT¹

1. M. J. Harvey and G. De Fabritiis, "An implementation of the smooth particle-mesh Ewald (PME) method on GPU hardware," J. Chem. Theory Comput. 5, 2371-2377 (2009)
2. For a list of selected references see http://www.acellera.com/acemd/publications
Quantum Chemistry Applications

Abinit
  Features: Local Hamiltonian, non-local Hamiltonian, LOBPCG algorithm, diagonalization/orthogonalization
  GPU Perf: 1.3-2.7X
  Status: Released since version 6.12; multi-GPU support
  Notes: www.abinit.org

ACES III
  Features: Integrating scheduling GPU into SIAL programming language and SIP runtime environment
  GPU Perf: 10X on kernels
  Status: Under development; multi-GPU support
  Notes: http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/deumens_ESaccel_2012.pdf

ADF
  Features: Fock matrix, Hessians
  GPU Perf: TBD
  Status: Pilot project completed, under development; multi-GPU support
  Notes: www.scm.com

BigDFT
  Features: DFT; Daubechies wavelets; part of Abinit
  GPU Perf: 5-25X (1 CPU core to GPU kernel)
  Status: Released June 2009, current release 1.6; multi-GPU support
  Notes: http://inac.cea.fr/L_Sim/BigDFT/news.html, http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/BigDFT-Formalism.pdf and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/BigDFT-HPC-tues.pdf

Casino
  Features: TBD
  GPU Perf: TBD
  Status: Under development, Spring 2013 release; multi-GPU support
  Notes: http://www.tcm.phy.cam.ac.uk/~mdt26/casino.html

CP2K
  Features: DBCSR (sparse matrix multiply library)
  GPU Perf: 2-7X
  Status: Under development; multi-GPU support
  Notes: http://www.olcf.ornl.gov/wp-content/training/ascc_2012/friday/ACSS_2012_VandeVondele_s.pdf

GPU Perf compared against a multi-core x86 CPU socket. GPU Perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.
Quantum Chemistry Applications
 Application | Features Supported | GPU Perf | Release Status | Notes

 GAMESS-UK | (ss|ss) type integrals within calculations using Hartree-Fock ab initio methods and density functional theory. Supports organics & inorganics. | 8x | Released Summer 2012, Multi-GPU support | http://www.ncbi.nlm.nih.gov/pubmed/21541963

 Gaussian | Joint PGI, NVIDIA & Gaussian collaboration | TBD | Under development, Multi-GPU support | Announced Aug. 29, 2011: http://www.gaussian.com/g_press/nvidia_press.htm

 GPAW | Electrostatic Poisson equation, orthonormalizing of vectors, residual minimization method (RMM-DIIS) | 8x | Released, Multi-GPU support | https://wiki.fysik.dtu.dk/gpaw/devel/projects/gpu.html, Samuli Hakala (CSC Finland) & Chris O'Grady (SLAC)

 Jaguar | Investigating GPU acceleration | TBD | Under development, Multi-GPU support | Schrodinger, Inc. http://www.schrodinger.com/kb/278

 LSMS | Generalized Wang-Landau method | 3x with 32 GPUs vs. 32 (16-core) CPUs | Under development, Multi-GPU support | NICS Electronic Structure Determination Workshop 2012: http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Eisenbach_OakRidge_February.pdf

 MOLCAS | CUBLAS support | 1.1x | Released, Version 7.8; single GPU, additional GPU support coming in Version 8 | www.molcas.org

 MOLPRO | Density-fitted MP2 (DF-MP2), density-fitted local correlation | 1.7-2.3X | Under development | www.molpro.net
 MOPAC2009 | Pseudodiagonalization, full diagonalization, and density matrix assembling | 3.8-14X | Under development, Single GPU | Academic port. http://openmopac.net

 NWChem | Triples part of Reg-CCSD(T), CCSD & EOMCCSD task schedulers | 3-10X projected | Release targeting end of 2012, Multiple GPUs | Development GPGPU benchmarks: www.nwchem-sw.org and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Krishnamoorthy-ESCMA12.pdf

 Octopus | DFT and TDDFT | TBD | Released | http://www.tddft.org/programs/octopus/

 PEtot | Density functional theory (DFT) plane-wave pseudopotential calculations | 6-10X | Released, Multi-GPU | First-principles materials code that computes the behavior of the electron structures of materials

 Q-CHEM | RI-MP2 | 8x-14x | Released, Version 4.0 | http://www.q-chem.com/doc_for_web/qchem_manual_4.0.pdf
 QMCPACK | Main features | 3-4x | Released, Multiple GPUs | NCSA, University of Illinois at Urbana-Champaign: http://cms.mcc.uiuc.edu/qmcpack/index.php/GPU_version_of_QMCPACK

 Quantum Espresso/PWscf | PWscf package: linear algebra (matrix multiply), explicit computational kernels, 3D FFTs | 2.5-3.5x | Released, Version 5.0, Multiple GPUs | Created by Irish Centre for High-End Computing: http://www.quantum-espresso.org/index.php and http://www.quantum-espresso.org/

 TeraChem | "Full GPU-based solution" | 44-650X vs. GAMESS CPU version | Released, Version 1.5, Multi-GPU/single node | Completely redesigned to exploit GPU parallelism. YouTube: http://youtu.be/EJODzk6RFxE?hd=1 and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Luehr-ESCMA.pdf

 VASP | Hybrid Hartree-Fock DFT functionals including exact exchange | 2x (2 GPUs comparable to 128 CPU cores) | Available on request, Multiple GPUs | By Carnegie Mellon University: http://arxiv.org/pdf/1111.0716.pdf
BigDFT
GPU Accelerated Computational Chemistry Applications
CP2K
Kepler, it's faster

[Chart: CP2K performance relative to CPU only for CPU + K10, K20, K20X and CPU + 2x K10, 2x K20, 2x K20X nodes]

Running CP2K version 12413-trunk on CUDA 5.0.36.
The blue node contains 2 E5-2687W CPUs (150W, 8 cores per CPU).
The green nodes contain 2 E5-2687W CPUs and 1 or 2 NVIDIA K10, K20, or K20X GPUs (235W each).

Using GPUs delivers up to 12.6x the performance per node.
Strong Scaling

[Chart: speedup relative to 256 non-GPU cores, XK6 with GPUs vs. XK6 without GPUs: 2.3x at 256 cores, 2.9x at 512 cores, 3x at 768 cores]

Conducted on a Cray XK6 using matrix-matrix multiplication, NREP=6 and N=159,000 with 50% occupation.

Speedups increase as more cores are added, up to 3x at 768 cores.
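One way to read the scaling data is as parallel efficiency relative to the 256-core baseline. A minimal sketch using the approximate speedups read off the chart (treat the figures as illustrative):

```python
# Speedup of the GPU-accelerated XK6 run relative to 256 non-GPU cores,
# read approximately from the chart above.
gpu_speedup = {256: 2.3, 512: 2.9, 768: 3.0}

def scaling_efficiency(cores, base_cores=256):
    """Fraction of ideal linear scaling retained when growing from base_cores."""
    ideal = cores / base_cores                          # perfect scaling factor
    actual = gpu_speedup[cores] / gpu_speedup[base_cores]
    return actual / ideal

for n in (256, 512, 768):
    print(n, round(scaling_efficiency(n), 2))
```

The absolute speedup keeps growing with core count, which is the slide's point, even though efficiency per added core declines as it does in most strong-scaling runs.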
Kepler, keeping the planet Green

[Chart: energy expended (kJ), lower is better, for CPU only, CPU + K20, and CPU + 2x K20 nodes]

Running CP2K version 12413-trunk on CUDA 5.0.36.
The blue node contains 2 E5-2687W CPUs (150W, 8 cores per CPU).
The green nodes contain 2 E5-2687W CPUs and 1 or 2 NVIDIA K20 GPUs (235W each).
Energy expended = power x time.

Using K20s will lower energy use by over 75% for the same simulation.
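The relation energy = power x time is why a higher-wattage GPU node can still expend less energy: runtime drops faster than power rises. A minimal sketch with assumed, illustrative wattages and runtimes (not the measured CP2K figures):

```python
def energy_kj(power_watts, runtime_s):
    # Energy (kJ) = power (kW) x time (s)
    return power_watts / 1000.0 * runtime_s

# Assumed illustrative figures: a 300 W CPU-only node running for 1000 s
# vs. a 300 W + 2x 235 W GPU node finishing the same job 5x faster.
cpu_only = energy_kj(300, 1000)           # 300 kJ
cpu_gpu = energy_kj(300 + 2 * 235, 200)   # 154 kJ
print(cpu_only, cpu_gpu)
```

Here the GPU node draws more than twice the power yet expends roughly half the energy, the same shape of result the chart reports.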
GAUSSIAN
Gaussian
    Key quantum chemistry code
    ACS Fall 2011 press release
        Joint collaboration between Gaussian, NVIDIA and PGI for GPU
        acceleration: http://www.gaussian.com/g_press/nvidia_press.htm
        No such release exists for Intel MIC or AMD GPUs
    Mike Frisch quote:
        "Calculations using Gaussian are limited primarily by the available computing
        resources," said Dr. Michael Frisch, president of Gaussian, Inc. "By coordinating the
        development of hardware, compiler technology and application software among the
        three companies, the new application will bring the speed and cost-effectiveness of
        GPUs to the challenging problems and applications that Gaussian's customers need to
        address."

NVIDIA Confidential
GAMESS
GAMESS Partnership Overview
 Mark Gordon and Andrey Asadchev, key developers of GAMESS,
 in collaboration with NVIDIA. Mark Gordon is a recipient of an
 NVIDIA Professor Partnership Award.

 Quantum chemistry is one of the major consumers of CPU cycles at
 national supercomputer centers.

 NVIDIA developer resources fully allocated to GAMESS code.
        "We like to push the envelope as much as we can in the direction of highly scalable efficient
        codes. GPU technology seems like a good way to achieve this goal. Also, since we are
        associated with a DOE Laboratory, energy efficiency is important, and this is another reason
        to explore quantum chemistry on GPUs."
                                                                                    Prof. Mark Gordon
                                  Distinguished Professor, Department of Chemistry, Iowa State University and
                                          Director, Applied Mathematical Sciences Program, AMES Laboratory
GAMESS August 2011 GPU Performance
First GPU-supported GAMESS release via "libqc", a library for fast quantum
chemistry on multiple NVIDIA GPUs in multiple nodes, implemented with CUDA:
2e- AO integrals and their assembly into a closed-shell Fock matrix.

[Chart: performance of the GAMESS Aug. 2011 release for two small molecules,
Ginkgolide (53 atoms) and Vancomycin (176 atoms), comparing 4x E5640 CPUs
against 4x E5640 CPUs + 4x Tesla C2070s]
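For context on what libqc computes: a closed-shell Fock matrix is assembled from the one-electron integrals plus Coulomb and exchange contractions of the two-electron AO integrals, F = H + 2J - K. A dense NumPy sketch of that contraction (illustrative only; the actual library evaluates screened integral batches on the GPU):

```python
import numpy as np

def closed_shell_fock(hcore, eri, density):
    """F = Hcore + 2J - K for a closed-shell system.

    hcore:   (n, n) one-electron integrals
    eri:     (n, n, n, n) two-electron AO integrals, chemists' notation (mu nu|lam sig)
    density: (n, n) one-particle density matrix
    """
    coulomb = np.einsum("mnls,ls->mn", eri, density)   # J_mn = (mn|ls) P_ls
    exchange = np.einsum("mlns,ls->mn", eri, density)  # K_mn = (ml|ns) P_ls
    return hcore + 2.0 * coulomb - exchange

# Tiny random symmetric example just to exercise the contraction shapes.
n = 4
rng = np.random.default_rng(0)
h = rng.standard_normal((n, n)); h = h + h.T
eri = rng.standard_normal((n, n, n, n))
p = rng.standard_normal((n, n)); p = p + p.T
f = closed_shell_fock(h, eri, p)
print(f.shape)
```

The two einsum contractions over the four-index integral tensor are the O(n^4) work per Fock build that makes GPU acceleration of this step worthwhile.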
Upcoming GAMESS Q4 2012 Release
 Multi-node with multi-GPU supported
 Rys Quadrature
 Hartree-Fock
    8 CPU cores + M2070 yields a 2.3-2.9x speedup over 8 CPU cores alone.
    See 2012 publication
 Møller-Plesset perturbation theory (MP2):
    Preliminary code completed
    Paper in development
 Coupled Cluster SD(T):
    CCSD code completed, (T) in progress
GAMESS - New Multithreaded Hybrid CPU/GPU Approach to H-F

[Chart: Hartree-Fock GPU speedups* from adding one 2070 GPU: Taxol 6-31G 2.3x,
Taxol 6-31G(d) 2.5x, Taxol 6-31G(2d,2p) 2.5x, Taxol 6-31++G(d,p) 2.3x,
Valinomycin 6-31G 2.4x, Valinomycin 6-31G(d) 2.3x, Valinomycin 6-31G(2d,2p) 2.9x]

Adding 1x 2070 GPU speeds up computations by 2.3x to 2.9x.

* A. Asadchev, M.S. Gordon, "New Multithreaded Hybrid CPU/GPU Approach to
Hartree-Fock," Journal of Chemical Theory and Computation (2012)
GPAW
Used with permission from Samuli Hakala.
NWChem
NWChem - Speedup of the non-iterative calculation for various configurations/tile sizes

System: cluster consisting of dual-socket nodes constructed from:
• 8-core AMD Interlagos processors
• 64 GB of memory
• Tesla M2090 (Fermi) GPUs

The nodes are connected using a high-performance QDR InfiniBand interconnect.

Courtesy of Kowalski, K., Bhaskaran-Nair, et al. @ PNNL, JCTC (submitted)
Quantum Espresso/PWscf
Kepler, fast science

[Chart: AUsurf performance relative to CPU only for CPU + M2090, CPU + K10, CPU + 2x M2090, and CPU + 2x K10 nodes]

Running Quantum Espresso version 5.0-build7 on CUDA 5.0.36.
The blue node contains 2 E5-2687W CPUs (150W, 8 cores per CPU).
The green nodes contain 2 E5-2687W CPUs and 1 or 2 NVIDIA M2090 or K10 GPUs (225W and 235W respectively).

Using K10s delivers up to 11.7x the performance per node over CPUs,
and 1.7x the performance when compared to M2090s.
Extreme Performance/Price from 1 GPU

[Chart: price and performance scaled to the CPU-only system for the Shilu-3 and Water-on-Calcite benchmarks; calcite structure pictured]

Simulations run on FERMI @ ICHEC.
A 6-core 2.66 GHz Intel X5650 was used for the CPU.
An NVIDIA C2050 was used for the GPU.

Adding a GPU can improve performance by 3.7x while only increasing price by 25%.
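The price/performance claim reduces to simple arithmetic; a sketch using the slide's approximate figures (3.7x the speed for 1.25x the price):

```python
def perf_per_price(speedup, price_factor):
    # Performance delivered per unit of system cost, both relative to CPU only.
    return speedup / price_factor

cpu_only = perf_per_price(1.0, 1.0)
cpu_gpu = perf_per_price(3.7, 1.25)
print(round(cpu_gpu / cpu_only, 2))  # ~2.96x better performance per dollar
```

The same calculation applies to the other benchmark slides: divide the speedup by the price multiplier to get the performance-per-dollar gain.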
Extreme Performance/Price from 1 GPU

[Chart: price and performance scaled to the CPU-only system for AUSURF112 k-point and AUSURF112 gamma-point calculations]

Simulations run on FERMI @ ICHEC.
A 6-core 2.66 GHz Intel X5650 was used for the CPU.
An NVIDIA C2050 was used for the GPU.

Calculation done for a gold surface of 112 atoms.

Adding a GPU can improve performance by 3.5x while only increasing price by 25%.
Replace 72 CPUs with 8 GPUs

[Chart: LSMO-BFO (120 atoms), 8 k-points; elapsed time of 223 minutes on 120 CPUs ($42,000) vs. 219 minutes on 48 CPUs + 8 GPUs ($32,800)]

Simulations run on PLX @ CINECA.
Intel 6-core 2.66 GHz X5550s were used for the CPUs.
NVIDIA M2070s were used for the GPUs.

The GPU-accelerated setup performs faster and costs $9,200 (about 22%) less.
QE/PWscf - Green Science

[Chart: power consumption (watts), lower is better, LSMO-BFO (120 atoms) 8 k-points: 120 CPUs ($42,000) vs. 48 CPUs + 8 GPUs ($32,800)]

Simulations run on PLX @ CINECA.
Intel 6-core 2.66 GHz X5550s were used for the CPUs.
NVIDIA M2070s were used for the GPUs.

Over a year, the lower power consumption would save $4300 on energy bills.
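The annual figure follows from the power gap between the two configurations; a sketch assuming a roughly 5 kW difference (read approximately off the chart) and a hypothetical $0.10/kWh electricity rate, neither of which is stated on the slide:

```python
def annual_savings(delta_kw, usd_per_kwh=0.10, hours_per_year=8760):
    # Savings = power difference (kW) x hours of continuous operation x rate.
    # Both delta_kw and the rate are illustrative assumptions here.
    return delta_kw * hours_per_year * usd_per_kwh

print(round(annual_savings(5.0)))  # ~$4380/year, consistent with the ~$4300 claim
```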
NVIDIA GPUs Use Less Energy

[Chart: energy consumption (kWh) on different tests, lower is better, CPU only vs. CPU+GPU: -57% for Shilu-3, -58% for AUSURF112, -54% for Water-on-Calcite]

Simulations run on FERMI @ ICHEC.
A 6-core 2.66 GHz Intel X5650 was used for the CPU.
An NVIDIA C2050 was used for the GPU.

In all tests, the GPU-accelerated system consumed less than half the energy of the CPU-only system.
QE/PWscf - Great Strong Scaling in Parallel

[Chart: CdSe-159 walltime of one full SCF, lower is better, CPU vs. CPU+GPU from 2 nodes (16 cores) to 14 nodes (112 cores); speedups of 2.1-2.5x throughout, with 2.5x at 2 nodes]

Simulations run on STONEY @ ICHEC.
Two quad-core 2.87 GHz Intel X5560s were used in each node.
Two NVIDIA M2090s were used in each node for the CPU+GPU test.
159-atom cadmium selenide nanodots.

Speedups up to 2.5x with GPU acceleration.
QE/PWscf - More Powerful Strong Scaling

[Chart: GeSnTe134 walltime of 1 full SCF (lower is better), CPU vs. CPU+GPU, from 4 nodes (48 cores) to 44 nodes (528 cores); GPU speedups of 1.6x, 2.3x, 2.4x, and 2.1x are labeled]

Simulations run on PLX @ CINECA.
Two 6-core 2.4 GHz Intel E5645s were used in each node.
Two NVIDIA M2070s were used in each node for the CPU+GPU test.

Accelerate your cluster by up to 2.1x with NVIDIA GPUs

Try GPU accelerated Quantum Espresso for free – www.nvidia.com/GPUTestDrive
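The speedup labels on these charts are simply the ratio of CPU-only to CPU+GPU walltime at each node count. A minimal sketch of that arithmetic (the walltimes below are illustrative placeholders, not the measured ICHEC/CINECA data):

```python
# GPU speedup at each node count: speedup = t_cpu / t_cpu_gpu.
# Walltimes here are illustrative placeholders, not the measured benchmark data.
def speedups(t_cpu, t_gpu):
    """Map node count -> speedup of the CPU+GPU run over the CPU-only run."""
    return {n: t_cpu[n] / t_gpu[n] for n in t_cpu}

t_cpu = {2: 30000, 8: 11000, 14: 6600}   # seconds, CPU-only
t_gpu = {2: 12000, 8: 5000, 14: 3000}    # seconds, CPU+GPU

for nodes, s in speedups(t_cpu, t_gpu).items():
    print(f"{nodes} nodes: {s:.1f}x")
```

The same ratio applied to the measured walltimes yields the 2.5x and 2.1x figures quoted on the slides.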
TeraChem
Supercomputer Speeds on GPUs

[Chart: time for one SCF step in seconds (lower is better); 4096 quad-core CPUs ($19,000,000) vs. 8 C2050s ($31,000)]

TeraChem running on 8 C2050s in 1 node.
NWChem running on 4096 quad-core CPUs in the Chinook supercomputer.
Benchmark: giant fullerene C240 molecule.

Similar performance from just a handful of GPUs
TeraChem
Bang for the Buck

[Chart: performance/price relative to the supercomputer; 4096 quad-core CPUs ($19,000,000) = 1 vs. 8 C2050s ($31,000) = 493]

TeraChem running on 8 C2050s in 1 node.
NWChem running on 4096 quad-core CPUs in the Chinook supercomputer.
Benchmark: giant fullerene C240 molecule.

Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.

Dollars spent on GPUs do 500x more science than those spent on CPUs
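The 493x figure is a performance-per-dollar ratio. A minimal sketch of the arithmetic (only the prices come from the slide; the GPU node's relative performance of ~0.8x is an assumed stand-in for "similar performance"):

```python
# Performance per dollar, relative to the CPU supercomputer baseline.
# Prices are from the slide; perf_gpu=0.8 is an assumption standing in
# for "similar performance", not a measured number.
def perf_per_dollar_ratio(perf_gpu, price_gpu, perf_cpu, price_cpu):
    return (perf_gpu / price_gpu) / (perf_cpu / price_cpu)

ratio = perf_per_dollar_ratio(perf_gpu=0.8, price_gpu=31_000,
                              perf_cpu=1.0, price_cpu=19_000_000)
print(f"{ratio:.0f}x more performance per dollar")  # ~490x
```

With roughly 80% of the supercomputer's speed at 0.16% of its price, the ratio lands near the slide's 493x.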
Kepler’s Even Better

[Charts: TeraChem SCF walltime in seconds on C2050 vs. K20C; left: Olestra, BLYP/6-31G(d), 453 atoms; right: B3LYP/6-31G(d)]

TeraChem running on C2050 and K20C.
First graph is of BLYP/6-31G(d); second is B3LYP/6-31G(d).

Kepler performs 2x faster than Fermi
Viz, "Docking" and Related Applications Growing

Application — Features Supported | GPU Perf | Release Status | Notes

Amira 5® — 3D visualization of volumetric data and surfaces | 70x | Released, Version 5.3.3; single GPU | Visualization from Visage Imaging. Next release, 5.4, will use GPU for general-purpose processing in some functions. http://www.visageimaging.com/overview.html

BINDSURF — Allows fast processing of large ligand databases | 100X | Available upon request to authors; single GPU | High-throughput parallel blind virtual screening. http://www.biomedcentral.com/1471-2105/13/S14/S13

BUDE — Empirical free-energy forcefield | 6.5-13.4X | Released; single GPU | University of Bristol. http://www.bris.ac.uk/biochemistry/cpfg/bude/bude.htm

Core Hopping — GPU-accelerated application | 3.75-5000X | Released, Suite 2011; single and multi-GPU | Schrodinger, Inc. http://www.schrodinger.com/products/14/32/

FastROCS — Real-time shape similarity searching/comparison | 800-3000X | Released; single and multi-GPU | OpenEye Scientific Software. http://www.eyesopen.com/fastrocs

PyMol — Lines: 460% increase; Cartoons: 1246% increase; Surface: 1746% increase; Spheres: 753% increase; Ribbon: 426% increase | 1700x | Released, Version 1.5; single GPU | http://pymol.org/

VMD — High-quality rendering, large structures (100 million atoms), analysis and visualization tasks, multiple GPU support for display of molecular ... | 100-125X or greater on kernels | Released, Version 1.9 | Visualization from University of Illinois at Urbana-Champaign. http://www.ks.uiuc.edu/Research/vmd/

GPU Perf compared against multi-core x86 CPU socket. GPU Perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.
FastROCS
OpenEye Japan
Hideyuki Sato, Ph.D.

© 2012 OpenEye Scientific Software

ROCS on the GPU: FastROCS

[Chart: shape overlays per second, CPU vs. GPU (0-400,000 scale)]
Riding Moore’s Law

[Chart: shape overlays per second across GPU generations — C1060, C2050, C2075, C2090, K10, K20 (0-2,000,000 scale)]
FastROCS scaling across 4x K10s (2 physical GPUs per K10)

[Chart: conformers per second vs. number of individual K10 GPUs, 1 through 8 (0-9,000,000 scale); note, each K10 has 2 physical GPUs on the board]

Dataset: 53 million conformers (10.9 million compounds of PubChem at 5 conformers per molecule)
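Throughput numbers like these translate directly into screening time: time = conformers / (conformers per second). A minimal sketch (the ~8M conformers/s figure is a rough read of the 8-GPU point on the chart, not an exact measurement):

```python
# Estimate the wall time to screen a conformer database at a given throughput.
# The throughput value is a rough read of the chart's 8-GPU point, not a
# published measurement.
def screening_time_s(n_conformers, conformers_per_s):
    return n_conformers / conformers_per_s

t = screening_time_s(53_000_000, 8_000_000)
print(f"~{t:.1f} s to screen the full set")  # ~6.6 s
```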
Benefits of GPU Accelerated Computing

Faster than CPU-only systems in all tests

Large performance boost with marginal price increase

Energy usage cut by more than half

GPUs scale well within a node and across multiple nodes

The K20 is our fastest and lowest-power high-performance GPU yet

Try GPU accelerated TeraChem for free – www.nvidia.com/GPUTestDrive
GPU Test Drive

Experience GPU Acceleration

For Computational Chemistry Researchers, Biophysicists

Preconfigured with Molecular Dynamics Apps

Remotely Hosted GPU Servers

Free & Easy – Sign up, Log in and See Results

www.nvidia.com/gputestdrive



GPU Accelerated Computational Chemistry Applications

  • 1. update Updated: February 4, 2013
  • 2. Molecular Dynamics (MD) Applications Features Application GPU Perf Release Status Notes/Benchmarks Supported > 100 ns/day AMBER 12, GPU Revision Support 12.2 PMEMD Explicit Solvent & GB Released AMBER Implicit Solvent JAC NVE on 2X Multi-GPU, multi-node http://ambermd.org/gpus/benchmarks. K20s htm#Benchmarks 2x C2070 equals Release C37b1; Implicit (5x), Explicit (2x) Released CHARMM Solvent via OpenMM 32-35x X5667 Single & multi-GPU in single node http://www.charmm.org/news/c37b1.html#po CPUs stjump Two-body Forces, Link-cell Source only, Results Published Release V 4.03 DL_POLY Pairs, Ewald SPME forces, 4x Multi-GPU, multi-node http://www.stfc.ac.uk/CSE/randd/ccg/softwa Shake VV re/DL_POLY/25526.aspx 165 ns/Day Released GROMACS Implicit (5x), Explicit (2x) DHFR on Multi-GPU, multi-node Release 4.6; 1st Multi-GPU support 4X C2075s http://lammps.sandia.gov/bench.html#deskto Lennard-Jones, Gay-Berne, Released. LAMMPS Tersoff & many more potentials 3.5-18x on Titan Multi-GPU, multi-node p and http://lammps.sandia.gov/bench.html#titan 4.0 ns/days Released Full electrostatics with PME and NAMD most simulation features F1-ATPase on 100M atom capable NAMD 2.9 1x K20X Multi-GPU, multi-node GPU Perf compared against Multi-core x86 CPU socket. GPU Perf benchmarked on GPU supported features and may be a kernel to kernel perf comparison
  • 3. New/Additional MD Applications Ramping Features Application GPU Perf Release Status Notes Supported 4-29X Released, Version 1.8.51 Abalone Simulations (on 1060 GPU) (on 1060 GPU) Single GPU Agile Molecule, Inc. Computation of non-valent 4-29X Released, Version 1.1.4 Ascalaph interactions (on 1060 GPU) Single GPU Agile Molecule, Inc. 150 ns/day DHFR on Released Production bio-molecular dynamics (MD) ACEMD Written for use only on GPUs 1x K20 Single and multi-GPUs software specially optimized to run on GPUs Powerful distributed computing Depends upon Released; http://folding.stanford.edu Folding@Home molecular dynamics system; number of GPUs GPUs and CPUs GPUs get 4X the points of CPUs implicit solvent and folding High-performance all-atom Depends upon Released; http://www.gpugrid.net/ GPUGrid.net biomolecular simulations; number of GPUs NVIDIA GPUs only explicit solvent and binding Simple fluids and binary mixtures (pair potentials, high- Up to 66x on 2090 Released, Version 0.2.0 http://halmd.org/benchmarks.html#supercool HALMD precision NVE and NVT, dynamic vs. 1 CPU core Single GPU ed-binary-mixture-kob-andersen correlations) Kepler 2X faster Released, Version 0.11.2 http://codeblue.umich.edu/hoomd-blue/ HOOMD-Blue Written for use only on GPUs than Fermi Single and multi-GPU on 1 node Multi-GPU w/ MPI in March 2013 Implicit: 127-213 Implicit and explicit solvent, Released Version 4.1.1 Library and application for molecular dynamics OpenMM custom forces ns/day Explicit: 18- Multi-GPU on high-performance 55 ns/day DHFR GPU Perf compared against Multi-core x86 CPU socket. GPU Perf benchmarked on GPU supported features and may be a kernel to kernel perf comparison
  • 4. Quantum Chemistry Applications Application Features Supported GPU Perf Release Status Notes Local Hamiltonian, non-local Hamiltonian, LOBPCG algorithm, Released; Version 7.0.5 www.abinit.org Abinit diagonalization / 1.3-2.7X Multi-GPU support orthogonalization Integrating scheduling GPU into http://www.olcf.ornl.gov/wp- Under development ACES III SIAL programming language and 10X on kernels Multi-GPU support content/training/electronic-structure- SIP runtime environment 2012/deumens_ESaccel_2012.pdf Pilot project completed, ADF Fock Matrix, Hessians TBD Under development www.scm.com Multi-GPU support http://inac.cea.fr/L_Sim/BigDFT/news.html, http://www.olcf.ornl.gov/wp- 5-25X Released June 2009, content/training/electronic-structure- DFT; Daubechies wavelets, BigDFT part of Abinit (1 CPU core to current release 1.6.0 2012/BigDFT-Formalism.pdf and GPU kernel) Multi-GPU support http://www.olcf.ornl.gov/wp- content/training/electronic-structure- 2012/BigDFT-HPC-tues.pdf Under development, http://www.tcm.phy.cam.ac.uk/~mdt26/casino. Casino TBD TBD Spring 2013 release html Multi-GPU support http://www.olcf.ornl.gov/wp- DBCSR (spare matrix multiply Under development CP2K library) 2-7X Multi-GPU support content/training/ascc_2012/friday/ACSS_2012_V andeVondele_s.pdf Libqc with Rys Quadrature 1.3-1.6X, Released Next release Q4 2012. GAMESS-US Algorithm, Hartree-Fock, MP2 2.3-2.9x HF Multi-GPU support http://www.msg.ameslab.gov/gamess/index.html and CCSD in Q4 2012 GPU Perf compared against Multi-core x86 CPU socket. GPU Perf benchmarked on GPU supported features and may be a kernel to kernel perf comparison
  • 5. Quantum Chemistry Applications Application Features Supported GPU Perf Release Status Notes (ss|ss) type integrals within calculations using Hartree Fock ab Release in 2012 http://www.ncbi.nlm.nih.gov/pubmed/215419 GAMESS-UK initio methods and density 8x Multi-GPU support 63 functional theory. Supports organics & inorganics. Under development Joint PGI, NVIDIA & Gaussian Announced Aug. 29, 2011 Gaussian Collaboration TBD Multi-GPU support http://www.gaussian.com/g_press/nvidia_press.htm Electrostatic poisson equation, Released orthonormalizing of vectors, https://wiki.fysik.dtu.dk/gpaw/devel/projects/gpu.html, GPAW residual minimization method 8x Multi-GPU support Samuli Hakala (CSC Finland) & Chris O’Grady (SLAC) (rmm-diis) Under development Schrodinger, Inc. Jaguar Investigating GPU acceleration TBD Multi-GPU support http://www.schrodinger.com/kb/278 Released, Version 7.8 MOLCAS CU_BLAS support 1.1x Single GPU. Additional GPU www.molcas.org support coming in Version 8 Density-fitted MP2 (DF-MP2), 1.7-2.3X Under development www.molpro.net MOLPRO density fitted local correlation projected Multiple GPU Hans-Joachim Werner methods (DF-RHF, DF-KS), DFT GPU Perf compared against Multi-core x86 CPU socket. GPU Perf benchmarked on GPU supported features and may be a kernel to kernel perf comparison
  • 6. Quantum Chemistry Applications Features Application GPU Perf Release Status Notes Supported pseudodiagonalization, full Under Development Academic port. MOPAC2009 diagonalization, and density 3.8-14X Single GPU http://openmopac.net matrix assembling Development GPGPU benchmarks: Triples part of Reg-CCSD(T), www.nwchem-sw.org Release targeting March 2013 NWChem CCSD & EOMCCSD task 3-10X projected Multiple GPUs And http://www.olcf.ornl.gov/wp- schedulers content/training/electronic-structure- 2012/Krishnamoorthy-ESCMA12.pdf Octopus DFT and TDDFT TBD Released http://www.tddft.org/programs/octopus/ Density functional theory (DFT) First principles materials code that computes Released PEtot plane wave pseudopotential 6-10X Multi-GPU the behavior of the electron structures of calculations materials http://www.q- Q-CHEM RI-MP2 8x-14x Released, Version 4.0 chem.com/doc_for_web/qchem_manual_4.0.pdf GPU Perf compared against Multi-core x86 CPU socket. GPU Perf benchmarked on GPU supported features and may be a kernel to kernel perf comparison
  • 7. Quantum Chemistry Applications Features Application GPU Perf Release Status Notes Supported NCSA Released University of Illinois at Urbana-Champaign QMCPACK Main features 3-4x Multiple GPUs http://cms.mcc.uiuc.edu/qmcpack/index.php /GPU_version_of_QMCPACK Created by Irish Centre for Quantum PWscf package: linear algebra (matrix multiply), explicit 2.5-3.5x Released Version 5.0 High-End Computing http://www.quantum-espresso.org/index.php Espresso/PWscf computational kernels, 3D FFTs Multiple GPUs and http://www.quantum-espresso.org/ Completely redesigned to exploit GPU parallelism. YouTube: 44-650X vs. Released http://youtu.be/EJODzk6RFxE?hd=1 and TeraChem “Full GPU-based solution” GAMESS CPU Version 1.5 http://www.olcf.ornl.gov/wp- version Multi-GPU/single node content/training/electronic-structure- 2012/Luehr-ESCMA.pdf 2x Hybrid Hartree-Fock DFT 2 GPUs Available on request By Carnegie Mellon University VASP functionals including exact comparable to Multiple GPUs http://arxiv.org/pdf/1111.0716.pdf exchange 128 CPU cores Generalized Wang-Landau 3x Under development GPU Perf Electronic Structure Determination Workshop 2012: NICS compared against Multi-core x86 CPU socket. http://www.olcf.ornl.gov/wp- WL-LSMS method with 32 GPUs vs. Multi-GPU support GPU Perf benchmarked on GPU supported features content/training/electronic-structure- 32 (16-core) CPUs and2012/Eisenbach_OakRidge_February.pdfcomparison may be a kernel to kernel perf
  • 8. Viz, ―Docking‖ and Related Applications Growing Related Features GPU Perf Release Status Notes Applications Supported Visualization from Visage Imaging. Next release, 5.4, will use 3D visualization of volumetric Released, Version 5.3.3 Amira 5® data and surfaces 70x Single GPU GPU for general purpose processing in some functions http://www.visageimaging.com/overview.html High-Throughput parallel blind Virtual Screening, Allows fast processing of large Available upon request to BINDSURF ligand databases 100X authors; single GPU http://www.biomedcentral.com/1471-2105/13/S14/S13 Empirical Free Released University of Bristol BUDE Energy Forcefield 6.5-13.4X Single GPU http://www.bris.ac.uk/biochemistry/cpfg/bude/bude.htm Released, Suite 2011 Schrodinger, Inc. Core Hopping GPU accelerated application 3.75-5000X Single and multi-GPUs. http://www.schrodinger.com/products/14/32/ Real-time shape similarity Released Open Eyes Scientific Software FastROCS searching/comparison 800-3000X Single and multi-GPUs. http://www.eyesopen.com/fastrocs Lines: 460% increase Cartoons: 1246% increase Released, Version 1.5 PyMol Surface: 1746% increase 1700x Single GPUs http://pymol.org/ Spheres: 753% increase Ribbon: 426% increase High quality rendering, GPU Perf compared against Multi-core x86 CPU socket. large structures (100 million atoms), 100-125X or greater GPU Perf benchmarked on GPU supported features Visualization from University of Illinois at Urbana-Champaign VMD analysis and visualization tasks, multiple on kernels Released, Version 1.9 and mayhttp://www.ks.uiuc.edu/Research/vmd/ be a kernel to kernel perf comparison GPU support for display of molecular
  • 9. Bioinformatics Applications Features GPU Application Release Status Website Supported Speedup Alignment of short sequencing Version 0.6.2 – 3/2012 BarraCUDA reads 6-10x Multi-GPU, multi-node http://seqbarracuda.sourceforge.net/ Parallel search of Smith- Version 2.0.8 – Q1/2012 CUDASW++ Waterman database 10-50x Multi-GPU, multi-node http://sourceforge.net/projects/cudasw/ Parallel, accurate long read Version 1.0.40 – 6/2012 CUSHAW aligner for large genomes 10x Multiple-GPU http://cushaw.sourceforge.net/ Protein alignment according to Version 2.2.26 – 3/2012 http://eudoxus.cheme.cmu.edu/gpublast/gpu GPU-BLAST BLASTP 3-4x Single GPU blast.html Parallel local and global Version 2.3.2 – Q1/2012 http://www.mpihmmer.org/installguideGPUH GPU-HMMER search of Hidden Markov 60-100x Multi-GPU, multi-node MMER.htm Models Scalable motif discovery Version 3.0.12 https://sites.google.com/site/yongchaosoftwa mCUDA-MEME algorithm based on MEME 4-10x Multi-GPU, multi-node re/mcuda-meme Hardware and software for Released. SeqNFind reference assembly, blast, SW, 400x Multi-GPU, multi-node http://www.seqnfind.com/ HMM, de novo assembly Version 1.11 – 5/2012 UGENE Fast short read alignment 6-8x Multi-GPU, multi-node http://ugene.unipro.ru/ GPU Perf compared against same or similar code running on single CPU machine Parallel linear regression on Performance measured internally or independently
  • 11. MD Average Speedups The blue node contains Dual E5-2687W CPUs 10 (8 Cores per CPU). The green nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and 1 or 2 NVIDIA K10, K20, or Performance Relative to CPU Only 8 K20X GPUs. 6 4 2 0 CPU CPU + K10 CPU + K20 CPU + K20X CPU + 2x K10 CPU + 2x K20 CPU + 2x K20X Average speedup calculated from 4 AMBER, 3 NAMD, 3 LAMMPS, and 1 GROMACS test cases. Error bars show the maximum and minimum speedup for each hardware configuration.
  • 12. Molecular Dynamics (MD) Applications Features Application GPU Perf Release Status Notes/Benchmarks Supported > 100 ns/day AMBER 12, GPU Revision Support 12.2 PMEMD Explicit Solvent & GB Released AMBER Implicit Solvent JAC NVE on 2X Multi-GPU, multi-node http://ambermd.org/gpus/benchmarks. K20s htm#Benchmarks 2x C2070 equals Release C37b1; Implicit (5x), Explicit (2x) Released CHARMM Solvent via OpenMM 32-35x X5667 Single & multi-GPU in single node http://www.charmm.org/news/c37b1.html#po CPUs stjump Two-body Forces, Link-cell Source only, Results Published Release V 4.03 DL_POLY Pairs, Ewald SPME forces, 4x Multi-GPU, multi-node http://www.stfc.ac.uk/CSE/randd/ccg/softwa Shake VV re/DL_POLY/25526.aspx 165 ns/Day Released GROMACS Implicit (5x), Explicit (2x) DHFR on Multi-GPU, multi-node Release 4.6; 1st Multi-GPU support 4X C2075s http://lammps.sandia.gov/bench.html#deskto Lennard-Jones, Gay-Berne, Released. LAMMPS Tersoff & many more potentials 3.5-18x on Titan Multi-GPU, multi-node p and http://lammps.sandia.gov/bench.html#titan 4.0 ns/days Released Full electrostatics with PME and NAMD most simulation features F1-ATPase on 100M atom capable NAMD 2.9 1x K20X Multi-GPU, multi-node GPU Perf compared against Multi-core x86 CPU socket. GPU Perf benchmarked on GPU supported features and may be a kernel to kernel perf comparison
  • 13. New/Additional MD Applications Ramping Features Application GPU Perf Release Status Notes Supported 4-29X Released, Version 1.8.51 Abalone Simulations (on 1060 GPU) (on 1060 GPU) Single GPU Agile Molecule, Inc. Computation of non-valent 4-29X Released, Version 1.1.4 Ascalaph interactions (on 1060 GPU) Single GPU Agile Molecule, Inc. 150 ns/day DHFR on Released Production bio-molecular dynamics (MD) ACEMD Written for use only on GPUs 1x K20 Single and multi-GPUs software specially optimized to run on GPUs Powerful distributed computing Depends upon Released; http://folding.stanford.edu Folding@Home molecular dynamics system; number of GPUs GPUs and CPUs GPUs get 4X the points of CPUs implicit solvent and folding High-performance all-atom Depends upon Released; http://www.gpugrid.net/ GPUGrid.net biomolecular simulations; number of GPUs NVIDIA GPUs only explicit solvent and binding Simple fluids and binary mixtures (pair potentials, high- Up to 66x on 2090 Released, Version 0.2.0 http://halmd.org/benchmarks.html#supercool HALMD precision NVE and NVT, dynamic vs. 1 CPU core Single GPU ed-binary-mixture-kob-andersen correlations) Kepler 2X faster Released, Version 0.11.2 http://codeblue.umich.edu/hoomd-blue/ HOOMD-Blue Written for use only on GPUs than Fermi Single and multi-GPU on 1 node Multi-GPU w/ MPI in March 2013 Implicit: 127-213 Implicit and explicit solvent, Released Version 4.1.1 Library and application for molecular dynamics OpenMM custom forces ns/day Explicit: 18- Multi-GPU on high-performance 55 ns/day DHFR GPU Perf compared against Multi-core x86 CPU socket. GPU Perf benchmarked on GPU supported features and may be a kernel to kernel perf comparison
• 14. Computational Chemistry — Built from Ground Up for GPUs
What: Study disease & discover drugs; predict drug and protein interactions.
Why: Speed of simulations is critical; enables study of longer timeframes, larger systems, and more simulations.
How: GPUs increase throughput & accelerate simulations.
GPU-ready applications: Abalone, ACEMD, AMBER, DL_POLY, GAMESS, GROMACS, LAMMPS, NAMD, NWChem, Q-CHEM, Quantum Espresso, TeraChem
AMBER 11 application example: 4.6x performance increase with 2 GPUs at only a 54% added cost*
• AMBER 11 Cellulose NPT on 2x E5670 CPUs + 2x Tesla C2090s (per node) vs. 2x E5670 CPUs (per node)
• Cost of CPU node assumed to be $9,333; cost of adding two (2) C2090s to a single node is assumed to be $5,333
  • 15. AMBER 12 GPU Support Revision 12.2 1/22/2013 15
• 16. Kepler — Our Fastest Family of GPUs Yet
Factor IX running AMBER 12 GPU Support Revision 12.1. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green nodes contain dual E5-2687W CPUs plus 1x NVIDIA M2090, K10, K20, or K20X.
Nanoseconds/day: 1 CPU node 3.42; + M2090 11.85 (3.5x); + K10 18.90 (5.6x); + K20 22.44 (6.6x); + K20X 25.39 (7.4x).
GPU speedup/throughput increased from 3.5x (with M2090) to 7.4x (with K20X) when compared to a CPU-only node.
• 17. K10 Accelerates Simulations of All Sizes
Running AMBER 12 GPU Support Revision 12.1. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green nodes contain the same CPUs plus 1x NVIDIA K10 GPU.
Speedup vs. the CPU-only node across TRPcage (GB), JAC NVE (PME), Factor IX NVE (PME), Cellulose NVE (PME), Myoglobin (GB), and Nucleosome (GB) ranges from 2.0x to 24x.
Gain 24x performance on Nucleosome by adding just 1 GPU, compared to dual-CPU performance.
• 18. K20 Accelerates Simulations of All Sizes
Running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off. The blue node contains 2x Intel E5-2687W CPUs (8 cores per CPU); each green node adds 1x NVIDIA K20 GPU.
Speedup vs. the CPU-only node across TRPcage (GB), JAC NVE (PME), Factor IX NVE (PME), Cellulose NVE (PME), Myoglobin (GB), and Nucleosome (GB) ranges from 2.66x to 28x.
Gain 28x throughput on Nucleosome by adding just one K20 GPU, compared to dual-CPU performance. (AMBER Benchmark Report, Revision 2.0, dated Nov. 5, 2012)
• 19. K20X Accelerates Simulations of All Sizes
Running AMBER 12 GPU Support Revision 12.1. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green nodes add 1x NVIDIA K20X GPU.
Speedup vs. the CPU-only node across TRPcage (GB), JAC NVE (PME), Factor IX NVE (PME), Cellulose NVE (PME), Myoglobin (GB), and Nucleosome (GB) ranges from 2.79x to 31.3x.
Gain 31x performance on Nucleosome by adding just one K20X GPU, compared to dual-CPU performance.
• 20. K10 Strong Scaling over Nodes
Cellulose, 408K atoms (NPT), running AMBER 12 with CUDA 4.2, ECC off. The blue nodes contain 2x Intel X5670 CPUs (6 cores per CPU); the green nodes add 2x NVIDIA K10 GPUs each.
GPU-over-CPU speedups of 2.4x-5.1x across 1, 2, and 4 nodes, largest on a single node, with nanoseconds/day rising with node count in both configurations.
GPUs significantly outperform CPUs while scaling over multiple nodes.
• 21. Kepler — Universally Faster
Running AMBER 12 GPU Support Revision 12.1. The CPU-only node contains dual E5-2687W CPUs (8 cores per CPU); the Kepler nodes contain the same CPUs plus 1x NVIDIA K10, K20, or K20X GPU.
Speedups vs. CPU-only are plotted for JAC, Factor IX, and Cellulose on CPU + K10, CPU + K20, and CPU + K20X. The Kepler GPUs accelerated all simulations, up to 8x.
• 22. K10 Extreme Performance
DHFR (JAC), 23K atoms (NVE), running AMBER 12 GPU Support Revision 12.1. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green node contains the same CPUs plus 2x NVIDIA K10 GPUs.
Nanoseconds/day: 12.47 (CPU-only node) vs. 97.99 (node with 2x K10).
Gain 7.8x performance by adding just 2 GPUs, compared to dual-CPU performance.
• 23. K20 Extreme Performance
DHFR (JAC), 23K atoms (NVE), running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off. The blue node contains 2x Intel E5-2687W CPUs (8 cores per CPU); the green node adds 2x NVIDIA K20 GPUs.
Nanoseconds/day: 12.47 (CPU-only node) vs. 95.59 (node with 2x K20).
Gain >7.5x throughput by adding just 2 K20 GPUs, compared to dual-CPU performance.
• 24. Replace 8 Nodes with 1 K20 GPU
Running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off. The eight blue nodes each contain 2x Intel E5-2687W CPUs (8 cores per CPU); the green node contains the same CPUs plus 1x NVIDIA K20 GPU.
DHFR nanoseconds/day: 65.00 for the 8-node CPU cluster ($32,000) vs. 81.09 for the single GPU node ($6,500).
Note: typical CPU and GPU node pricing used; pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.
Cut simulation costs to 1/4 and gain higher performance.
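The cost claims on these slides reduce to simple throughput-per-dollar arithmetic. A minimal sketch using the DHFR figures quoted above (65.00 ns/day for eight CPU nodes at $32,000 vs. 81.09 ns/day for one K20 node at $6,500); the function name is illustrative, not from any benchmark tool:

```python
def perf_per_dollar(ns_per_day, cost_usd):
    """Throughput bought per hardware dollar (ns/day per $)."""
    return ns_per_day / cost_usd

cpu_cluster = perf_per_dollar(65.00, 32_000)  # eight dual-socket CPU nodes
gpu_node = perf_per_dollar(81.09, 6_500)      # one CPU node + 1x K20
advantage = gpu_node / cpu_cluster            # ~6x better price-performance
```

The GPU node wins on both axes at once: higher absolute throughput and roughly a sixth of the hardware cost.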
• 25. Replace 7 Nodes with 1 K10 GPU
Performance on JAC (DHFR) NVE vs. cost, running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off. The eight blue nodes each contain 2x Intel E5-2687W CPUs (8 cores per CPU); the green node contains the same CPUs plus 1x NVIDIA K10 GPU.
Cost: $32,000 for the CPU-only cluster vs. $7,000 for the GPU-enabled node. Note: typical CPU and GPU node pricing used; pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.
Cut simulation costs to about 1/4 and increase performance by 70%.
• 26. Extra CPUs Decrease Performance
Cellulose NVE, running AMBER 12 GPU Support Revision 12.1. The orange bars use one E5-2687W CPU (8 cores); the blue bars use dual E5-2687W CPUs. Configurations compared: 1 CPU, 2 CPUs, 1 CPU + 2x K20, and 2 CPUs + 2x K20.
When used with GPUs, dual CPU sockets perform worse than single CPU sockets.
• 27. Kepler — Greener Science
Energy used in simulating 1 ns of DHFR (JAC), running AMBER 12 GPU Support Revision 12.1. The blue node contains dual E5-2687W CPUs (150W each, 8 cores per CPU); the green nodes add 1x NVIDIA K10, K20, or K20X GPU (235W each). Lower is better; energy expended = power x time.
The GPU-accelerated systems use 65-75% less energy.
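The "energy expended = power x time" relation behind these charts is easy to reproduce. A sketch assuming the TDP and throughput numbers quoted in this slide series (dual 150 W CPUs, a 235 W K20X, and the Factor IX throughputs of 3.42 and 25.39 ns/day from the earlier Kepler chart); the function is illustrative:

```python
def energy_kj(node_power_w, ns_per_day, sim_ns=1.0):
    """Energy (kJ) consumed to simulate `sim_ns` nanoseconds at a given throughput."""
    wall_seconds = sim_ns / ns_per_day * 86_400.0  # fraction of a day -> seconds
    return node_power_w * wall_seconds / 1_000.0   # W * s = J; /1000 -> kJ

cpu_only = energy_kj(2 * 150, 3.42)        # CPU-only node on Factor IX
cpu_gpu = energy_kj(2 * 150 + 235, 25.39)  # same node + 1x K20X
savings = 1.0 - cpu_gpu / cpu_only         # fraction of energy saved
```

With these inputs `savings` comes out around three quarters: the GPU node draws more power but finishes so much sooner that total energy drops, consistent with the 65-75% figure reported here.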
• 28. Recommended GPU Node Configuration for AMBER
Computational chemistry workstation or single-node configuration:
- # of CPU sockets: 2
- Cores per CPU socket: 4+ (1 CPU core drives 1 GPU)
- CPU speed (GHz): 2.66+
- System memory per node (GB): 16
- GPUs: Kepler K10, K20, K20X; Fermi M2090, M2075, C2075
- # of GPUs per CPU socket: 1-2 (4 GPUs on 1 socket is good for 4 fast serial GPU runs)
- GPU memory preference (GB): 6
- GPU-to-CPU connection: PCIe 2.0 16x or higher
- Server storage: 2 TB
Scale to multiple nodes with the same single-node configuration.
• 29. Benefits of GPU-Accelerated AMBER Computing
- Faster than CPU-only systems in all tests
- Most major compute-intensive aspects of classical MD ported
- Large performance boost with a marginal price increase
- Energy usage cut by more than half
- GPUs scale well within a node and over multiple nodes
- The K20 GPU is our fastest and lowest-power high-performance GPU yet
Try GPU-accelerated AMBER for free: www.nvidia.com/GPUTestDrive
• 31. Kepler — Our Fastest Family of GPUs Yet
ApoA1 (Apolipoprotein A1) running NAMD version 2.9. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green nodes contain the same CPUs plus 1x NVIDIA M2090, K10, K20, or K20X.
Nanoseconds/day: CPU node 1.37; + M2090 2.63 (1.9x); + K10 3.45 (2.5x); + K20 3.57 (2.6x); + K20X 4.00 (2.9x).
GPU speedup/throughput increased from 1.9x (with M2090) to 2.9x (with K20X) when compared to a CPU-only node. (NAMD Benchmark Report, Revision 2.0, dated Nov. 5, 2012)
• 32. Accelerates Simulations of All Sizes
Running NAMD 2.9 with CUDA 4.0, ECC off. The blue node contains 2x Intel E5-2687W CPUs (8 cores per CPU); each green node adds 1x NVIDIA K20 GPU.
Speedups vs. CPU-only of 2.4x-2.7x across ApoA1, F1-ATPase, and STMV.
Gain 2.5x throughput/performance by adding just 1 GPU, compared to dual-CPU performance.
• 33. Kepler — Universally Faster
Running NAMD version 2.9 on F1-ATPase, ApoA1, and STMV. The CPU-only node contains dual E5-2687W CPUs (8 cores per CPU); the Kepler nodes add 1 or 2 NVIDIA K10, K20, or K20X GPUs.
Average speedups (printed in bars): 1x K10 2.4x, 1x K20 2.6x, 1x K20X 2.9x, 2x K10 4.3x, 2x K20 4.7x, 2x K20X 5.1x.
The Kepler GPUs accelerate all simulations, up to 5x.
• 34. Outstanding Strong Scaling with Multi-STMV
100-STMV (a concatenation of 100 Satellite Tobacco Mosaic Virus systems) on hundreds of nodes, running NAMD version 2.9. Each blue XE6 CPU node contains 1x AMD Opteron 1600 (16 cores per CPU); each green XK6 node adds 1x NVIDIA X2090 GPU.
Speedups of 2.7x-3.8x across 32-768 nodes.
Accelerate your science by 2.7-3.8x compared to CPU-based supercomputers.
• 35. Replace 3 Nodes with 1 M2090 GPU
F1-ATPase running NAMD version 2.9. Each blue node contains 2x Intel Xeon X5550 CPUs (4 cores per CPU); the green node contains the same CPUs plus 1x NVIDIA M2090 GPU.
Nanoseconds/day: 0.63 for 4 CPU nodes ($8,000) vs. 0.74 for 1 CPU node + 1x M2090 ($4,000). Note: typical CPU and GPU node pricing used; pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.
Speedup of 1.2x for 50% of the cost.
• 36. K20 — Greener: Twice the Science per Watt
Energy used in simulating 1 ns of ApoA1, running NAMD version 2.9. Each blue node contains dual E5-2687W CPUs (95W, 4 cores per CPU); each green node contains 2x Intel Xeon X5550 CPUs (95W, 4 cores per CPU) and 2x NVIDIA K20 GPUs (225W per GPU). Lower is better; energy expended = power x time.
Cut energy usage in half with GPUs.
• 37. Kepler — Greener: Twice the Science per Joule
Energy used in simulating 1 ns of STMV (Satellite Tobacco Mosaic Virus), running NAMD version 2.9. The blue node contains dual E5-2687W CPUs (150W each, 8 cores per CPU); the green nodes add 2x NVIDIA K10, K20, or K20X GPUs (235W each). Lower is better; energy expended = power x time.
Cut energy usage in half with GPUs.
• 38. Recommended GPU Node Configuration for NAMD
Computational chemistry workstation or single-node configuration:
- # of CPU sockets: 2
- Cores per CPU socket: 6+
- CPU speed (GHz): 2.66+
- System memory per socket (GB): 32
- GPUs: Kepler K10, K20, K20X; Fermi M2090, M2075, C2075
- # of GPUs per CPU socket: 1-2
- GPU memory preference (GB): 6
- GPU-to-CPU connection: PCIe 2.0 or higher
- Server storage: 500 GB or higher
- Network configuration: Gemini, InfiniBand
Scale to multiple nodes with the same single-node configuration.
• 39. Summary/Conclusions — Benefits of GPU-Accelerated Computing
- Faster than CPU-only systems in all tests
- Large performance boost with a small marginal price increase
- Energy usage cut in half
- GPUs scale very well within a node and over multiple nodes
- The Tesla K20 GPU is our fastest and lowest-power high-performance GPU to date
Try GPU-accelerated NAMD for free: www.nvidia.com/GPUTestDrive
  • 40. LAMMPS, Jan. 2013 or later
• 41. More Science for Your Money
Embedded Atom Model (EAM). The blue node uses 2x E5-2687W CPUs (8 cores and 150W per CPU); the green nodes add 1 or 2 NVIDIA K10, K20, or K20X GPUs (235W).
Speedups vs. CPU-only range from 1.7x (single K10) up to 5.5x (dual K20X).
Experience performance increases of up to 5.5x with Kepler GPU nodes.
• 42. K20X, the Fastest GPU Yet
The blue node uses 2x E5-2687W CPUs (8 cores and 150W per CPU); the green nodes add 2x NVIDIA M2090s, 1x K20X, or 2x K20X GPUs (235W).
Experience performance increases of up to 6.2x with Kepler GPU nodes. One K20X performs as well as two M2090s.
• 43. Get a CPU Rebate to Fund Part of Your GPU Budget
Acceleration in loop-time computation by additional GPUs, running NAMD version 2.9. The blue node contains dual X5670 CPUs (6 cores per CPU); the green nodes contain dual X5570 CPUs (4 cores per CPU) and 1-4 NVIDIA M2090 GPUs.
Speedup normalized to CPU-only: 5.31 (1x M2090), 9.88 (2x), 12.9 (3x), 18.2 (4x).
Increase performance 18x compared to CPU-only nodes. Cheaper CPUs used with GPUs still deliver faster overall performance than more expensive CPUs alone.
• 44. Excellent Strong Scaling on Large Clusters
LAMMPS Gay-Berne, 134M atoms. Each blue Cray XE6 node has 2x AMD Opteron CPUs (16 cores per CPU); each green Cray XK6 node has 1x AMD Opteron 1600 CPU (16 cores per CPU) and 1x NVIDIA X2090.
From 300 to 900 nodes, the NVIDIA GPU-powered XK6 maintained roughly 3.5x (3.45-3.55x) the performance of the XE6 CPU nodes.
• 45. GPUs Sustain 5x Performance for Weak Scaling
Weak scaling with 32K atoms per node, from 1 to 729 nodes. Each blue Cray XE6 node has 2x AMD Opteron CPUs (16 cores per CPU); each green Cray XK6 node has 1x AMD Opteron 1600 CPU (16 cores per CPU) and 1x NVIDIA X2090.
GPU-accelerated nodes delivered 4.8x-6.7x the performance of CPU-only nodes.
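"Weak scaling" here means the problem grows with the machine (32K atoms per node), so the ideal loop time stays flat and the relevant metric is the ratio of CPU-only to GPU-accelerated loop time at the same node count. A sketch of that computation; the loop times below are illustrative placeholders, not the chart's actual values:

```python
def gpu_speedup(cpu_loop_s, gpu_loop_s):
    """LAMMPS reports wall-clock 'loop time'; lower is better,
    so speedup is CPU time divided by GPU time at equal node count."""
    return cpu_loop_s / gpu_loop_s

# Illustrative loop times only (node count -> (CPU-only s, with-GPU s)):
pairs = {1: (30.0, 4.5), 216: (34.0, 6.0), 729: (38.0, 7.9)}
speedups = {n: gpu_speedup(c, g) for n, (c, g) in pairs.items()}
```

If the per-node speedup stays in a narrow band as nodes are added, as it does in the 4.8x-6.7x range above, the GPU advantage is sustained rather than eroded by communication overhead.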
• 46. Faster, Greener — Worth It!
Energy consumed in one loop of EAM; lower is better. GPU-accelerated computing uses 53% less energy than CPU-only. Energy expended = power x time; power calculated by combining the components' TDPs.
The blue node uses 2x E5-2687W CPUs (8 cores and 150W per CPU) and CUDA 4.2.9; the green nodes add 1 or 2 NVIDIA K20X GPUs (235W) running CUDA 5.0.36.
Try GPU-accelerated LAMMPS for free: www.nvidia.com/GPUTestDrive
  • 47. Molecular Dynamics with LAMMPS on a Hybrid Cray Supercomputer W. Michael Brown National Center for Computational Sciences Oak Ridge National Laboratory NVIDIA Technology Theater, Supercomputing 2012 November 14, 2012
• 48. Early Kepler Benchmarks on Titan
Log-scale loop-time charts for the Atomic Fluid and Bulk Copper benchmarks, comparing XK6 (CPU-only), XK6+GPU, and XK7+GPU configurations from 1 up to 16,384 nodes. (Chart data not reproduced here.)
• 49. Early Kepler Benchmarks on Titan
Corresponding loop-time charts for the Protein and Liquid Crystal benchmarks, again comparing XK6, XK6+GPU, and XK7+GPU configurations over the same node range. (Chart data not reproduced here.)
• 50. Early Titan XK6/XK7 Benchmarks
Speedup with acceleration on XK6/XK7 nodes (1 node = 32K particles; 900 nodes = 29M particles):
                   Atomic Fluid    Atomic Fluid    Bulk Copper   Protein   Liquid Crystal
                   (cutoff 2.5σ)   (cutoff 5.0σ)
XK6 (1 node)       1.92            4.33            2.12          2.60      5.82
XK7 (1 node)       2.90            8.38            3.66          3.36      15.70
XK6 (900 nodes)    1.68            3.96            2.15          1.56      5.60
XK7 (900 nodes)    2.75            7.48            2.86          1.95      10.14
• 51. Recommended GPU Node Configuration for LAMMPS
Computational chemistry workstation or single-node configuration:
- # of CPU sockets: 2
- Cores per CPU socket: 6+
- CPU speed (GHz): 2.66+
- System memory per socket (GB): 32
- GPUs: Kepler K10, K20, K20X; Fermi M2090, M2075, C2075
- # of GPUs per CPU socket: 1-2
- GPU memory preference (GB): 6
- GPU-to-CPU connection: PCIe 2.0 or higher
- Server storage: 500 GB or higher
- Network configuration: Gemini, InfiniBand
Scale to multiple nodes with the same single-node configuration.
  • 52. GROMACS 4.6 Final, Pre-Beta and 4.6 Beta
• 54. Great Scaling in Small Systems
Running GROMACS 4.6 pre-beta with CUDA 4.1. Each blue node contains 1x Intel X5550 CPU (95W TDP, 4 cores per CPU); each green node adds 1x NVIDIA M2090 (225W TDP per GPU).
Benchmark system: RNase in water, 16,816 atoms in a truncated dodecahedron box. GPU speedups of 3.2x-3.7x across 1-3 nodes, reaching 21.68 ns/day with GPUs at 3 nodes.
Get up to 3.7x performance compared to CPU-only nodes.
• 55. Additional Strong Scaling on a Larger System
128K water molecules, running GROMACS 4.6 pre-beta with CUDA 4.1. Each blue node contains 1x Intel X5670 (95W TDP, 6 cores per CPU); each green node adds 1x NVIDIA M2070 (225W TDP per GPU).
Speedups range from about 3.1x at small node counts to about 2x at 128 nodes.
Up to 128 nodes, NVIDIA GPU-accelerated nodes deliver 2-3x the performance of CPU-only nodes.
• 56. Replace 3 Nodes with 2 GPUs
ADH in water (134K atoms), running GROMACS 4.6 pre-beta with CUDA 4.1. The blue configuration is 4 CPU nodes, each with 2x Intel X5550 CPUs (95W TDP, 4 cores per CPU); the green node contains 2x Intel X5550 CPUs plus 2x NVIDIA M2090s (225W TDP per GPU).
Nanoseconds/day: 6.7 for the 4 CPU nodes ($8,000) vs. 8.36 for 1 node + 2x M2090 ($6,500). Note: typical CPU and GPU node pricing used; pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.
Save thousands of dollars and perform 25% faster.
• 57. Greener Science
ADH in water (134K atoms), running GROMACS 4.6 with CUDA 4.1. The blue configuration is 4 nodes, each with 2x Intel X5550 CPUs (95W TDP, 4 cores per CPU; 760 watts total); the green configuration is 1 node with 2x X5550 CPUs and 2x NVIDIA M2090 GPUs (225W TDP per GPU; 640 watts). Lower is better; energy expended = power x time.
In simulating each nanosecond, the GPU-accelerated system uses 33% less energy.
• 58. The Power of Kepler
RNase solvated protein, 24K atoms, running GROMACS version 4.6 beta. The grey nodes contain 1 or 2 E5-2687W CPUs (150W each, 8 cores per CPU) and 1 or 2 NVIDIA M2090s; the green nodes use the same CPU configurations with 1 or 2 K20X GPUs (235W each) instead.
Upgrading from an M2090 to a K20X increases performance by 10-45%.
• 59. K20X — Fast
RNase solvated protein, 24K atoms, running GROMACS version 4.6 beta. The blue nodes contain 1 or 2 E5-2687W CPUs (150W each, 8 cores per CPU); the green nodes add 1x NVIDIA K20X GPU (235W).
Adding a K20X increases performance by up to 3x.
• 60. K20X, the Fastest Yet
192K water molecules, running GROMACS version 4.6-beta2 and CUDA 5.0.35. The blue node contains 2x E5-2687W CPUs (150W each, 8 cores per CPU); the green nodes add 1 or 2 NVIDIA K20X GPUs (235W each).
Using K20X nodes increases performance by 2.5x.
Try GPU-accelerated GROMACS 4.6 for free: www.nvidia.com/GPUTestDrive
• 61. Recommended GPU Node Configuration for GROMACS
Computational chemistry workstation or single-node configuration:
- # of CPU sockets: 2
- Cores per CPU socket: 6+
- CPU speed (GHz): 2.66+
- System memory per socket (GB): 32
- GPUs: Kepler K10, K20, K20X; Fermi M2090, M2075, C2075
- # of GPUs per CPU socket: 1x Kepler-based GPU (K20X, K20, or K10); needs a fast Sandy Bridge, the very fastest Westmeres, or high-end AMD Opterons
- GPU memory preference (GB): 6
- GPU-to-CPU connection: PCIe 2.0 or higher
- Server storage: 500 GB or higher
Scale to multiple nodes with the same single-node configuration.
• 63. GPUs Outperform CPUs
Daresbury Crambin, 19.6K atoms, running CHARMM release C37b1. The blue configuration uses 44 X5667 CPUs (95W, 4 cores per CPU); the green nodes contain 2 X5667 CPUs and 1 or 2 NVIDIA C2070 GPUs (238W each).
Comparable nanoseconds/day at very different cost: 44x X5667 ($44,000) vs. 2x X5667 + 1x C2070 ($3,000) vs. 2x X5667 + 2x C2070 ($4,000). Note: typical CPU and GPU node pricing used; pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.
1 GPU = 15 CPUs.
• 64. More Bang for Your Buck
Daresbury Crambin, 19.6K atoms, running CHARMM release C37b1. The blue configuration uses 44 X5667 CPUs (95W, 4 cores per CPU); the green nodes contain 2 X5667 CPUs and 1 or 2 NVIDIA C2070 GPUs (238W). Note: typical CPU and GPU node pricing used; pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.
Using GPUs delivers 10.6x the performance for the same cost.
• 65. Greener Science with NVIDIA
Energy used in simulating 1 ns of Daresbury G1nBP, 61.2K atoms, running CHARMM release C37b1. The blue configuration uses 64 X5667 CPUs (95W, 4 cores per CPU); the green nodes contain 2 X5667 CPUs and 1 or 2 NVIDIA C2070 GPUs (238W each). Lower is better; energy expended = power x time.
Using GPUs decreases energy use by 75%.
• 66. ACEMD (www.acellera.com)
470 ns/day on 1 GPU for L-Iduronic acid (1,362 atoms); 116 ns/day on 1 GPU for DHFR (23K atoms).
M. Harvey, G. Giupponi and G. De Fabritiis, "ACEMD: Accelerated molecular dynamics simulations in the microseconds timescale," J. Chem. Theory Comput. 5, 1632 (2009)
• 67. ACEMD (www.acellera.com)
Features: NVT, NPT, PME, TCL, PLUMED, CAMSHIFT[1]
1. M. J. Harvey and G. De Fabritiis, "An implementation of the smooth particle-mesh Ewald (PME) method on GPU hardware," J. Chem. Theory Comput., 5, 2371-2377 (2009)
2. For a list of selected references see http://www.acellera.com/acemd/publications
• 69. Quantum Chemistry Applications
- Abinit: local Hamiltonian, non-local Hamiltonian, LOBPCG algorithm, diagonalization/orthogonalization; 1.3-2.7x; released since Version 6.12, multi-GPU support; www.abinit.org
- ACES III: integrating GPU scheduling into the SIAL programming language and SIP runtime environment; 10x on kernels; under development, multi-GPU support; http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/deumens_ESaccel_2012.pdf
- ADF: Fock matrix, Hessians; TBD; pilot project completed, under development, multi-GPU support; www.scm.com
- BigDFT: DFT, Daubechies wavelets, part of Abinit; 5-25x (1 CPU core to GPU kernel); released June 2009, current release 1.6, multi-GPU support; http://inac.cea.fr/L_Sim/BigDFT/news.html, http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/BigDFT-Formalism.pdf and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/BigDFT-HPC-tues.pdf
- Casino: TBD; under development, Spring 2013 release, multi-GPU support; http://www.tcm.phy.cam.ac.uk/~mdt26/casino.html
- CP2K: DBCSR (sparse matrix multiply library); 2-7x; under development, multi-GPU support; http://www.olcf.ornl.gov/wp-content/training/ascc_2012/friday/ACSS_2012_VandeVondele_s.pdf
GPU perf compared against a multi-core x86 CPU socket. GPU perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.
• 70. Quantum Chemistry Applications
- GAMESS-UK: (ss|ss)-type integrals within calculations using Hartree-Fock ab initio methods and density functional theory; supports organics & inorganics; 8x; release in Summer 2012, multi-GPU support; http://www.ncbi.nlm.nih.gov/pubmed/21541963
- Gaussian: joint PGI, NVIDIA & Gaussian collaboration; TBD; under development, multi-GPU support; announced Aug. 29, 2011, http://www.gaussian.com/g_press/nvidia_press.htm
- GPAW: electrostatic Poisson equation, orthonormalizing of vectors, residual minimization method (RMM-DIIS); 8x; released, multi-GPU support; https://wiki.fysik.dtu.dk/gpaw/devel/projects/gpu.html; Samuli Hakala (CSC Finland) & Chris O'Grady (SLAC)
- Jaguar: investigating GPU acceleration; TBD; under development, multi-GPU support; Schrodinger, Inc., http://www.schrodinger.com/kb/278
- LSMS: generalized Wang-Landau method; 3x with 32 GPUs vs. 32 (16-core) CPUs; under development, multi-GPU support; NICS Electronic Structure Determination Workshop 2012, http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Eisenbach_OakRidge_February.pdf
- MOLCAS: CUBLAS support; 1.1x; released, Version 7.8, single GPU; additional GPU support coming in Version 8; www.molcas.org
- Molpro: density-fitted MP2 (DF-MP2), density-fitted local correlation; 1.7-2.3x; under development; www.molpro.net
• 71. Quantum Chemistry Applications
- MOPAC2009: pseudodiagonalization, full diagonalization, and density matrix assembling; 3.8-14x; under development, single GPU; academic port, http://openmopac.net
- NWChem: triples part of Reg-CCSD(T), CCSD & EOMCCSD task schedulers; 3-10x projected; development GPGPU benchmarks, release targeting end of 2012, multiple GPUs; www.nwchem-sw.org and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Krishnamoorthy-ESCMA12.pdf
- Octopus: DFT and TDDFT; TBD; released; http://www.tddft.org/programs/octopus/
- PEtot: density functional theory (DFT) plane-wave pseudopotential calculations; 6-10x; released, multi-GPU; first-principles materials code that computes the behavior of the electron structures of materials
- Q-CHEM: RI-MP2; 8x-14x; released, Version 4.0; http://www.q-chem.com/doc_for_web/qchem_manual_4.0.pdf
• 72. Quantum Chemistry Applications
- QMCPACK: main features; 3-4x; released, multiple GPUs; NCSA, University of Illinois at Urbana-Champaign, http://cms.mcc.uiuc.edu/qmcpack/index.php/GPU_version_of_QMCPACK
- Quantum Espresso/PWscf: PWscf package — linear algebra (matrix multiply), explicit computational kernels, 3D FFTs; 2.5-3.5x; released, Version 5.0, multiple GPUs; created by the Irish Centre for High-End Computing; http://www.quantum-espresso.org/index.php and http://www.quantum-espresso.org/
- TeraChem: "full GPU-based solution", completely redesigned to exploit GPU parallelism; 44-650x vs. GAMESS CPU version; released, Version 1.5, multi-GPU/single node; YouTube: http://youtu.be/EJODzk6RFxE?hd=1 and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Luehr-ESCMA.pdf
- VASP: hybrid Hartree-Fock DFT functionals including exact exchange; 2 GPUs comparable to 128 CPU cores; available on request, multiple GPUs; by Carnegie Mellon University, http://arxiv.org/pdf/1111.0716.pdf
  • 77. CP2K
• 78. Kepler, It's Faster
Running CP2K version 12413-trunk on CUDA 5.0.36. The blue node contains 2 E5-2687W CPUs (150W, 8 cores per CPU); the green nodes add 1 or 2 NVIDIA K10, K20, or K20X GPUs (235W each).
Using GPUs delivers up to 12.6x the performance per node.
• 79. Strong Scaling
Conducted on a Cray XK6 using matrix-matrix multiplication, NREP=6 and N=159,000 with 50% occupation; speedup measured relative to 256 non-GPU cores.
GPU speedups increase as more nodes are added: 2.3x at 256 cores, 2.9x at 512 cores, and up to 3x at 768 cores.
• 80. Kepler, Keeping the Planet Green
Running CP2K version 12413-trunk on CUDA 5.0.36. The blue node contains 2 E5-2687W CPUs (150W, 8 cores per CPU); the green nodes add 1 or 2 NVIDIA K20 GPUs (235W each). Lower is better; energy expended = power x time.
Using K20s lowers energy use by over 75% for the same simulation.
• 82. Gaussian
Key quantum chemistry code. ACS Fall 2011 press release: joint collaboration between Gaussian, NVIDIA, and PGI for GPU acceleration: http://www.gaussian.com/g_press/nvidia_press.htm. No such release exists for Intel MIC or AMD GPUs.
Mike Frisch quote: "Calculations using Gaussian are limited primarily by the available computing resources," said Dr. Michael Frisch, president of Gaussian, Inc. "By coordinating the development of hardware, compiler technology and application software among the three companies, the new application will bring the speed and cost-effectiveness of GPUs to the challenging problems and applications that Gaussian's customers need to address."
  • 84. GAMESS Partnership Overview Mark Gordon and Andrey Asadchev, key developers of GAMESS, in collaboration with NVIDIA. Mark Gordon is a recipient of a NVIDIA Professor Partnership Award. Quantum Chemistry one of major consumers of CPU cycles at national supercomputer centers NVIDIA developer resources fully allocated to GAMESS code “ We like to push the envelope as much as we can in the direction of highly scalable efficient codes. GPU technology seems like a good way to achieve this goal. Also, since we are associated with a DOE Laboratory, energy efficiency is important, and this is another reason to explore quantum chemistry on GPUs. ” Prof. Mark Gordon Distinguished Professor, Department of Chemistry, Iowa State University and Director, Applied Mathematical Sciences Program, AMES Laboratory 84
• 85. GAMESS August 2011 GPU Performance: first GPU-supported GAMESS release via “libqc”, a library for fast quantum chemistry on multiple NVIDIA GPUs in multiple nodes, implemented in CUDA. Accelerates 2e- AO integrals and their assembly into a closed-shell Fock matrix. Performance shown for two small molecules, Ginkgolide (53 atoms) and Vancomycin (176 atoms), comparing 4x E5640 CPUs against 4x E5640 CPUs + 4x Tesla C2070s.
• 86. Upcoming GAMESS Q4 2012 Release: multiple nodes with multiple GPUs supported. Rys Quadrature Hartree-Fock: 8 CPU cores + M2070 yields a 2.3-2.9x speedup over 8 CPU cores (see the 2012 publication). Møller–Plesset perturbation theory (MP2): preliminary code completed, paper in development. Coupled Cluster SD(T): CCSD code completed, (T) in progress.
• 87. GAMESS - New Multithreaded Hybrid CPU/GPU Approach to Hartree-Fock. Adding a single C2070 GPU speeds up computations by 2.3x to 2.9x across seven test cases: Taxol with the 6-31G, 6-31G(d), 6-31G(2d,2p) and 6-31++G(d,p) basis sets, and Valinomycin with 6-31G, 6-31G(d) and 6-31G(2d,2p). * A. Asadchev, M.S. Gordon, “New Multithreaded Hybrid CPU/GPU Approach to Hartree-Fock,” Journal of Chemical Theory and Computation (2012)
  • 88. GPAW
• 99. NWChem - speedup of the non-iterative calculation for various configurations/tile sizes. System: a cluster of dual-socket nodes built from 8-core AMD Interlagos processors, 64 GB of memory and Tesla M2090 (Fermi) GPUs; the nodes are connected by a high-performance QDR InfiniBand interconnect. Courtesy of Kowalski, K., Bhaskaran-Nair, et al. @ PNNL, JCTC (submitted)
• 101. Kepler, Fast Science: AUSURF performance relative to CPU only, running Quantum Espresso version 5.0-build7 on CUDA 5.0.36. The blue node contains 2 E5-2687W CPUs (150W, 8 cores per CPU); the green nodes contain 2 E5-2687W CPUs and 1 or 2 NVIDIA M2090 or K10 GPUs (225W and 235W respectively). Using K10s delivers up to 11.7x the performance per node over CPUs, and 1.7x the performance compared to M2090s.
• 102. Extreme Performance/Price from 1 GPU: performance scaled to the CPU-only system. Simulations run on FERMI @ ICHEC; a 6-core 2.66 GHz Intel X5650 was used for the CPU and an NVIDIA C2050 for the GPU. Test cases: Shilu-3 and Water-on-Calcite (calcite structure). Adding a GPU can improve performance by 3.7x while only increasing price by 25%.
• 103. Extreme Performance/Price from 1 GPU: price and performance scaled to the CPU-only system. Simulations run on FERMI @ ICHEC; a 6-core 2.66 GHz Intel X5650 was used for the CPU and an NVIDIA C2050 for the GPU. Test cases: AUSURF112 at the k-point and at the gamma point (a calculation for a gold surface of 112 atoms). Adding a GPU can improve performance by 3.5x while only increasing price by 25%.
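The performance/price claim on these two slides is simple arithmetic; a sketch using the approximate workstation price and Shilu-3 walltimes given in the speaker notes (both figures come from those notes, not from list prices):

```python
# Assumed from the speaker notes: FERMI workstation ~$4000, one C2050 ~$1000;
# Shilu-3 walltimes 1025 s (CPU only) vs 275 s (CPU + 1 GPU).
base_price, gpu_price = 4000.0, 1000.0
cpu_time, gpu_time = 1025.0, 275.0

speedup = cpu_time / gpu_time                         # ~3.7x faster
price_factor = (base_price + gpu_price) / base_price  # 1.25x the cost
value = speedup / price_factor                        # ~3x more work per dollar
```

The same arithmetic with the AUSURF112 walltimes gives the 3.5x figure on this slide.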
• 104. Replace 72 CPUs with 8 GPUs: elapsed time (minutes) for LSMO-BFO (120 atoms), 8 k-points. Simulations run on PLX @ CINECA; Intel 6-core 2.66 GHz X5550s were used for the CPUs and NVIDIA M2070s for the GPUs. 120 CPUs ($42,000): 223 minutes; 48 CPUs + 8 GPUs ($32,800): 219 minutes. The GPU-accelerated setup performs faster and costs 24% less.
• 105. QE/PWscf - Green Science: power consumption (watts, lower is better) for LSMO-BFO (120 atoms), 8 k-points. Simulations run on PLX @ CINECA; Intel 6-core 2.66 GHz X5550s were used for the CPUs and NVIDIA M2070s for the GPUs. Configurations: 120 CPUs ($42,000) vs 48 CPUs + 8 GPUs ($32,800). Over a year, the lower power consumption would save $4,300 on energy bills.
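The $4,300 figure can be reproduced from the electricity price and per-simulation energy quoted in the speaker notes; a sketch (all figures are taken from those notes and are approximate):

```python
# From the speaker notes: US national average electricity price 9.83 cents/kWh;
# the CPU-only system uses 42.37 kWh per simulation over ~2357 runs/year,
# the CPU+GPU system 23.214 kWh per simulation over ~2400 runs/year.
rate = 0.0983  # $/kWh

cpu_bill = 42.37 * 2357 * rate    # ~ $9,816 per year
gpu_bill = 23.214 * 2400 * rate   # ~ $5,476 per year
savings  = cpu_bill - gpu_bill    # ~ $4,300 per year
```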
• 106. NVIDIA GPUs Use Less Energy: energy consumption (kWh, lower is better) on different tests. Simulations run on FERMI @ ICHEC; a 6-core 2.66 GHz Intel X5650 was used for the CPU and an NVIDIA C2050 for the GPU. Energy reductions of 54% to 58% across the Shilu-3, AUSURF112 and Water-on-Calcite tests: in all tests, the GPU-accelerated system consumed less than half the energy of the CPU-only system.
• 107. QE/PWscf - Great Strong Scaling in Parallel: CdSe-159 (cadmium selenide nanodots, 159 atoms), walltime of 1 full SCF (lower is better). Simulations run on STONEY @ ICHEC; two quad-core 2.87 GHz Intel X5560s were used in each node, with two NVIDIA M2090s per node for the CPU+GPU test. Across 2 to 14 nodes (16 to 112 CPU cores), GPU acceleration delivers speedups of 2.1x to 2.5x.
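Each speedup label on the chart is just the ratio of CPU-only to hybrid walltime at that node count; a sketch using the walltimes from the speaker notes:

```python
# Walltimes (s) for one full SCF of CdSe-159, from the speaker notes,
# at 2, 4, 6, 8, 10, 12 and 14 nodes (16 to 112 CPU cores).
nodes  = [2, 4, 6, 8, 10, 12, 14]
cpu    = [31000, 16500, 11000, 9500, 7500, 6000, 5500]
hybrid = [12500, 7000, 5000, 4500, 3500, 3000, 2500]

# Speedup at each node count is the ratio of the two walltimes.
speedups = [c / g for c, g in zip(cpu, hybrid)]  # from ~2.5x down to ~2.2x
```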
• 108. QE/PWscf - More Powerful Strong Scaling: GeSnTe134, walltime of a full SCF (lower is better). Simulations run on PLX @ CINECA; two 6-core 2.4 GHz Intel E5645s were used in each node, with two NVIDIA M2070s per node for the CPU+GPU test. Across 4 to 44 nodes (48 to 528 CPU cores), speedups range from 1.6x to 2.4x. Accelerate your cluster with NVIDIA GPUs. Try GPU-accelerated Quantum Espresso for free: www.nvidia.com/GPUTestDrive
• 110. TeraChem - Supercomputer Speeds on GPUs: time for one SCF step on the giant fullerene C240 molecule. TeraChem running on 8 C2050s in 1 node vs NWChem running on 4096 quad-core CPUs in the Chinook supercomputer. 4096 quad-core CPUs ($19,000,000) vs 8 C2050s ($31,000): similar performance from just a handful of GPUs.
• 111. TeraChem - Bang for the Buck: performance/price relative to the supercomputer for the giant fullerene C240 molecule. TeraChem running on 8 C2050s in 1 node vs NWChem running on 4096 quad-core CPUs in the Chinook supercomputer: 1 for the 4096 quad-core CPUs ($19,000,000) vs 493 for the 8 C2050s ($31,000). Note: typical CPU and GPU node pricing used; pricing may vary depending on node configuration; contact your preferred HW vendor for actual pricing. Dollars spent on GPUs do roughly 500x more science than those spent on CPUs.
• 112. Kepler’s Even Better: Olestra (453 atoms), TeraChem running on a C2050 and a K20C. The first graph shows BLYP/6-31G(d) walltimes (seconds), the second B3LYP/6-31G(d). Kepler (K20C) performs 2x faster than the Fermi-based C2050.
• 113. Viz, ―Docking‖ and Related Applications (growing list; features supported, GPU perf, release status, notes):
Amira 5®: 3D visualization of volumetric data and surfaces; 70x; released, Version 5.3.3, single GPU. Visualization from Visage Imaging; the next release, 5.4, will use the GPU for general-purpose processing in some functions. http://www.visageimaging.com/overview.html
BINDSURF: high-throughput parallel blind virtual screening, allows fast processing of large ligand databases; 100x; available upon request to authors, single GPU. http://www.biomedcentral.com/1471-2105/13/S14/S13
BUDE: empirical free energy forcefield; 6.5-13.4x; released, single GPU. University of Bristol. http://www.bris.ac.uk/biochemistry/cpfg/bude/bude.htm
Core Hopping: GPU-accelerated application; 3.75-5000x; released, Suite 2011, single and multi-GPU. Schrodinger, Inc. http://www.schrodinger.com/products/14/32/
FastROCS: real-time shape similarity searching/comparison; 800-3000x; released, single and multi-GPU. OpenEye Scientific Software. http://www.eyesopen.com/fastrocs
PyMol: lines 460% increase, cartoons 1246%, surface 1746%, spheres 753%, ribbon 426%; 1700x; released, Version 1.5, single GPU. http://pymol.org/
VMD: high-quality rendering of large structures (100 million atoms), analysis and visualization tasks, multiple GPU-supported features, GPU support for molecular display; 100-125x or greater on kernels; released, Version 1.9. University of Illinois at Urbana-Champaign. http://www.ks.uiuc.edu/Research/vmd/
GPU perf compared against a multi-core x86 CPU socket; GPU perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.
  • 114. FastROCS OpenEye Japan Hideyuki Sato, Ph.D. © 2012 OpenEye Scientific Software
• 115. ROCS on the GPU: FastROCS. Shape overlays per second, CPU vs GPU.
• 116. Riding Moore’s Law: FastROCS shape overlays per second across GPU generations, from the C1060 through the C2050, C2075 and C2090 to the K10 and K20.
• 117. FastROCS scaling across 4x K10s (2 physical GPUs per K10): conformers per second against the number of individual K10 GPUs, from 1 to 8. Dataset: 53 million conformers (10.9 million compounds of PubChem at 5 conformers per molecule). Note: each K10 has 2 physical GPUs on the board.
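At these rates a full-database screen takes only seconds; a rough sketch, where the sustained throughput is an assumption read off the chart’s vertical scale rather than a published figure:

```python
# Assumed throughput: suppose 8 K10 GPUs sustain ~8 million conformer
# overlays per second (the order of magnitude shown on the chart).
conformers = 53_000_000           # the PubChem set from the slide
throughput = 8_000_000            # conformers/second, assumed
seconds = conformers / throughput # one full-database query in a few seconds
```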
• 118. Benefits of GPU-Accelerated Computing: faster than CPU-only systems in all tests; large performance boost with a marginal price increase; energy usage cut by more than half; GPUs scale well within a node and over multiple nodes; the K20 is our fastest and most power-efficient high-performance GPU yet. Try GPU-accelerated TeraChem for free: www.nvidia.com/GPUTestDrive
• 119. GPU Test Drive: experience GPU acceleration. For computational chemistry researchers and biophysicists; preconfigured with molecular dynamics apps; remotely hosted GPU servers. Free and easy: sign up, log in and see results. www.nvidia.com/gputestdrive

Editor's Notes

• #4: Note the rise of GPU-only applications and GPU-grid applications. This indicates that GPUs are a sweet spot for MD.
• #5: Due to the great interest in speeding up quantum chemistry applications, NVIDIA has begun reaching out to these developers to see how it may assist their development on GPUs. These developers either have active GPU development projects or have released applications.
• #12: Benchmark table; columns are CPU, +K10, +K20, +K20X, +2x K10, +2x K20, +2x K20X. AMBER (ns/day): Cellulose 0.74, 4.34, 5.39, 6.14, 6.4, 6.78, 7.5; Factor 9 NVE 3.42, 18.9, 22.4, 25.4, 29.2, 28.1, 31.4; JAC NVE 12.47, 68.6, 81.1, 89.1, 98, 95.6, 102.1; TRPcage 210, 420, 559, 585, 418, 451, 475. NAMD: ApoA1 1.37, 3.45, 3.57, 4, 6.25, 6.67, 7.14; ATPase 0.46, 0.96, 1.12, 1.25, 1.78, 2.04, 2.22; STMV 0.115, 0.29, 0.31, 0.35, 0.52, 0.56, 0.61. LAMMPS: Fluid LJ (5) 1, 1.95, 1.11, 2.82, 4.38, 2.22, 3.62; EAM 1, 1.7, 2.47, 2.92, 3.3, 4.5, 5.5; Rhodopsin 1, 1.33, 0.77, 1.6, 2.35, 1.48, 2.28. GROMACS: RNase 46.7, 109, 120.
• #17: ns/day: Dual E5-2687W CPUs 3.4; Dual E5-2687W CPUs + M2090 11.9; + K10 18.9; + K20 22.4; + K20X 25.39.
• #18: CPU ns/day / GPU ns/day: Trpcage 210 / 420; JAC NVE 12.47 / 68.6; Factor 9 3.42 / 18.9; Cellulose 0.74 / 3.73; Myoglobin 6.12 / 122.3; Nucleosome 0.1 / 2.4.
• #19: CPU ns/day / GPU ns/day: TRPcage GB 210.32 / 559.32; JAC NVE PME 12.47 / 81.09; Factor IX NVE PME 3.42 / 22.44; Cellulose NVE PME 0.74 / 5.39; Myoglobin GB 6.12 / 156.45; Nucleosome GB 0.10 / 2.80. SPFP, ECC off.
• #20: CPU ns/day / GPU ns/day: Trpcage 210 / 585; JAC NVE 12.47 / 89.13; Factor 9 3.42 / 25.4; Cellulose 0.74 / 6.14; Myoglobin 6.12 / 175.77; Nucleosome 0.1 / 3.13.
• #21: Nodes, CPU ns/day, GPU ns/day: 1, 0.65, 3.31; 2, 1.14, 4.13; 4, 2.01, 4.8.
• #22: Energy: TDP (W), sec/ns, energy (kJ): 2x E5-2687 300, 6928, 2078; 2x E5-2687 + K10 535, 1259, 673; 2x E5-2687 + K20 535, 1065, 569; 2x E5-2687 + K20X 535, 969, 518.
  • #24: 1 CPU node (dual CPUs) = 12.47 ns/day1 CPU+ GPU node (dual CPUs and GPUs) = 95.59 ns/day
• #27: Perf lab (columns: no GPU, K10, K20, K20X, 2x K10, 2x K20, 2x K20X). Cellulose, 1 CPU: 0.37, 4.44, 5.4, 6.16, 6.37, 6.93, 7.67. Cellulose, 2 CPU: 0.74, 4.34, 5.39, 6.14, 6.4, 6.78, 7.5.
• #32: ns/day: Dual E5-2687W CPUs 1.370; Dual E5-2687W CPUs + M2090 2.632; + K10 3.448; + K20 3.571; + K20X 4.000.
• #33: CPU ns/day / GPU ns/day: ApoA1 1.370 / 3.571; F1-ATPase 0.461 / 1.124; STMV 0.116 / 0.314. ECC off.
• #34: All numbers are days/ns (columns: ApoA1, ATPase, STMV). CPU only: 0.73, 2.17, 8.64. 1x K10: 0.29, 1.04, 3.5. 1x K20: 0.28, 0.89, 3.18. 1x K20X: 0.25, 0.8, 2.87. 2x K10: 0.16, 0.56, 1.93. 2x K20: 0.15, 0.49, 1.77. 2x K20X: 0.14, 0.45, 1.63.
• #35: Cores: 32, 64, 128, 256, 512, 640, 768. s/step GPU XK6: 1.2414, 0.660887, 0.342743, 0.199465, 0.10837, 0.089752, 0.0774948. s/step CPU XK6: 4.62633, 2.36707, 1.19722, 0.609124, 0.314745, 0.255016, 0.209511. ns/day Fermi XK6: 0.069599, 0.130733, 0.252084, 0.433159, 0.797269, 0.962655, 1.114914. ns/day CPU XK6: 0.018676, 0.036501, 0.072167, 0.141843, 0.274508, 0.338802, 0.412389.
  • #37: Config: TDP sec/ns energy 2x E5-2687W 150 63,072.0 9,460,800.0 2x E5-2687W+ 2x K20 600 24,192.0 14,515,200 TDP = Thermal Design Power
• #38: ns/day, TDP (W), energy: CPU 0.115, 300, 223k; K10s 0.518, 770, 128k; K20s 0.565, 770, 117k; K20Xs 0.613, 770, 108k.
• #42: Loop times: CPU only 382.1; CPU + K10 225; CPU + 2x K10 115.4; 1x K20 154.6; 2x K20 84.2; 1x K20X 130.5; 2x K20X 69.9.
• #44: Config, loop time: 2x X5670 (HP Z800) 2717.630; 1x M2090 (2x X5570) 511.750; 2x M2090 (2x X5570) 274.970; 3x M2090 (2x X5570) 210.430; 4x M2090 (2x X5570) 148.880.
• #45: Nodes: 300, 400, 500, 600, 700, 800, 900. CPU-only time: 563.96, 423.83, 339.62, 281.58, 260.98, 220.83, 203.13. CPU+GPU time: 159.06, 118.62, 96.44, 81.03, 71.57, 63.76, 58.96. GPU speedup ratio: 3.55, 3.57, 3.52, 3.48, 3.65, 3.46, 3.45.
• #46: Nodes, box size, atoms, CPU time, CPU+GPU time, GPU speedup: 1, 1x1x1, 32768, 42.2, 6.33, 6.67x; 8, 2x2x2, 262144, 41.8, 6.73, 6.21x; 27, 3x3x3, 884736, 41.5, 6.86, 6.05x; 64, 4x4x4, 2097152, 41.5, 7.18, 5.78x; 125, 5x5x5, 4096000, 41.4, 7.18, 5.77x; 216, 6x6x6, 7077888, 42, 7.66, 5.48x; 343, 7x7x7, 11239424, 41.9, 8.34, 5.02x; 512, 8x8x8, 16777216, 42.3, 8.41, 5.03x; 729, 9x9x9, 23887872, 42.5, 8.92, 4.76x.
• #47: Power (W), time (s), energy spent (kJ): CPU 300, 382, 114; CPU + 1x K20X 535, 130, 69; CPU + 2x K20X 770, 70, 54.
• #54: ns/day: Single E5-2687W CPU 4.35 (1.0x); Dual E5-2687W CPUs 7.32 (1.7x); Single + M2090 7.33 (1.7x); Dual + M2090 7.54 (1.7x); Single + K10 13.24 (3.0x); Dual + K10 13.24 (3.0x); Single + K20 11.6 (2.7x); Dual + K20 12.26 (2.8x); Single + K20X 11.99 (2.7x); Dual + K20X 12.27 (2.8x).
• #55: Nodes, CPU only, GPU: 1, 2.26, 8.36; 2, 3.58, 13.01; 4, 6.7, 21.68.
• #56: Nodes, CPU, GPU: 8, 6.613, 20.335; 16, 11.282, 37.016; 32, 23.067, 63.876; 64, 42.284, 96.628; 128, 72.694, 144.424.
• #57: nanoseconds/day: 8x X5550 6.72; M2090 + 2x X5550 8.36. CPU node: 4 x 2 x $1000 = $8000. CPU + GPU node: 1 x 2 x $1000 + 2 x $2000 = $6000.
  • #58: GPU: 640 (watts) * 10,334 (seconds/nanosecond) = 6.6 MegaJoulesCPU: 760 (watts) * 12,895 (seconds/nanosecond) = 9.8 MegaJoules
• #64: Configs: 44 CPUs; 2 CPU + 1 GPU; 2 CPU + 2 GPU. Ns/day: 60, 25.3, 42.4. Price: 60000, 3000, 4000.
• #65: Configs: 44 CPUs; 2 CPU + 1 GPU; 2 CPU + 2 GPU. Ns/day: 60, 25.3, 42.4. Price: 60000, 3000, 4000. Scaled price: 1, 0.05, 0.0667. Perf/price: 1, 8.43, 10.6.
• #66: Configs: 64 CPU; 2 CPU + 1 GPU; 2 CPU + 2 GPU. Ns/day: 31, 8.9, 15.1. TDP (W): 6080, 428, 666. sec/ns: 2787.0967, 9707.8651, 5721.8543. Energy/ns (kJ): 16945.548, 4154.966, 3810.755.
• #70: Due to the great interest in speeding up quantum chemistry applications, NVIDIA has begun reaching out to these developers to see how it may assist their development on GPUs. These developers either have active GPU development projects or have released applications.
  • #79: Test case not specified in perf lab run
  • #81: Test case not specified in perf lab run
• #103: I am here today to talk to you about the value of seamlessly adding GPUs to the computer on which you run Quantum Espresso/PWscf and achieving phenomenal performance improvements. This small incremental investment will yield a significant performance payback. What is Quantum Espresso/PWscf: a set of programs used to calculate the electron configuration of atoms or molecules; uses plane-wave basis sets and quantum mechanical principles; highly compute-intensive. Benefits of GPU-accelerated computing: faster than CPU-only systems in all tests; performance boost much larger than the marginal price increase; power consumption more than halved in all simulations; GPUs scale very well on clusters with dozens of nodes, and beyond. Price assumes a FERMI workstation at ~$4000 and a C2050 at $1000. Walltimes, Shilu-3 / Water-on-Calcite: 6 OpenMP CPU nodes 1025 / 1560; 6 OpenMP CPU nodes + 1 GPU 275 / 480. FERMI (ICHEC): assembled workstation; CPU: 2x Intel Xeon X5650 (6-core), 24 GByte RAM; GPU: 2x C2050, GTX480, C2075; SW: CUDA 4.1, Intel compilers.
• #104: Walltimes, AUSURF k-point / gamma point: 6 OpenMP CPU nodes 7100 s / 7000 s; 6 OpenMP CPU nodes + 1 GPU 2350 s / 2000 s. FERMI (ICHEC): assembled workstation; CPU: 2x Intel Xeon X5650 (6-core), 24 GByte RAM; GPU: 2x C2050, GTX480, C2075; SW: CUDA 4.1, Intel compilers.
• #105: CPU: Intel X5550, TDP of 95W, priced at $350. GPU: NVIDIA M2070, TDP of 225W, priced at $2000. PLX (CINECA): IBM iDataPlex DX360M3, 264 GPU nodes; CPU: 2x Intel Westmere X5550 (6-core), 48 GByte RAM; GPU: 2x M2070; SW: CUDA 4.0, Intel compilers, PGI (11.x).
• #106: National average 9.83 cents/kWh. kWh/sim, tests/year, $/test, yearly energy bill: CPU 42.37, 2357, 4.16, $9816; GPU/CPU 23.214, 2400, 2.28, $5476. CPU: Intel X5550, TDP of 95W, priced at $350. GPU: NVIDIA M2070, TDP of 225W, priced at $2000. PLX (CINECA): IBM iDataPlex DX360M3, 264 GPU nodes; CPU: 2x Intel Westmere X5550 (6-core), 48 GByte RAM; GPU: 2x M2070; SW: CUDA 4.0, Intel compilers, PGI (11.x).
• #107: FERMI (ICHEC): assembled workstation. CPU: 2x Intel Xeon X5650 (6-core), 24 GByte RAM. GPU: 2x C2050, GTX480, C2075. SW: CUDA 4.1, Intel compilers.
• #108: Nodes (total CPU cores): 2 (16), 4 (32), 6 (48), 8 (64), 10 (80), 12 (96), 14 (112). Time (s), CPU: 31000, 16500, 11000, 9500, 7500, 6000, 5500. Time (s), GPU+CPU: 12500, 7000, 5000, 4500, 3500, 3000, 2500. Speedup: 2.48, 2.36, 2.2, 2.11, 2.14, 2, 2.2. STONEY (ICHEC): Bull Novascale R422-E2, 24 GPU nodes; CPU: 2x Intel (Nehalem EP) Xeon X5560, 48 GByte RAM; GPU: 2x M2090; SW: CUDA 4.0, Intel compilers.
• #109: Nodes (total CPU cores): 4 (48), 8 (96), 12 (144), 16 (192), 24 (288), 32 (384), 44 (528). Time (s), CPU: 3925, 2650, 2525, 2450, 1740, 1290, 1337. Time (s), GPU+CPU: 2425, 1437, 1075, 900, 737, 675, 637. Speedup: 1.62, 1.84, 2.35, 2.72, 2.36, 1.91, 2.10. PLX (CINECA): IBM iDataPlex DX360M3, 264 GPU nodes; CPU: 2x Intel Westmere (6-core), 48 GByte RAM; GPU: 2x M2070; SW: CUDA 4.0, Intel compilers, PGI (11.x).
• #111: I am here today to talk to you about the value of seamlessly adding GPUs to the computer on which you run TeraChem and achieving phenomenal performance improvements. This small incremental investment will yield a significant performance payback. Benefits of GPU acceleration with TeraChem: compete with supercomputers; more powerful hardware; significantly lower energy usage.
• #120: Before we end this session I would like to tell you about GPU Test Drive. It is an excellent resource for computational chemistry researchers such as yourself to evaluate the benefits of GPU computing in speeding up your simulations. Most importantly, it is free. NVIDIA, along with its partners, is offering access to a remotely hosted GPU cluster. You can run applications such as AMBER and NAMD to find out how your models speed up. You can also try code that you have developed to run on GPUs and see how it scales on an 8-GPU cluster. All you need to do is sign up and log in; it is really that easy! We have several partners demonstrating the GPU Test Drive on the GTC show floor; please plan on visiting them. Sign-up forms have been given out. If you are interested, please fill them out and return them to me.