Accelerating S3D: A GPGPU Case Study
Kyle Spafford^{a,∗}, Jeremy Meredith^{a}, Jeffrey Vetter^{a}, Jacqueline Chen^{b}, Ray Grout^{b}, Ramanan Sankaran^{a}

^{a} Oak Ridge National Laboratory, 1 Bethel Valley Road MS 6173, Oak Ridge, TN 37831
^{b} Combustion Research Facility, Sandia National Laboratories, Livermore, CA 94551
Abstract
The graphics processor (GPU) has evolved into an appealing choice for high
performance computing due to its superior memory bandwidth, raw process-
ing power, and flexible programmability. As such, GPUs represent an excel-
lent platform for accelerating scientific applications. This paper explores a
methodology for identifying applications which present significant potential
for acceleration. In particular, this work focuses on experiences from accel-
erating S3D, a high-fidelity turbulent reacting flow solver. The acceleration
process is examined from a holistic viewpoint, and includes details that arise
from different phases of the conversion. This paper also addresses the issue
of floating point accuracy and precision on the GPU, a topic of immense
importance to scientific computing. Several performance experiments are
conducted, and results are presented from the NVIDIA Tesla C1060 GPU.
We generalize from our experiences to provide a roadmap for deploying ex-
isting scientific applications on heterogeneous GPU platforms.
Keywords: Graphics Processors, Heterogeneous Computing,
Computational Chemistry
1. Introduction
Strong market forces from the gaming industry and increased demand for
high definition, real-time 3D graphics have been the driving forces behind
the GPU’s incredible transformation. Over the past several years, increases
in the memory bandwidth and the speed of floating point computation of
∗ Corresponding author. Email address: spaffordkl@ornl.gov (Kyle Spafford)
Preprint submitted to Parallel Computing December 8, 2009
GPUs have steadily outpaced those of CPUs. In a relatively short period
of time, the GPU has evolved from an arcane, highly-specialized hardware
component into a remarkably flexible and powerful parallel coprocessor.
1.1. GPU Hardware
Originally, GPUs were designed to perform a limited collection of opera-
tions on a large volume of independent geometric data. These operations fell
into only two main categories (vertex and fragment) and were highly paral-
lel and computationally intense, resulting in a highly specialized design with
multiple cores and small caches. As graphical tasks became more diverse, the
demand for flexibility began to influence GPU designs. GPUs transitioned from a fixed-function design, to one which allowed limited programmability of the two specialized pipelines, and eventually to an approach in which all cores are of a unified, more flexible type, supporting much greater control from the programmer.
The NVIDIA Tesla C1060 GPU platform was designed specifically for high performance computing. It boasts an impressive thirty streaming multiprocessors, each composed of eight stream processors, for a total of 240 processor cores running at 1.3 GHz. Each multiprocessor has 16 KB of shared mem-
ory, which can be accessed as quickly as a register if managed properly. The
Tesla C1060 has 4GB of global memory, as well as supplementary cached con-
stant and texture memory. Perhaps the most exciting feature of the Tesla
is its support for native double precision floating point operations, which
are tremendously important for scientific computing. Single precision com-
putations were sufficient for the graphics computations which GPUs were
initially intended to solve, and double precision, a relatively new feature in
GPUs, is dramatically slower than single precision. In order to achieve high
performance on GPUs, one must use careful memory management and ex-
ploit hardware specific features. This otherwise daunting task is simplified
by CUDA.
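As a concrete illustration (not part of the original study), the device characteristics quoted above can be verified at runtime through the CUDA runtime API; the following minimal sketch queries device 0:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);            // query device 0
    printf("%s: %d multiprocessors, %zu KB shared memory per block\n",
           prop.name, prop.multiProcessorCount, prop.sharedMemPerBlock >> 10);
    printf("clock: %.2f GHz, global memory: %zu MB\n",
           prop.clockRate / 1.0e6, prop.totalGlobalMem >> 20);
    return 0;
}

On a Tesla C1060 this would report thirty multiprocessors, 16 KB of shared memory per block, and 4 GB of global memory, matching the description above.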
1.2. CUDA
The striking performance numbers of modern GPUs have resulted in a
surge of interest in general-purpose computation on graphics processing units
(GPGPU). GPGPU represents an inexpensive and power-efficient alternative
to more traditional HPC platforms. In the past, there was a substantial learning curve associated with GPGPU, and expert knowledge was required to attain impressive performance. This involved extensive modification of
traditional approaches in order to effectively scale to the large number of cores
per GPU. However, as the flexibility of the GPU has increased, there has been
a welcome decrease in the learning curve associated with the porting process.
In this study, we utilize NVIDIA’s Compute Unified Device Architecture
(CUDA), a parallel programming model and software environment. CUDA
exposes the power of the GPU to the programmer through a set of high level
language extensions, allowing for existing scientific codes to be more easily
transformed into GPU compatible applications.
Figure 1: CUDA Programming Model – Image from NVIDIA CUDA Programming
Guide[1].
1.2.1. Programming Model
While a full introduction to CUDA is beyond the scope of this paper, this
section mentions the basic concepts required to understand the scope of the
parallelism involved. CUDA views the GPU as a highly parallel coprocessor.
Functions called kernels are composed of a large number of threads, which are organized into blocks. A group of blocks is known as a grid (see Figure 1). Blocks contain a fast shared memory that is only available to threads
which belong to the block, while grids have access to the global GPU mem-
ory. Typical kernel launches involve one grid, which is composed of hundreds
or thousands of individual threads, a much higher degree of parallelism than
normally occurs with traditional parallel approaches on the CPU. This high
degree of parallelism and unique memory architecture have drastic conse-
quences for performance, which will be explored in a later section.
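As a hypothetical illustration of this hierarchy (not code from S3D), the sketch below launches a grid of blocks in which each thread processes one array element:

// Each CUDA thread computes one element; threads are grouped into blocks,
// and the set of blocks launched below forms the grid.
__global__ void scale(float *data, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        data[i] *= alpha;
}

// Host-side launch: (n + 255) / 256 blocks of 256 threads each -- easily
// thousands of threads, the degree of parallelism described above:
// scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);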
1.3. Domain and Algorithm Description
S3D is a massively parallel direct numerical solver (DNS) for the full com-
pressible Navier-Stokes, total energy, species and mass continuity equations
coupled with detailed chemistry[2, 3]. It is based on a high-order accurate,
non-dissipative numerical scheme solved on a three-dimensional structured
Cartesian mesh. Spatial differentiation is achieved through eighth-order fi-
nite differences along with tenth-order filters to damp any spurious oscilla-
tions in the solution. The differentiation and filtering require nine and eleven
point centered stencils, respectively. Time advancement is achieved through
a six-stage, fourth-order explicit Runge-Kutta (R-K) method. Navier-Stokes
characteristic boundary condition (NSCBC) treatment[4, 5, 6] is used on the
boundaries.
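The paper does not reproduce the stencil coefficients; for reference, the standard eighth-order central difference for a first derivative on a uniform mesh of spacing $h$ uses the nine-point stencil

$$ f'_i \approx \frac{1}{h}\left[ \frac{4}{5}\left(f_{i+1}-f_{i-1}\right) - \frac{1}{5}\left(f_{i+2}-f_{i-2}\right) + \frac{4}{105}\left(f_{i+3}-f_{i-3}\right) - \frac{1}{280}\left(f_{i+4}-f_{i-4}\right) \right], $$

though S3D's exact interior and boundary coefficients may differ.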
Fully coupled mass conservation equations for the different chemical species
are solved as part of the simulation to obtain the chemical state of the sys-
tem. Detailed chemical kinetics and molecular transport models are used. An
optimized and fine-tuned library has been developed to compute the chemi-
cal reaction and species diffusion rates based on Sandia’s Chemkin package.
While Chemkin-standard chemistry and transport models are readily usable
with S3D, special attention is paid to the efficiency and performance of the
chemical models. Reduced chemical and transport models that are fine-tuned to the target problem are developed as a pre-processing step.
S3D is written entirely in Fortran. It is parallelized using a three-dimen-
sional domain decomposition and MPI communication. Each MPI process
is responsible for a piece of the three-dimensional domain. All MPI pro-
cesses have the same number of grid points and the same computational
load. Inter-processor communication is only between nearest neighbors in
a three-dimensional topology. A ghost-zone is constructed at the processor
boundaries by non-blocking MPI sends and receives among the nearest neigh-
bors in the three-dimensional processor topology. Global communications are
only required for monitoring and synchronization ahead of I/O.
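As a sketch of this communication pattern (written in C rather than S3D's Fortran, with illustrative buffer and rank names), a non-blocking ghost-zone exchange along one dimension might look like:

#include <mpi.h>

/* Exchange ghost zones with the two nearest neighbors along one dimension;
   a full decomposition repeats this for each of the three dimensions. */
void exchange_ghosts(double *send_lo, double *recv_lo,
                     double *send_hi, double *recv_hi,
                     int count, int lo, int hi, MPI_Comm comm) {
    MPI_Request req[4];
    MPI_Irecv(recv_lo, count, MPI_DOUBLE, lo, 0, comm, &req[0]);
    MPI_Irecv(recv_hi, count, MPI_DOUBLE, hi, 1, comm, &req[1]);
    MPI_Isend(send_hi, count, MPI_DOUBLE, hi, 0, comm, &req[2]);
    MPI_Isend(send_lo, count, MPI_DOUBLE, lo, 1, comm, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);  /* post all, then wait */
}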
S3D’s performance has been studied and optimized, including I/O[7] and
control flow[8]. Still, further improvements allow for increased grid size, more
simulation timesteps, and more species equations. These are critical to the
scientific goals of turbulent combustion simulations in that they help achieve
higher Reynolds numbers, better statistics through larger ensembles, more
complete temporal development of a turbulent flame, and the simulation of
fuels with greater chemical complexity.
Here we assess S3D code performance and parallel scaling through simu-
lation of a small amplitude pressure wave propagating through the domain
for a short period of time. The test is conducted with detailed ethylene-air
(C2H4) chemistry consisting of twenty-two chemical species and a mixture-averaged molecular transport model. Due to the detailed chemical model,
the code solves for twenty-two species equations in addition to the five fluid
dynamic variables.
1.4. Related Work
Recent work by a number of researchers has investigated GPGPU with
impressive results in a variety of domains. Owens et al. provide an excellent
history of the GPU [9], chronicling its transformation in great detail. It is
not uncommon to find researchers who achieve at least an order of magnitude
improvement over reference implementations. GPUs have been used to accel-
erate a variety of application kernels, including more traditional operations
like dense[10, 11, 12] and sparse[13] linear algebra as well as scatter-gather
techniques[14]. The GPU has been successfully applied to a wide variety of
fields including computational biophysics[15], molecular dynamics[16], and
medical imaging[17, 18]. Our work takes a slightly higher level approach.
While we do present performance measurements from an accelerated version
of S3D, we examine the acceleration process as a whole, and endeavor to
answer why certain applications perform so well on GPUs, while others fail
to achieve significant performance improvements.
2. Identifying Candidates for Acceleration
2.1. Profiling
The first step in evaluating a scientific application for acceleration is to identify its performance bottlenecks. The best-case scenario involves a
small number of computationally intense functions which comprise most of
the runtime. This is a fairly basic requirement and is a direct consequence
of Amdahl’s law. The CPU-based profiling tool TAU identified S3D’s getrates kernel as a major bottleneck[19]. This kernel involves calculating the
rates of chemical reactions occurring in the simulation at each point in space.
This computation represents about half of the total runtime with the cur-
rent chemistry model. As S3D’s chemical model becomes more complex, we
anticipate that the getrates kernel will more strongly dominate runtime. The greater the kernel’s share of total runtime, the greater the potential for application speedup. Therefore, the first kernel to be examined should
be the most time consuming.
2.2. Parallelism and Data Dependency
One of the main advantages of the GPU is the high number of proces-
sors, so it follows that kernels must exhibit a high degree of parallelism to be
successful on a heterogeneous GPU platform. While this can correspond to
task-based parallelism, GPUs have primarily been used for data-parallel op-
erations. This makes it difficult for GPUs to handle unstructured kernels, or
those with intricate patterns of data dependency. Indeed, in situations with
irregular control flow, individual threads can become serialized, which results
in performance loss. Since the memory architecture of a GPU is dramatically different from that of most CPUs, memory access times can differ by several orders of magnitude based on access pattern and type of memory. For example, on the Tesla, an access to shared block memory is two orders of magnitude faster than an access to global memory. Therefore, kernels must often be
chosen based on memory access pattern, or restructured such that memory
access is more uniform in nature. In S3D, the getrates kernel operates on
a regular three dimensional mesh, so access patterns are fairly uniform, an
easy case for the GPU.
The following pseudocode outlines the general structure of the sequential
getrates kernel. The outer three loops can be computed in parallel, since
points in the mesh are independent.
for x = 1 to length
  for y = 1 to length
    for z = 1 to length
      for n = 1 to nspecies
        grid[x][y][z][n] = F(grid[x][y][z][1:nspecies])
where length refers to the length of an edge of the cube, nspecies refers to the number of chemical species involved, and the function F is an abstraction of the more complex chemical computations.
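A minimal sketch of how this loop nest maps onto CUDA, with one thread per mesh point; the stub F and the flattened array layout are illustrative assumptions, not S3D's actual implementation:

#define MAX_SPECIES 32   // comfortably above the 22 species used here

// Stand-in for the chemistry abstracted as F above.
__device__ double F(const double *y, int nspecies, int n) {
    return y[n];         // stub; the real rate computation is far larger
}

__global__ void getrates_gpu(double *grid, int length, int nspecies) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per point
    int npts = length * length * length;
    if (tid >= npts) return;
    double y[MAX_SPECIES];
    for (int n = 0; n < nspecies; ++n)       // snapshot this point's species
        y[n] = grid[(size_t)tid * nspecies + n];
    for (int n = 0; n < nspecies; ++n)       // then compute each rate
        grid[(size_t)tid * nspecies + n] = F(y, nspecies, n);
}
// e.g.: getrates_gpu<<<(npts + 63) / 64, 64>>>(d_grid, length, 22);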
In addition to the innate parallelism of GPUs, the system’s intercon-
nection bus can also have serious performance consequences for accelerated
applications. Discrete GPUs are often connected via a PCI-e bus, which
introduces a substantial amount of latency into computations. This makes
GPUs more effective at problems in which bandwidth is much more impor-
tant than latency or those which have a high ratio of computation to data.
In these cases, the speedup in the calculations or the increased throughput
is sufficient to overcome performance costs associated with transferring data
across the bus. In ideal cases, a large amount of data can saturate the bus and amortize the associated startup costs. In effect, this overlaps communication time with computation.
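A hedged sketch of this overlap using CUDA streams; the chunking scheme, kernel, and buffer names are illustrative assumptions, and h_buf must be pinned host memory (e.g. from cudaMallocHost) for the copies to proceed asynchronously:

__global__ void process(double *chunk, int n);        // placeholder kernel

void pipeline(const double *h_buf, double *d_buf, int nchunks, int chunk) {
    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);
    for (int c = 0; c < nchunks; ++c) {
        cudaStream_t st = s[c % 2];                   // alternate two streams
        size_t off = (size_t)c * chunk;
        cudaMemcpyAsync(d_buf + off, h_buf + off, chunk * sizeof(double),
                        cudaMemcpyHostToDevice, st);  // copy chunk c in while
        process<<<(chunk + 127) / 128, 128, 0, st>>>(d_buf + off, chunk);
    }                                                 // the other stream computes
    cudaDeviceSynchronize();                          // drain both streams
    for (int i = 0; i < 2; ++i) cudaStreamDestroy(s[i]);
}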
3. Results and Discussion
3.1. Kernel Acceleration
Once a suitable portion of the application has been identified, the ac-
celeration process can begin. Parallel programming is inherently more diffi-
cult than sequential programming, and developing high performance code for
GPUs also incorporates complexity from architectural features. This “mem-
ory aware” programming environment grants the programmer control over
low level memory movement, but demands meticulous data orchestration to
maximize performance.
For S3D, the mapping between the getrates kernel and CUDA concepts is
fairly simple. Since getrates operates on a regular, three-dimensional mesh,
each point in the mesh is handled by a single thread. A block is composed
of a local region of the mesh. Block size varies between 64 and 128, based
on the available number of registers per GPU core, in order to maximize
occupancy.
During the development of the accelerated version of the getrates kernel,
the memory access pattern was the most important factor for performance.
When threads read or write memory in a highly parallel fashion, CUDA
coalesces the memory access into a single operation, which has a dramatic
and beneficial effect on performance. The optimized versions of the getrates
kernel also use batched memory transfers and exploit block shared memory.
This attention to detail pays off: for a single iteration of the kernel, the accelerated versions of getrates exhibit promising speedups over the serial CPU version, up to 31.4x in single precision and 17.0x in double precision (see Figure 2). The serial CPU version was measured on a 2.3 GHz quad-core AMD Opteron processor with 16 GB of memory.
Figure 2: Accelerated Kernel Results
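To make the coalescing point concrete, the pair of toy kernels below (illustrative, not from S3D) contrasts the two access patterns; on the C1060 generation, the unit-stride version allows a half-warp's loads to be combined into a few wide transactions, while the strided version does not:

// Unit stride: thread i touches word i, so adjacent threads read adjacent
// words and the hardware coalesces the accesses.
__global__ void copy_coalesced(const double *in, double *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: adjacent threads read words far apart (in must hold n * stride
// elements), so each load becomes its own memory transaction.
__global__ void copy_strided(const double *in, double *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(size_t)i * stride];
}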
3.2. Accuracy
While the evolution of the GPU has been remarkable, architectural rem-
nants of its original, specialized function remain. Perhaps the most relevant
of these to the scientific community is the bias towards single precision float-
ing point computations. Single precision arithmetic was sufficient for the
GPU’s original tasks (rasterization, etc.). GPU benchmarking traditionally
involved only these single precision computations, and performance demands
have clearly shaped the GPU’s allocation of hardware resources. Many GPUs are incapable of double precision, and those that are capable typically pay a high performance cost. This cost generally arises from the differing number of
floating point units, and it is almost always more than the performance dif-
ference between single and double precision on a traditional CPU. In S3D,
the cost can clearly be seen in the performance difference in the single versus
double precision versions of the getrates kernel.
From a performance standpoint, single precision computations are favor-
able compared to double precision, but the computations in scientific ap-
plications can be extremely sensitive to accuracy. Moreover, some double precision operations are not equivalent on the CPU and the GPU. GPUs
may sacrifice fully IEEE compliant floating point operations for greater per-
formance. For example, scientific applications frequently make extensive use
of transcendental functions (sin, cos, etc.), and the Tesla’s hardware intrinsics
for these functions are faster, but less accurate than their CPU counterparts.
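For example, in CUDA the standard single precision calls can be swapped for hardware intrinsics, trading accuracy for speed; a hypothetical fragment:

__global__ void eval(const float *x, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    out[i] = expf(x[i]) * sinf(x[i]);          // standard: slower, more accurate
    // out[i] = __expf(x[i]) * __sinf(x[i]);   // intrinsics: faster, reduced
                                               // accuracy; nvcc -use_fast_math
                                               // applies this substitution
                                               // globally
}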
3.2.1. Accuracy in S3D
In S3D, the reaction rates calculated by the getrates kernel are integrated
over time as the simulation progresses, and error from inaccurate reaction
rates compounds and propagates to other simulation variables. While this is
the first comparison of double and single precision versions of S3D, the issue
of accuracy has been previously studied, and some upper bounds for error are
known. S3D has an internal monitor for the estimated error from integration,
and can take smaller timesteps in an effort to improve accuracy. Figure 3
shows the estimated error from integration versus simulation time. In this
graph, the CPU and GPU DP versions quickly begin to agree, while the single
precision version is much more erratic. In both double precision versions, the
internal mechanism for timestep control succeeds in settling on a timestep of
appropriate size. The single precision version has a much weaker guarantee
on accuracy, and the monitor has a difficult time controlling the timestep,
oscillating between large timesteps with high error (sometimes beyond the
acceptable bounds), and short timesteps with very low error. The increased
number of timesteps required by the GPU single precision version will have
consequences for performance, which will be explored in a later section.
The error from low precision can also be observed in simulation variables
such as temperature (see Figure 4) or in chemical species, such as H2O2 (see Figure 5). The current test essentially simulates a rapid ignition, and a significant time gap can be seen between the rapid rise in temperature in the GPU single precision version and the other versions. In the sensitive
time scale of ignition, this gap represents a serious error. In Figure 5, the
error is much more pronounced, as the single precision version fails to predict
the sudden decrease in H2O2 which occurs roughly at time 4.00E-04.
A similar trend can be observed throughout many different simulation
variables in S3D. The CPU version tends to agree almost perfectly with
the GPU double precision version, while the single precision version deviates
Figure 3: Estimated Integrated Error. 1.00E-03 is the upper bound on acceptable error.
The GPU DP and CPU versions completely overlap beginning roughly at time 4.00E-04.
substantially. Consequently, while the single precision version is much faster,
it may be insufficient for sensitive simulations.
3.3. S3D Performance Results
In an ideal setting, the chosen kernel would strongly dominate the runtime
of the application. However, in S3D, the getrates kernel comprises roughly
half of the total runtime, with some variation based on problem size. Table 1
shows how speedup in the getrates kernel scales to whole-code performance
improvements. Amdahl’s limit is the theoretical upper bound on speedup,
$s_\infty \approx \frac{1}{1 - f_a}$, where $f_a$ is the fraction of runtime that is accelerated.
Table 1: Performance results (S = single precision, D = double precision)

Size   Kernel Speedup        % of     Amdahl's   Actual Speedup
       S          D          Total    Limit      S         D
32     29.50x     14.98x     50.0%    2.00x      1.90x     1.84x
48     31.44x     16.97x     51.0%    2.04x      1.91x     1.87x
60     31.40x     16.08x     52.5%    2.11x      1.95x     1.90x
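As a sanity check on Table 1, the finite-speedup form of Amdahl's law for the size-48 single precision case gives

$$ s = \frac{1}{(1 - f_a) + f_a / s_k} = \frac{1}{0.49 + 0.51 / 31.44} \approx 1.98, $$

slightly above the measured 1.91x; the remaining gap plausibly reflects overheads, such as host-device data transfer, that the kernel-only speedup excludes.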
Figure 4: Simulation temperature. Note the time gap in the rise in temperature at roughly time 3.00E-04. This corresponds to a delay in the prediction of ignition time.
In S3D, there is a complex relationship between performance and accu-
racy. When inaccuracy is detected, timestep size is reduced in an attempt to decrease error (see Figure 6). Since single precision is less accurate, one
can see erratic timestep sizes. This means that given the same number of
timesteps, a highly accurate computation can simulate more time. In order
to truly measure performance, it is important to normalize the wallclock time
to account for this effect. In Table 2, normalized cost is the wallclock time it
takes to simulate one nanosecond at one point in space. While the getrates
kernel can be executed faster in single precision, the lack of accuracy causes
the simulation to take very small timesteps. In some cases (typically very
long simulations), the loss of accuracy in single precision calculations causes
the total amount of simulated time to decrease, potentially eliminating any
performance benefits.
As mentioned in Section 1.3, S3D is distributed using MPI. The domain
sizes listed in Table 1 are representative of the work done by a single node
in a production level simulation. As such, it is important to characterize
the scaling behavior of the accelerated version. Figures 7 and 8 present
parallel speedup and efficiency results from the Lens cluster. The Lens cluster
Figure 5: Chemical Species H2O2. The CPU and GPU DP versions completely agree, while the GPU SP version significantly deviates, and fails to identify the dip at time 4.00E-04.
is made up of 32 nodes, with each node containing four quad-core AMD
Opteron processors, 64 GB of memory, and two GPUs: one Tesla C1060 and
one GeForce 8800GTX. In our experiments, we do not utilize the GeForce
8800GTX because it lacks the ability to perform double precision operations.
The accelerated version of S3D exhibited classic weak scaling, with parallel
efficiency ranging between 84% and 98%.
4. Conclusions
Graphics processors are rapidly emerging as a viable platform for high
performance scientific computing. Improvements in the programming en-
vironments and libraries for these devices are making them an appealing,
cost-effective way to increase application performance. While the popularity
of these devices has surged, GPUs may not be appropriate for all applica-
tions. They offer the greatest benefit to applications with well structured,
data-parallel kernels. Our study has described the strengths of GPUs, and
provided insights from our experience in accelerating S3D. We have also
Table 2: Performance results. Normalized cost is the average time it takes to simulate a single point in space for one nanosecond.

Size   Normalized Cost (microseconds)
       CPU     GPU DP    GPU SP
32     12.3    6.67      6.47
48     12.9    7.30      6.98
60     12.0    6.31      6.12
examined one of the most important aspects of GPUs for the scientific community: accuracy. The differences in accuracy between GPU and IEEE arith-
metic resulted in drastic consequences for correctness in S3D. Despite this
relative weakness, the heterogeneous GPU version of the kernel still manages
to outperform the more traditional CPU version and produce high quality
results in a real scientific application.
[1] NVIDIA, CUDA Programming Guide 2.3 Downloaded June 1, 2009,
www.nvidia.com/object/cudadevelop.html.
[2] E. R. Hawkes, R. Sankaran, J. C. Sutherland, J. H. Chen, Direct numer-
ical simulation of turbulent combustion: fundamental insights towards
predictive models, Journal of Physics: Conference Series 16 (2005) 65–
79.
[3] J. C. Sutherland, Evaluation of mixing and reaction models for large-
eddy simulation of nonpremixed combustion using direct numerical sim-
ulation, PhD thesis, Dept. of Chemical and Fuels Engineering, University of Utah.
[4] T. J. Poinsot, S. K. Lele, Boundary-conditions for direct simulations of
compressible viscous flows, Journal of Computational Physics 101 (1992)
104–129.
[5] J. C. Sutherland, C. A. Kennedy, Improved boundary conditions for
viscous, reacting, compressible flows, Journal of Computational Physics
191 (2003) 502–524.
[6] C. S. Yoo, Y. Wang, A. Trouve, H. G. Im, Characteristic boundary
conditions for direct simulations of turbulent counterflow flames, Com-
bustion Theory and Modelling 9 (2005) 617–646.
Figure 6: Timestep Size – This graph shows the size of the timesteps taken as the rapid
ignition simulation progressed. S3D reduces the timestep size when it detects integra-
tion inaccuracy. While the double precision versions take timesteps of roughly equivalent
size, the single precision version quickly reduces timestep size in an attempt to preserve
accuracy.
[7] W. Yu, J. Vetter, H. Oral, Performance characterization and optimization of parallel I/O on the Cray XT, in: IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2008), 2008, pp. 1–11.
[8] J. Mellor-Crummey, Harnessing the power of emerging petascale plat-
forms, Journal of Physics: Conference Series 78 (1) (2007) 12–48.
[9] J. Owens, M. Houston, D. Luebke, S. Green, J. Stone, J. Phillips, GPU
computing, Proceedings of the IEEE 96 (5) (2008) 879–899.
[10] S. Barrachina, M. Castillo, F. Igual, R. Mayo, Evaluation and tuning of the level 3 CUBLAS for graphics processors, in: IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2008), 2008, pp. 1–8.
[11] N. Fujimoto, Faster matrix-vector multiplication on GeForce 8800GTX, in: IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2008), 2008, pp. 1–8.
Figure 7: GPU Parallel Speedup – This graph characterizes the parallel scaling of the
accelerated version of S3D. As the number of nodes increases, both the single and double
precision versions exhibit proportional increases in performance.
[12] G. Cummins, R. Adams, T. Newell, Scientific computation through a GPU, in: IEEE SoutheastCon 2008, 2008, pp. 244–246.
[13] J. Bolz, I. Farmer, E. Grinspun, P. Schröder, Sparse matrix solvers on
the GPU: conjugate gradients and multigrid, in: SIGGRAPH ’03: ACM
SIGGRAPH 2003 Papers, 2003, pp. 917–924.
[14] B. He, N. K. Govindaraju, Q. Luo, B. Smith, Efficient gather and scatter
operations on graphics processors, in: SC ’07: Proceedings of the 2007
ACM/IEEE conference on Supercomputing, 2007, pp. 1–12.
[15] J. E. Stone, J. C. Phillips, P. L. Freddolino, D. J. Hardy, L. G. Trabuco,
K. Schulten, Accelerating molecular modeling applications with graphics
processors, Journal of Computational Chemistry 28 (2005) 2618–2640.
[16] C. I. Rodrigues, D. J. Hardy, J. E. Stone, K. Schulten, W.-M. W. Hwu,
GPU acceleration of cutoff pair potentials for molecular modeling appli-
cations, in: CF ’08: Proceedings of the 2008 conference on Computing
frontiers, 2008, pp. 273–282.
Figure 8: GPU Parallel Efficiency – This graph shows the parallel efficiency (parallel
speedup divided by the number of processors) for the accelerated versions of S3D.
[17] J. Kruger, R. Westermann, Acceleration techniques for GPU-based volume rendering, in: IEEE Visualization (VIS 2003), 2003, pp. 287–292.
[18] K. Mueller, F. Xu, Practical considerations for GPU-accelerated CT, in: 3rd IEEE International Symposium on Biomedical Imaging: Nano to Macro, 2006, pp. 1184–1187.
[19] S. Shende, A. D. Malony, J. Cuny, P. Beckman, S. Karmesin, K. Lindlan,
Portable profiling and tracing for parallel, scientific applications using
C++, in: SPDT ’98: Proceedings of the SIGMETRICS symposium on
Parallel and distributed tools, 1998, pp. 134–145.
