Accelerating S3D: A GPGPU Case Study
Kyle Spafford^{a,∗}, Jeremy Meredith^{a}, Jeffrey Vetter^{a}, Jacqueline Chen^{b}, Ray Grout^{b}, Ramanan Sankaran^{a}

^{a} Oak Ridge National Laboratory, 1 Bethel Valley Road MS 6173, Oak Ridge, TN 37831
^{b} Combustion Research Facility, Sandia National Laboratories, Livermore, CA 94551
Abstract
The graphics processor (GPU) has evolved into an appealing choice for high
performance computing due to its superior memory bandwidth, raw process-
ing power, and flexible programmability. As such, GPUs represent an excel-
lent platform for accelerating scientific applications. This paper explores a
methodology for identifying applications which present significant potential
for acceleration. In particular, this work focuses on experiences from accel-
erating S3D, a high-fidelity turbulent reacting flow solver. The acceleration
process is examined from a holistic viewpoint, and includes details that arise
from different phases of the conversion. This paper also addresses the issue
of floating point accuracy and precision on the GPU, a topic of immense
importance to scientific computing. Several performance experiments are
conducted, and results are presented from the NVIDIA Tesla C1060 GPU.
We generalize from our experiences to provide a roadmap for deploying ex-
isting scientific applications on heterogeneous GPU platforms.
Keywords: Graphics Processors, Heterogeneous Computing,
Computational Chemistry
1. Introduction
Strong market forces from the gaming industry and increased demand for
high definition, real-time 3D graphics have been the driving forces behind
the GPU’s incredible transformation. Over the past several years, increases
in the memory bandwidth and the speed of floating point computation of
∗ Corresponding author. Email address: spaffordkl@ornl.gov (Kyle Spafford)
Preprint submitted to Parallel Computing December 8, 2009
GPUs have steadily outpaced those of CPUs. In a relatively short period
of time, the GPU has evolved from an arcane, highly-specialized hardware
component into a remarkably flexible and powerful parallel coprocessor.
1.1. GPU Hardware
Originally, GPUs were designed to perform a limited collection of opera-
tions on a large volume of independent geometric data. These operations fell
into only two main categories (vertex and fragment) and were highly paral-
lel and computationally intense, resulting in a highly specialized design with
multiple cores and small caches. As graphical tasks became more diverse, the
demand for flexibility began to influence GPU designs. GPUs transitioned from a fixed-function design, to one which allowed limited programmability of the two specialized pipelines, and eventually to an approach in which all cores are of a unified, more flexible type, supporting much greater control from the programmer.
The NVIDIA Tesla C1060 GPU platform was designed specifically for high performance computing. It boasts an impressive thirty streaming multiprocessors, each composed of eight stream processors, for a total of 240 processor cores running at 1.3 GHz. Each multiprocessor has 16 KB of shared mem-
ory, which can be accessed as quickly as a register if managed properly. The
Tesla C1060 has 4GB of global memory, as well as supplementary cached con-
stant and texture memory. Perhaps the most exciting feature of the Tesla
is its support for native double precision floating point operations, which
are tremendously important for scientific computing. Single precision com-
putations were sufficient for the graphics computations which GPUs were
initially intended to solve, and double precision, a relatively new feature in
GPUs, is dramatically slower than single precision. In order to achieve high
performance on GPUs, one must use careful memory management and ex-
ploit hardware specific features. This otherwise daunting task is simplified
by CUDA.
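As a concrete illustration (not part of the original study), the device characteristics quoted above can be verified at runtime through the CUDA runtime API; the following minimal sketch queries device 0:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);            // query device 0
    printf("%s: %d multiprocessors, %zu KB shared memory per block\n",
           prop.name, prop.multiProcessorCount, prop.sharedMemPerBlock >> 10);
    printf("clock: %.2f GHz, global memory: %zu MB\n",
           prop.clockRate / 1.0e6, prop.totalGlobalMem >> 20);
    return 0;
}

On a Tesla C1060 this would report thirty multiprocessors, 16 KB of shared memory per block, and 4 GB of global memory, matching the description above.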
1.2. CUDA
The striking performance numbers of modern GPUs have resulted in a
surge of interest in general-purpose computation on graphics processing units
(GPGPU). GPGPU represents an inexpensive and power-efficient alternative
to more traditional HPC platforms. In the past, there was a substantial learning curve associated with GPGPU, and expert knowledge was required to attain impressive performance. This involved extensive modification of
traditional approaches in order to effectively scale to the large number of cores
per GPU. However, as the flexibility of the GPU has increased, there has been
a welcome decrease in the learning curve associated with the porting process.
In this study, we utilize NVIDIA’s Compute Unified Device Architecture
(CUDA), a parallel programming model and software environment. CUDA
exposes the power of the GPU to the programmer through a set of high level
language extensions, allowing for existing scientific codes to be more easily
transformed into GPU compatible applications.
Figure 1: CUDA Programming Model – Image from NVIDIA CUDA Programming
Guide[1].
1.2.1. Programming Model
While a full introduction to CUDA is beyond the scope of this paper, this
section mentions the basic concepts required to understand the scope of the
parallelism involved. CUDA views the GPU as a highly parallel coprocessor.
Functions called kernels are composed of a large number of threads, which are organized into blocks. A group of blocks is known as a grid (see Figure 1). Blocks contain a fast shared memory that is only available to threads
which belong to the block, while grids have access to the global GPU mem-
ory. Typical kernel launches involve one grid, which is composed of hundreds
or thousands of individual threads, a much higher degree of parallelism than
normally occurs with traditional parallel approaches on the CPU. This high
degree of parallelism and unique memory architecture have drastic conse-
quences for performance, which will be explored in a later section.
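As a hypothetical illustration of this hierarchy (not code from S3D), the sketch below launches a grid of blocks in which each thread processes one array element:

// Each CUDA thread computes one element; threads are grouped into blocks,
// and the set of blocks launched below forms the grid.
__global__ void scale(float *data, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        data[i] *= alpha;
}

// Host-side launch: (n + 255) / 256 blocks of 256 threads each -- easily
// thousands of threads, the degree of parallelism described above:
// scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);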
1.3. Domain and Algorithm Description
S3D is a massively parallel direct numerical solver (DNS) for the full com-
pressible Navier-Stokes, total energy, species and mass continuity equations
coupled with detailed chemistry[2, 3]. It is based on a high-order accurate,
non-dissipative numerical scheme solved on a three-dimensional structured
Cartesian mesh. Spatial differentiation is achieved through eighth-order fi-
nite differences along with tenth-order filters to damp any spurious oscilla-
tions in the solution. The differentiation and filtering require nine and eleven
point centered stencils, respectively. Time advancement is achieved through
a six-stage, fourth-order explicit Runge-Kutta (R-K) method. Navier-Stokes
characteristic boundary condition (NSCBC) treatment[4, 5, 6] is used on the
boundaries.
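The paper does not reproduce the stencil coefficients; for reference, the standard eighth-order central difference for a first derivative on a uniform mesh of spacing $h$ uses the nine-point stencil

$$ f'_i \approx \frac{1}{h}\left[ \frac{4}{5}\left(f_{i+1}-f_{i-1}\right) - \frac{1}{5}\left(f_{i+2}-f_{i-2}\right) + \frac{4}{105}\left(f_{i+3}-f_{i-3}\right) - \frac{1}{280}\left(f_{i+4}-f_{i-4}\right) \right], $$

though S3D's exact interior and boundary coefficients may differ.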
Fully coupled mass conservation equations for the different chemical species
are solved as part of the simulation to obtain the chemical state of the sys-
tem. Detailed chemical kinetics and molecular transport models are used. An
optimized and fine-tuned library has been developed to compute the chemi-
cal reaction and species diffusion rates based on Sandia’s Chemkin package.
While Chemkin-standard chemistry and transport models are readily usable
with S3D, special attention is paid to the efficiency and performance of the
chemical models. Reduced chemical and transport models that are fine-tuned to the target problem are developed as a pre-processing step.
S3D is written entirely in Fortran. It is parallelized using a three-dimen-
sional domain decomposition and MPI communication. Each MPI process
is responsible for a piece of the three-dimensional domain. All MPI pro-
cesses have the same number of grid points and the same computational
load. Inter-processor communication is only between nearest neighbors in
a three-dimensional topology. A ghost-zone is constructed at the processor
boundaries by non-blocking MPI sends and receives among the nearest neigh-
bors in the three-dimensional processor topology. Global communications are
only required for monitoring and synchronization ahead of I/O.
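As a sketch of this communication pattern (written in C rather than S3D's Fortran, with illustrative buffer and rank names), a non-blocking ghost-zone exchange along one dimension might look like:

#include <mpi.h>

/* Exchange ghost zones with the two nearest neighbors along one dimension;
   a full decomposition repeats this for each of the three dimensions. */
void exchange_ghosts(double *send_lo, double *recv_lo,
                     double *send_hi, double *recv_hi,
                     int count, int lo, int hi, MPI_Comm comm) {
    MPI_Request req[4];
    MPI_Irecv(recv_lo, count, MPI_DOUBLE, lo, 0, comm, &req[0]);
    MPI_Irecv(recv_hi, count, MPI_DOUBLE, hi, 1, comm, &req[1]);
    MPI_Isend(send_hi, count, MPI_DOUBLE, hi, 0, comm, &req[2]);
    MPI_Isend(send_lo, count, MPI_DOUBLE, lo, 1, comm, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);  /* post all, then wait */
}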
S3D’s performance has been studied and optimized, including I/O[7] and
control flow[8]. Still, further improvements allow for increased grid size, more
simulation timesteps, and more species equations. These are critical to the
scientific goals of turbulent combustion simulations in that they help achieve
higher Reynolds numbers, better statistics through larger ensembles, more
complete temporal development of a turbulent flame, and the simulation of
fuels with greater chemical complexity.
Here we assess S3D code performance and parallel scaling through simu-
lation of a small amplitude pressure wave propagating through the domain
for a short period of time. The test is conducted with detailed ethylene-air
(C2H4) chemistry consisting of twenty-two chemical species and a mixture-averaged molecular transport model. Due to the detailed chemical model,
the code solves for twenty-two species equations in addition to the five fluid
dynamic variables.
1.4. Related Work
Recent work by a number of researchers has investigated GPGPU with
impressive results in a variety of domains. Owens et al. provide an excellent
history of the GPU [9], chronicling its transformation in great detail. It is
not uncommon to find researchers who achieve at least an order of magnitude
improvement over reference implementations. GPUs have been used to accel-
erate a variety of application kernels, including more traditional operations
like dense[10, 11, 12] and sparse[13] linear algebra as well as scatter-gather
techniques[14]. The GPU has been successfully applied to a wide variety of
fields including computational biophysics[15], molecular dynamics[16], and
medical imaging[17, 18]. Our work takes a slightly higher level approach.
While we do present performance measurements from an accelerated version
of S3D, we examine the acceleration process as a whole, and endeavor to
answer why certain applications perform so well on GPUs, while others fail
to achieve significant performance improvements.
2. Identifying Candidates for Acceleration
2.1. Profiling
The first step in evaluating a scientific application for acceleration is to identify its performance bottlenecks. The best-case scenario involves a
small number of computationally intense functions which comprise most of
the runtime. This is a fairly basic requirement and is a direct consequence
of Amdahl’s law. The CPU-based profiling tool TAU identified S3D’s getrates kernel as a major bottleneck[19]. This kernel involves calculating the
rates of chemical reactions occurring in the simulation at each point in space.
This computation represents about half of the total runtime with the cur-
rent chemistry model. As S3D’s chemical model becomes more complex, we
anticipate that the getrates kernel will more strongly dominate runtime. The greater the kernel’s share of total runtime, the greater the potential for application speedup. Therefore, the first kernel to be examined should
be the most time consuming.
2.2. Parallelism and Data Dependency
One of the main advantages of the GPU is the high number of proces-
sors, so it follows that kernels must exhibit a high degree of parallelism to be
successful on a heterogeneous GPU platform. While this can correspond to
task-based parallelism, GPUs have primarily been used for data-parallel op-
erations. This makes it difficult for GPUs to handle unstructured kernels, or
those with intricate patterns of data dependency. Indeed, in situations with
irregular control flow, individual threads can become serialized, which results
in performance loss. Since the memory architecture of a GPU is dramatically different from that of most CPUs, memory access times can differ by several orders of magnitude based on access pattern and type of memory. For example, on the Tesla, an access to shared block memory is two orders of magnitude faster than an access to global memory. Therefore, kernels must often be
chosen based on memory access pattern, or restructured such that memory
access is more uniform in nature. In S3D, the getrates kernel operates on
a regular three dimensional mesh, so access patterns are fairly uniform, an
easy case for the GPU.
The following pseudocode outlines the general structure of the sequential
getrates kernel. The outer three loops can be computed in parallel, since
points in the mesh are independent.
for x = 1 to length
  for y = 1 to length
    for z = 1 to length
      for n = 1 to nspecies
        grid[x][y][z][n] = F(grid[x][y][z][1:nspecies])
where length refers to the length of an edge of the cube, nspecies refers to the number of chemical species involved, and the function F is an abstraction of the more complex chemical computations.
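A minimal sketch of how this loop nest maps onto CUDA, with one thread per mesh point; the stub F and the flattened array layout are illustrative assumptions, not S3D's actual implementation:

#define MAX_SPECIES 32   // comfortably above the 22 species used here

// Stand-in for the chemistry abstracted as F above.
__device__ double F(const double *y, int nspecies, int n) {
    return y[n];         // stub; the real rate computation is far larger
}

__global__ void getrates_gpu(double *grid, int length, int nspecies) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per point
    int npts = length * length * length;
    if (tid >= npts) return;
    double y[MAX_SPECIES];
    for (int n = 0; n < nspecies; ++n)       // snapshot this point's species
        y[n] = grid[(size_t)tid * nspecies + n];
    for (int n = 0; n < nspecies; ++n)       // then compute each rate
        grid[(size_t)tid * nspecies + n] = F(y, nspecies, n);
}
// e.g.: getrates_gpu<<<(npts + 63) / 64, 64>>>(d_grid, length, 22);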
In addition to the innate parallelism of GPUs, the system’s intercon-
nection bus can also have serious performance consequences for accelerated
applications. Discrete GPUs are often connected via a PCI-e bus, which
introduces a substantial amount of latency into computations. This makes
GPUs more effective at problems in which bandwidth is much more impor-
tant than latency or those which have a high ratio of computation to data.
In these cases, the speedup in the calculations or the increased throughput
is sufficient to overcome performance costs associated with transferring data
across the bus. In ideal cases, a large amount of data can saturate the bus and amortize the associated startup costs. In effect, this overlaps communication time with computation.
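A hedged sketch of this overlap using CUDA streams; the chunking scheme, kernel, and buffer names are illustrative assumptions, and h_buf must be pinned host memory (e.g. from cudaMallocHost) for the copies to proceed asynchronously:

__global__ void process(double *chunk, int n);        // placeholder kernel

void pipeline(const double *h_buf, double *d_buf, int nchunks, int chunk) {
    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);
    for (int c = 0; c < nchunks; ++c) {
        cudaStream_t st = s[c % 2];                   // alternate two streams
        size_t off = (size_t)c * chunk;
        cudaMemcpyAsync(d_buf + off, h_buf + off, chunk * sizeof(double),
                        cudaMemcpyHostToDevice, st);  // copy chunk c in while
        process<<<(chunk + 127) / 128, 128, 0, st>>>(d_buf + off, chunk);
    }                                                 // the other stream computes
    cudaDeviceSynchronize();                          // drain both streams
    for (int i = 0; i < 2; ++i) cudaStreamDestroy(s[i]);
}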
3. Results and Discussion
3.1. Kernel Acceleration
Once a suitable portion of the application has been identified, the ac-
celeration process can begin. Parallel programming is inherently more diffi-
cult than sequential programming, and developing high performance code for
GPUs also incorporates complexity from architectural features. This “mem-
ory aware” programming environment grants the programmer control over
low level memory movement, but demands meticulous data orchestration to
maximize performance.
For S3D, the mapping between the getrates kernel and CUDA concepts is
fairly simple. Since getrates operates on a regular, three-dimensional mesh,
each point in the mesh is handled by a single thread. A block is composed
of a local region of the mesh. Block size varies between 64 and 128, based
on the available number of registers per GPU core, in order to maximize
occupancy.
During the development of the accelerated version of the getrates kernel,
the memory access pattern was the most important factor for performance.
When threads read or write memory in a highly parallel fashion, CUDA
coalesces the memory access into a single operation, which has a dramatic
and beneficial effect on performance. The optimized versions of the getrates
kernel also use batched memory transfers and exploit block shared memory.
This attention to detail pays off: for a single iteration of the kernel, the accelerated versions of getrates exhibit promising speedups over the serial CPU version, up to 31.4x in single precision and 17.0x in double precision (see Figure 2). The serial CPU version was measured on a 2.3 GHz quad-core AMD Opteron processor with 16 GB of memory.
Figure 2: Accelerated Kernel Results
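To make the coalescing point concrete, the pair of toy kernels below (illustrative, not from S3D) contrasts the two access patterns; on the C1060 generation, the unit-stride version allows a half-warp's loads to be combined into a few wide transactions, while the strided version does not:

// Unit stride: thread i touches word i, so adjacent threads read adjacent
// words and the hardware coalesces the accesses.
__global__ void copy_coalesced(const double *in, double *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: adjacent threads read words far apart (in must hold n * stride
// elements), so each load becomes its own memory transaction.
__global__ void copy_strided(const double *in, double *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(size_t)i * stride];
}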
3.2. Accuracy
While the evolution of the GPU has been remarkable, architectural rem-
nants of its original, specialized function remain. Perhaps the most relevant
of these to the scientific community is the bias towards single precision float-
ing point computations. Single precision arithmetic was sufficient for the
GPU’s original tasks (rasterization, etc.). GPU benchmarking traditionally
involved only these single precision computations, and performance demands
have clearly shaped the GPU’s allocation of hardware resources. Many GPUs are incapable of double precision, and those that are capable typically pay a high performance cost. This cost generally arises from the differing number of
floating point units, and it is almost always more than the performance dif-
ference between single and double precision on a traditional CPU. In S3D,
the cost can clearly be seen in the performance difference in the single versus
double precision versions of the getrates kernel.
From a performance standpoint, single precision computations are favor-
able compared to double precision, but the computations in scientific ap-
plications can be extremely sensitive to accuracy. Moreover, some double precision operations are not equivalent on the CPU and the GPU. GPUs
may sacrifice fully IEEE compliant floating point operations for greater per-
formance. For example, scientific applications frequently make extensive use
of transcendental functions (sin, cos, etc.), and the Tesla’s hardware intrinsics
for these functions are faster, but less accurate than their CPU counterparts.
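For example, in CUDA the standard single precision calls can be swapped for hardware intrinsics, trading accuracy for speed; a hypothetical fragment:

__global__ void eval(const float *x, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    out[i] = expf(x[i]) * sinf(x[i]);          // standard: slower, more accurate
    // out[i] = __expf(x[i]) * __sinf(x[i]);   // intrinsics: faster, reduced
                                               // accuracy; nvcc -use_fast_math
                                               // applies this substitution
                                               // globally
}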
3.2.1. Accuracy in S3D
In S3D, the reaction rates calculated by the getrates kernel are integrated
over time as the simulation progresses, and error from inaccurate reaction
rates compounds and propagates to other simulation variables. While this is
the first comparison of double and single precision versions of S3D, the issue
of accuracy has been previously studied, and some upper bounds for error are
known. S3D has an internal monitor for the estimated error from integration,
and can take smaller timesteps in an effort to improve accuracy. Figure 3
shows the estimated error from integration versus simulation time. In this
graph, the CPU and GPU DP versions quickly begin to agree, while the single
precision version is much more erratic. In both double precision versions, the
internal mechanism for timestep control succeeds in settling on a timestep of
appropriate size. The single precision version has a much weaker guarantee
on accuracy, and the monitor has a difficult time controlling the timestep,
oscillating between large timesteps with high error (sometimes beyond the
acceptable bounds), and short timesteps with very low error. The increased
number of timesteps required by the GPU single precision version will have
consequences for performance, which will be explored in a later section.
The error from low precision can also be observed in simulation variables
such as temperature (see Figure 4) or in chemical species, such as H2O2 (see Figure 5). The current test essentially simulates a rapid ignition, and a significant time gap can be seen between the rapid rise in temperature in the GPU single precision version and the other versions. In the sensitive
time scale of ignition, this gap represents a serious error. In Figure 5, the
error is much more pronounced, as the single precision version fails to predict
the sudden decrease in H2O2 which occurs roughly at time 4.00E-04.
A similar trend can be observed throughout many different simulation
variables in S3D. The CPU version tends to agree almost perfectly with
the GPU double precision version, while the single precision version deviates
Figure 3: Estimated Integrated Error. 1.00E-03 is the upper bound on acceptable error.
The GPU DP and CPU versions completely overlap beginning roughly at time 4.00E-04.
substantially. Consequently, while the single precision version is much faster,
it may be insufficient for sensitive simulations.
3.3. S3D Performance Results
In an ideal setting, the chosen kernel would strongly dominate the runtime
of the application. However, in S3D, the getrates kernel comprises roughly
half of the total runtime, with some variation based on problem size. Table 1
shows how speedup in the getrates kernel scales to whole-code performance
improvements. Amdahl’s limit is the theoretical upper bound on speedup,
$s_\infty \approx \frac{1}{1 - f_a}$, where $f_a$ is the fraction of runtime that is accelerated.
Table 1: Performance results (S = single precision, D = double precision)

Size   Kernel Speedup        % of     Amdahl's   Actual Speedup
       S          D          Total    Limit      S         D
32     29.50x     14.98x     50.0%    2.00x      1.90x     1.84x
48     31.44x     16.97x     51.0%    2.04x      1.91x     1.87x
60     31.40x     16.08x     52.5%    2.11x      1.95x     1.90x
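As a sanity check on Table 1, the finite-speedup form of Amdahl's law for the size-48 single precision case gives

$$ s = \frac{1}{(1 - f_a) + f_a / s_k} = \frac{1}{0.49 + 0.51 / 31.44} \approx 1.98, $$

slightly above the measured 1.91x; the remaining gap plausibly reflects overheads, such as host-device data transfer, that the kernel-only speedup excludes.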
Figure 4: Simulation temperature. Note the time gap in the rise in temperature at roughly time 3.00E-04. This corresponds to a delay in the prediction of ignition time.
In S3D, there is a complex relationship between performance and accu-
racy. When inaccuracy is detected, timestep size is reduced in an attempt to decrease error (see Figure 6). Since single precision is less accurate, one
can see erratic timestep sizes. This means that given the same number of
timesteps, a highly accurate computation can simulate more time. In order
to truly measure performance, it is important to normalize the wallclock time
to account for this effect. In Table 2, normalized cost is the wallclock time it
takes to simulate one nanosecond at one point in space. While the getrates
kernel can be executed faster in single precision, the lack of accuracy causes
the simulation to take very small timesteps. In some cases (typically very
long simulations), the loss of accuracy in single precision calculations causes
the total amount of simulated time to decrease, potentially eliminating any
performance benefits.
As mentioned in Section 1.3, S3D is distributed using MPI. The domain
sizes listed in Table 1 are representative of the work done by a single node
in a production level simulation. As such, it is important to characterize
the scaling behavior of the accelerated version. Figures 7 and 8 present
parallel speedup and efficiency results from the Lens cluster. The Lens cluster
Figure 5: Chemical Species H2O2. The CPU and GPU DP versions completely agree, while the GPU SP version significantly deviates, and fails to identify the dip at time 4.00E-04.
is made up of 32 nodes, with each node containing four quad-core AMD
Opteron processors, 64 GB of memory, and two GPUs: one Tesla C1060 and
one GeForce 8800GTX. In our experiments, we do not utilize the GeForce
8800GTX because it lacks the ability to perform double precision operations.
The accelerated version of S3D exhibited classic weak scaling, with parallel
efficiency ranging between 84% and 98%.
4. Conclusions
Graphics processors are rapidly emerging as a viable platform for high
performance scientific computing. Improvements in the programming en-
vironments and libraries for these devices are making them an appealing,
cost-effective way to increase application performance. While the popularity
of these devices has surged, GPUs may not be appropriate for all applica-
tions. They offer the greatest benefit to applications with well structured,
data-parallel kernels. Our study has described the strengths of GPUs, and
provided insights from our experience in accelerating S3D. We have also
Table 2: Performance results. Normalized cost is the average time it takes to simulate a single point in space for one nanosecond.

Size   Normalized Cost (microseconds)
       CPU     GPU DP    GPU SP
32     12.3    6.67      6.47
48     12.9    7.30      6.98
60     12.0    6.31      6.12
examined one of the most important aspects of GPUs for the scientific community: accuracy. The differences in accuracy between GPU and IEEE arith-
metic resulted in drastic consequences for correctness in S3D. Despite this
relative weakness, the heterogeneous GPU version of the kernel still manages
to outperform the more traditional CPU version and produce high quality
results in a real scientific application.
[1] NVIDIA, CUDA Programming Guide 2.3 Downloaded June 1, 2009,
www.nvidia.com/object/cudadevelop.html.
[2] E. R. Hawkes, R. Sankaran, J. C. Sutherland, J. H. Chen, Direct numer-
ical simulation of turbulent combustion: fundamental insights towards
predictive models, Journal of Physics: Conference Series 16 (2005) 65–
79.
[3] J. C. Sutherland, Evaluation of mixing and reaction models for large-
eddy simulation of nonpremixed combustion using direct numerical sim-
ulation, PhD thesis, Dept. of Chemical and Fuels Engineering, University of Utah.
[4] T. J. Poinsot, S. K. Lele, Boundary-conditions for direct simulations of
compressible viscous flows, Journal of Computational Physics 101 (1992)
104–129.
[5] J. C. Sutherland, C. A. Kennedy, Improved boundary conditions for
viscous, reacting, compressible flows, Journal of Computational Physics
191 (2003) 502–524.
[6] C. S. Yoo, Y. Wang, A. Trouve, H. G. Im, Characteristic boundary
conditions for direct simulations of turbulent counterflow flames, Com-
bustion Theory and Modelling 9 (2005) 617–646.
Figure 6: Timestep Size – This graph shows the size of the timesteps taken as the rapid
ignition simulation progressed. S3D reduces the timestep size when it detects integra-
tion inaccuracy. While the double precision versions take timesteps of roughly equivalent
size, the single precision version quickly reduces timestep size in an attempt to preserve
accuracy.
[7] W. Yu, J. Vetter, H. Oral, Performance characterization and optimization of parallel I/O on the Cray XT, in: IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2008), 2008, pp. 1–11.
[8] J. Mellor-Crummey, Harnessing the power of emerging petascale plat-
forms, Journal of Physics: Conference Series 78 (1) (2007) 12–48.
[9] J. Owens, M. Houston, D. Luebke, S. Green, J. Stone, J. Phillips, GPU
computing, Proceedings of the IEEE 96 (5) (2008) 879–899.
[10] S. Barrachina, M. Castillo, F. Igual, R. Mayo, Evaluation and tuning of the level 3 CUBLAS for graphics processors, in: IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2008), 2008, pp. 1–8.
[11] N. Fujimoto, Faster matrix-vector multiplication on GeForce 8800GTX, in: IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2008), 2008, pp. 1–8.
Figure 7: GPU Parallel Speedup – This graph characterizes the parallel scaling of the
accelerated version of S3D. As the number of nodes increases, both the single and double
precision versions exhibit proportional increases in performance.
[12] G. Cummins, R. Adams, T. Newell, Scientific computation through a GPU, in: IEEE SoutheastCon 2008, 2008, pp. 244–246.
[13] J. Bolz, I. Farmer, E. Grinspun, P. Schröder, Sparse matrix solvers on
the GPU: conjugate gradients and multigrid, in: SIGGRAPH ’03: ACM
SIGGRAPH 2003 Papers, 2003, pp. 917–924.
[14] B. He, N. K. Govindaraju, Q. Luo, B. Smith, Efficient gather and scatter
operations on graphics processors, in: SC ’07: Proceedings of the 2007
ACM/IEEE conference on Supercomputing, 2007, pp. 1–12.
[15] J. E. Stone, J. C. Phillips, P. L. Freddolino, D. J. Hardy, L. G. Trabuco,
K. Schulten, Accelerating molecular modeling applications with graphics
processors, Journal of Computational Chemistry 28 (2005) 2618–2640.
[16] C. I. Rodrigues, D. J. Hardy, J. E. Stone, K. Schulten, W.-M. W. Hwu,
GPU acceleration of cutoff pair potentials for molecular modeling appli-
cations, in: CF ’08: Proceedings of the 2008 conference on Computing
frontiers, 2008, pp. 273–282.
Figure 8: GPU Parallel Efficiency – This graph shows the parallel efficiency (parallel
speedup divided by the number of processors) for the accelerated versions of S3D.
[17] J. Kruger, R. Westermann, Acceleration techniques for GPU-based volume rendering, in: IEEE Visualization (VIS 2003), 2003, pp. 287–292.
[18] K. Mueller, F. Xu, Practical considerations for GPU-accelerated CT, in: 3rd IEEE International Symposium on Biomedical Imaging: Nano to Macro, 2006, pp. 1184–1187.
[19] S. Shende, A. D. Malony, J. Cuny, P. Beckman, S. Karmesin, K. Lindlan,
Portable profiling and tracing for parallel, scientific applications using
C++, in: SPDT ’98: Proceedings of the SIGMETRICS symposium on
Parallel and distributed tools, 1998, pp. 134–145.
