18CSE454T: HIGH PERFORMANCE COMPUTING
Course Syllabus
This is Unit 1 of High Performance Computing For SRM students
Outline
• Modern processor
• Stored-program computer architecture
• General-purpose cache-based microprocessor architecture
Modern Processor
• Supercomputing roughly between 1975 and 1995: leading-edge HPC systems were custom
designs marketed by companies like Cray, CDC, NEC, Fujitsu, or Thinking
Machines – outstanding in both performance and price.
• Single-chip general-purpose microprocessors emerged in the 1970s.
• Not until the 1990s did their theoretical peak performance catch up.
• Before that, cost-effective, off-the-shelf systems did not meet the
requirements of scientific computing.
• High application performance is needed on the single CPU level as well as for
highly parallel workloads
Modern Processor
• Standard cache-based microprocessors
• Vector computers support
• Supercomputer architecture
Discussed in further topics
What is available on your desktop?
•Intel Core i7 4790 (Haswell) processor
•4 cores at 4.0 GHz
•8 double precision floating point operations (FLOPs) per fused multiply add
(FMA) instruction
• 2 floating point units (FPUs) per core
• 16 FLOP per cycle per core
• FMA example: $0 = $0 x $2 + $1
•64 GFLOP/s per core theoretical peak
•256 GFLOP/s full system theoretical peak
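These peak numbers follow directly from the figures above (simple arithmetic, no further assumptions):
2 FPUs × 8 FLOPs per FMA = 16 FLOPs per cycle per core
16 FLOPs/cycle × 4.0 GHz = 64 GFLOP/s per core
64 GFLOP/s × 4 cores = 256 GFLOP/s for the full chip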
Stored-program computer architecture
• Turing in 1936
• First implemented in a real machine (EDVAC) in 1949 by Eckert and
Mauchly
• Block diagram for the stored-program digital computer
• Instructions represented as numbers that are stored as data in
memory.
• Instructions are read and executed by a control unit
• Separate arithmetic/logic unit is responsible for the actual
computations and manipulates data stored in memory along with the
instructions
• I/O facilities enable communication with users.
• Control and arithmetic units, together with their interfaces to memory and I/O, form the CPU
• Programming a stored-program computer amounts to modifying
instructions in memory
• Ex. Compiler - Translates high-level language like C or Fortran into
instructions that can be stored in memory and then executed by a
computer.
• Instructions and data must be continuously fed to the control and
arithmetic units, so the speed of the memory interface limits compute
performance: the von Neumann bottleneck.
• It is inherently sequential, processing a single instruction with single
operand or a group of operands from memory - SISD (Single Instruction
Single Data)
• Modified and extended to support parallelism in many different flavors
(SISD, SIMD, MISD, MIMD)
General-purpose cache-based microprocessor
architecture
• Arithmetic units for floating-point (FP) and integer (INT) operations
• Administrative logic that helps to feed those units with operands
• CPU registers - floating-point and integer - hold operands to be accessed by instructions with no
significant delay.
• All operands for arithmetic operations must reside in CPU registers
• Load (LD) and store (ST) units handle instructions that transfer data to and from registers
• Instructions are sorted into several queues, waiting to be executed
• Caches hold copies of data and instructions for reuse
• A large part of the chip area is occupied by caches
• Branch prediction, reorder buffers, data shortcuts, transaction queues, etc., round out a modern processor
Outline
• Performance metrics and benchmarks
Performance metrics and benchmarks
• Peak performance: the maximum rate at which a CPU core can execute its most relevant operations
• Real applications reach only a fraction of that limit, depending on many factors
• Peak performance is commonly quoted as the speed of a CPU
• Scientific computing is dominated by double-precision floating-point MUL and ADD
operations, so performance is counted in Flops/sec
• Divide, square root and trigonometric functions occur much less frequently
and are usually not counted
• High performance microprocessors are designed to deliver at most two or
four double-precision floating-point results per clock cycle
• Typical clock frequencies between 2 and 3 GHz,
• Peak arithmetic performance between 4 and 12 GFlops/sec per core.
• Bandwidth → GBytes/sec (to and from the caches and main memory)
• GFlops/sec and GBytes/sec metrics
• Cache-based microprocessor is very data-centric
• Computation → manipulation of data items
• Low-level benchmark is a program that tries to test some specific
feature of the architecture like, e.g., peak performance
Vector triad → a nested loop; the inner loop executes a multiply-add operation on
the elements of three vectors and stores the result into a fourth.
It measures the performance of data transfers between memory and the arithmetic units
of a processor
• Inner level: three load streams for arrays B, C and D and one store
stream for A are active.
• Depending on N, the inner loop might execute in so little time that it
cannot be measured accurately.
• The outer loop thus repeats the triad R times so that execution time
becomes large enough to be accurately measurable.
• The aim of the masked-out call to the dummy() subroutine is to
prevent the compiler from doing an obvious optimization
• Without the call, the compiler might discover that the inner loop does
not depend at all on the outer loop index j and drop the outer loop
right away.
• Call to dummy() fools the compiler into believing that the arrays may
change between outer loop iterations.
• The benchmark reports an MFlops/sec rate
• Wallclock time → elapsed time
• A wallclock time stamp is taken before and after the repeated triad, as sketched below
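A minimal Fortran sketch of the vector triad benchmark as described above; the sizes N and R, the timing routine get_walltime() and the subroutine dummy() are assumptions that must be provided by the benchmark harness:

  double precision, dimension(N) :: A, B, C, D
  double precision :: S, E, MFLOPS

  call get_walltime(S)                       ! time stamp before the measurement
  do j = 1, R                                ! outer loop: repeat R times
     do i = 1, N                             ! inner loop: the actual triad
        A(i) = B(i) + C(i) * D(i)            ! 2 flops per iteration
     enddo
     if (A(2) .lt. 0.d0) call dummy(A,B,C,D) ! masked-out call prevents the compiler
  enddo                                      ! from dropping the outer loop
  call get_walltime(E)                       ! time stamp after the measurement
  MFLOPS = R * N * 2.d0 / ((E - S) * 1.d6)   ! MFlops/sec rate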
• Performance graphs for the vector triad obtained on different
generations of cache-based microprocessors and a vector system.
• Performance grows with N until some maximum is reached, followed
by several sudden breakdowns
• Performance stays constant for very large loops
• Vector systems differ markedly from standard CPUs; the two meet different domains
of applicability
• Low-level benchmarks are powerful tools to get information about
the basic capabilities of a processor
Outline
• Transistors galore: Moore’s Law
• Pipelining
Transistors galore: Moore’s Law
• The “complexity” or general capability of computer chips doubles roughly
every 24 months: Moore’s Law
• Formulated by Gordon Moore, co-founder of Intel Corp.
• It refers to the number of components (transistors) on a chip at
minimal manufacturing cost per component
• Increasing chip transistor counts and clock speeds have enabled a steady growth in
processor performance
1965: G. Moore claimed
#transistors on “microchip”
doubles every 12-24 months
Moore’s law
Intel Sandy Bridge EP: 2.3 billion
Nvidia Kepler: 7 billion
Intel Broadwell: 7.2 billion
Nvidia Pascal: 15 billion
Concept
• Pipelined functional units:
• subdividing complex operations (like, e.g., floating point addition
and multiplication) into simple components that can be executed
using different functional units on the CPU
• Increase instruction throughput, i.e., the number of instructions
executed per clock cycle.
• Instruction level parallelism (ILP) - throughput of one instruction
per cycle
• Superscalar architecture:
• “Direct” instruction-level parallelism by enabling an instruction
throughput of more than one per cycle
• Modern microprocessors are up to six-way superscalar
• Data parallelism through SIMD instructions:
• SIMD (Single Instruction Multiple Data) instructions issue identical
operations on a whole array of integer or FP operands, usually in
special registers
• Improve arithmetic peak performance without the requirement
for increased superscalarity
• Out-of-order execution:
• If arguments to instructions are not available in registers “on
time,” e.g., because the memory subsystem is too slow to keep up
with processor speed, out-of-order execution can avoid idle times
(also called stalls)
• By executing instructions that appear later in the instruction
stream but have their parameters available
• Improves instruction throughput and makes it easier for compilers
to arrange machine code for optimal performance
• Reorder buffer that stores instructions until they become eligible
for execution.
• Larger caches:
• Small, fast, on-chip memories serve as temporary data storage for
holding copies of data that is to be used again “soon,” or that is
close to data that has recently been used.
• Increasing gap between processor and memory speeds
• Enlarging the cache size
• Simplified instruction set:
• CISC to the RISC paradigm
• CISC (Complex Instruction Set Computer), a processor executes
very complex, powerful instructions, requiring a large hardware
effort for decoding but keeping programs small and compact.
• Saved memory – scarce resource for a long time
• RISC (Reduced Instruction Set Computer) features a very simple
instruction set that can be executed very rapidly (few clock cycles
per instruction; in the extreme case each instruction takes only a
single cycle).
• Clock rate of microprocessors could be increased in a way that
would never have been possible with CISC.
• it frees up transistors for other uses
• Moore’s Law promises a steady growth in transistor count, but more
complexity does not automatically translate into more efficiency:
• average code will not be able to use them;
• number of independent instructions in a sequential instruction
stream is limited;
• steady increase in clock frequencies is required to keep the single-
core performance;
• faster clock boosts power dissipation, making idling transistors
even more useless
• simplify processor designs
• additional transistors for larger caches
• Multicore processors, i.e., several CPU cores on a single die or
socket
Pipelining
• If it takes m different steps to finish the product, m products are
continually worked on in different stages of completion
• Complex operations like loading and storing data or performing
floating-point arithmetic cannot be executed in a single cycle without
excessive hardware requirement
• assembly line concept
• “fetch–decode–execute” pipeline, in which each stage can operate
independently of the others
• While an instruction is being executed, another one is being decoded
and a third one is being fetched from instruction (L1I) cache
• These still complex tasks are usually broken down even
further, which allows a higher clock rate
• Ex: Floating point multiplication on A(:)=B(:)*C(:) is split into simple subtasks
such as separating mantissa and exponent, multiplying mantissas, adding
exponents, normalizing the result and inserting the sign
• Wind-up phase: results appear only after the pipeline has filled
• After that, all units are continuously busy, generating one result per cycle;
the multiply pipe is said to have a latency (or depth) m
• For a pipeline of depth m, executing N independent, subsequent
operations takes N + m − 1 cycles
• Expected speedup over a unit that needs m cycles per result:
Tseq / Tpipe = mN / (N + m − 1)
• Throughput is N / Tpipe = 1 / (1 + (m − 1)/N) results per cycle
• A fraction p of maximum throughput is reached at N = Nc:
p = 1 / (1 + (m − 1)/Nc) ⇒ Nc = (m − 1)p / (1 − p)
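Worked example (numbers chosen purely for illustration): for a pipeline of depth m = 5, reaching p = 90% of the maximum throughput requires
Nc = (5 − 1) × 0.9 / (1 − 0.9) = 36
i.e., at least 36 independent, subsequent operations must flow through the pipe.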
• pipeline bubbles - several tens of cycles for square root or divide,
often more than 100 for trigonometric functions – Stalls
• Avoiding such functions is thus a primary goal of code optimization
• “Real” code involves more operations like, e.g., loads, stores, address
calculations, instruction fetch and decode, etc.
• Each operand - way from memory to a register and each result must
be written out
• On in-order architectures, the compiler must arrange instructions in such a way as to make efficient use
of all the different pipelines
• Out-of-order processors reorder instructions in hardware, which helps to cope with the large
latencies of some operations
• If operands are not delivered “on time” to execution units, all the
complicated pipelining mechanisms are of no use
• simple scaling loop
do i=1,N
A(i) = s * A(i)
enddo
loop: load A(i)
mult A(i) = A(i) * s
store A(i)
i = i + 1
branch -> loop
• Although the multiply operation can be pipelined, the pipeline will stall if the load
operation on A(i) does not provide the data on time. Similarly, the
store operation can only commence if the latency for mult has passed
and a valid result is available.
• Assuming a latency of four cycles for load, two cycles for mult and
two cycles for store - extremely inefficient
• required to interleave different loop iterations to bridge the latencies
and avoid stalls
• wind-up and wind-down code
loop: load A(i+6)
mult A(i+2) = A(i+2) * s
store A(i)
i = i + 1
branch -> loop
• Interleaving of loop iterations in order to meet latency requirements
is called software pipelining
• loop-carried dependencies, i.e., if a loop iteration depends on the
result of some other iteration, there are situations when neither the
compiler nor the processor hardware can prevent pipeline stalls
• Pseudo dependency: a forward reference such as A(i) = s * A(i+1) does not inhibit pipelining
• A true dependency such as A(i) = s * A(i-1) does, because each iteration needs the result of the previous one
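A small Fortran sketch contrasting the two cases (illustrative only):

  ! Pseudo dependency: forward reference; the original value of A(i+1) is
  ! read before it is overwritten, so iterations are independent and the
  ! multiply pipeline can stay full.
  do i = 1, N-1
     A(i) = s * A(i+1)
  enddo

  ! True loop-carried dependency: each iteration needs the result of the
  ! previous one, so a new multiply can start only every m (latency) cycles.
  do i = 2, N
     A(i) = s * A(i-1)
  enddo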
Superscalarity
• Superscalar processors are capable of executing more than one instruction, or more generally
of producing more than one “result”, per cycle.
• Multiple instructions can be fetched and decoded concurrently
• Address and other integer calculations are performed in multiple
integer units (add, mult, shift, mask)
• Multiple floating-point pipelines can run in parallel;
• one or two combined multiply-add pipes that perform a=b+c*d with a
throughput of one each.
• Caches are fast enough to sustain more than one load or store operation
per cycle, and the number of available execution units for loads and
stores reflects this
• Special form of parallel execution, and a variant of instruction level
parallelism (ILP).
• Out-of-order execution and compiler optimization must work
together in order to fully exploit superscalarity
• Even on the most advanced architectures it is extremely hard for compiler-
generated code to achieve a throughput of more than 2–3
instructions per cycle.
SIMD
• First vector supercomputers in the 1970s
• Fundamental design principle for the massively parallel Connection
Machines in the 1980s and early 1990s
• Cache-based processors have instruction set extensions for both
integer and floating-point SIMD operations
• They allow the concurrent execution of arithmetic operations on a
“wide” register that can hold, e.g., two DP or four SP floating-point
words
• Example: two 128-bit registers hold four single-precision floating-
point values each
• A single instruction can initiate four additions at once
• SIMD does not specify anything about the possible concurrency of
those operations; the four additions could be truly parallel, if
sufficient arithmetic units are available, or just be fed to a single
pipeline
• The latter strategy uses SIMD as a device to reduce superscalarity (and
thus complexity) without sacrificing peak arithmetic performance; the
former option boosts peak performance
• Memory subsystem (or at least the cache) must be able to sustain
sufficient bandwidth to keep all units busy.
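A minimal Fortran loop of the kind a compiler can map onto such SIMD instructions (array names are illustrative):

  real, dimension(N) :: x, y, z   ! single precision (SP) operands

  ! With 128-bit SIMD registers holding four SP values, the compiler can
  ! process this loop in chunks of four additions per instruction.
  do i = 1, N
     z(i) = x(i) + y(i)
  enddo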
Outline
• Memory Hierarchies
• Cache
• Cache Mapping
• Prefetch
Memory Hierarchies
• Data can be stored in a computer system in many different ways
• CPUs have a set of registers, which can be accessed without delay.
• one or more small but very fast caches holding copies of recently
used data items
• Main memory is much slower, but also much larger than cache
• Finally, data can be stored on disk and copied to main memory as
needed
• vital to understand how data transfer works between the different
levels in order to identify performance bottlenecks
Cache
• Caches are low-capacity, high-speed memories that are commonly
integrated on the CPU die
• Need for caches
• Peak performance → several GFlops/sec per core
• Memory bandwidth → the rate at which data can be transferred from
memory to the CPU, is still stuck at a couple of GBytes/sec, which is
entirely insufficient to feed all arithmetic units and keep them busy
continuously
• Initial waiting time called latency passes until data can actually flow
• Memory latency → order of several hundred CPU
cycles
• Moore’s Law still guarantees a constant rate of
improvement in chip complexity and
performance
• DRAM gap → increasing “distance” between CPU
and memory in terms of latency and bandwidth
• At least two levels of cache called L1 and L2
• L1 is normally split into two parts, one for
instructions (“I-cache,” “L1I”) and one for data
(“L1D”). Outer cache levels are normally unified,
storing data as well as instructions
• The “closer” a cache is to the CPU’s registers, the
higher its bandwidth and the lower its latency
• When the CPU issues a read request (“load”) for transferring a data item to a
register, the first-level cache logic checks whether this item already
resides in cache
• If it does (cache hit), the request can be satisfied immediately, with low
latency
• On a cache miss, however, data must be fetched from outer cache levels
or, in the worst case, from main memory
• If all cache entries are occupied, a hardware-implemented algorithm
evicts old items from cache and replaces them with new data
• I-cache misses are rare events compared to D-cache misses
• Caches can only have a positive effect on performance if the data
access pattern of an application shows some locality of reference
• Data items that have been loaded into a cache are to be used again
“soon enough” to not have been evicted in the meantime. This is also
called temporal locality
• Estimate the performance gain:
• Let β be the cache reuse ratio, i.e., the fraction of loads or stores that
can be satisfied from cache because there was a recent load or store to
the same address
• Access time to main memory (latency and bandwidth) is denoted by Tm
• In-cache access time is reduced to Tc = Tm/τ, where τ > 1 is the cache speedup factor
• For some finite β, the average access time will thus be Tav = β Tc + (1−β) Tm
• We calculate an access performance gain of G(τ,β) = Tm / Tav = τ / (β + τ (1−β))
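A quick worked example with assumed numbers: if the cache is ten times faster than main memory (τ = 10) and 90% of all accesses hit (β = 0.9),
G(10, 0.9) = 10 / (0.9 + 10 × 0.1) = 10 / 1.9 ≈ 5.3
so the average access is only about five times faster, not ten; the remaining 10% of main memory accesses limit the gain.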
• supporting temporal locality is not sufficient.
• Many applications show streaming patterns where large amounts of
data are loaded into the CPU, modified and written back without the
potential of reuse “in time”
• The reuse ratio β is then zero
• Each new load is expensive as an item has to be evicted from cache and replaced
by the new one, incurring huge latency
• In order to reduce the latency penalty for streaming, caches feature a peculiar
organization into cache lines
• Advantage of cache lines is that the latency penalty of a cache miss occurs only
on the first miss on an item belonging to a line
• The cache hit ratio γ thus improves: neighboring items on the same line are already in cache when accessed
• This is called spatial locality; for a pure streaming pattern with lines of Lc words, γ = 1 − 1/Lc
• Without spatial locality (e.g., erratic or large-stride access), not only does each load incur a miss and subsequent latency penalty,
it also leads to the transfer of a whole cache line, polluting the
memory bus with data that will probably never be used
• Advantages of using cache lines prevail and very few processor
manufacturers have provided means of bypassing the mechanism.
• In presence of caches, if data to be written out already resides in
cache, a write hit occurs.
• There are several possibilities for handling this case, but usually
outermost caches work with a write-back strategy:
• The cache line is modified in cache and written to memory as a whole
when evicted
• On a write miss, however, cache-memory consistency dictates that
the cache line in question must first be transferred from memory to
cache before an entry can be modified. This is called write allocate
• This leads to the situation that a data write stream from CPU to memory
uses the bus twice: once for all the cache line allocations and once for
evicting modified lines (the triad A(i)=B(i)+C(i)*D(i) nominally moves four words per
iteration, three loads and one store; the write-allocate read of A’s line adds a fifth,
so the data transfer requirement is increased by 25%)
• Two ways of avoiding write-allocate transactions:
• Nontemporal stores. These are special store instructions that bypass
all cache levels and write directly to memory.
• Cache does not get “polluted” by store streams that do not exhibit
temporal locality anyway. In order to prevent excessive latencies,
there is usually a small write combine buffer, which bundles a number
of successive nontemporal stores
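As an illustration only, assuming the Intel Fortran compiler: a directive can request nontemporal stores for a streaming loop (a compiler-specific hint, not a portable language feature):

  !DIR$ VECTOR NONTEMPORAL
  do i = 1, N
     A(i) = B(i) + C(i) * D(i)   ! pure store stream for A: no reuse, bypass the cache
  enddo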
• Cache line zero. Special instructions “zero out” a cache line and mark
it as modified without a prior read.
• The data is written to memory when evicted. In comparison to
nontemporal stores, this technique uses up cache space for the store
stream.
• does not slow down store operations in cache-bound situations
• Cache line zero must be used with extreme care: All elements of a
cache line are evicted to memory, even if only a part of them were
actually modified.
Cache mapping
• If there is no restriction on which cache line may hold a copy of which
memory locations, the cache is called
fully associative.
• For each cache line the cache logic must store its location in the CPU’s
address space, and each memory access must be checked against the
list of all those addresses
• least recently used (LRU) strategy
• NRU (not recently used) or random replacement
• direct-mapped cache, which maps the full cache size repeatedly into
memory
• Memory locations that lie a multiple of the cache size apart are
always mapped to the same cache line, and the cache line that
corresponds to some address can be obtained very quickly by
masking out the most significant bits.
• The downside of a direct-mapped cache is that it is disposed toward
cache thrashing, which means that cache lines are loaded into and
evicted from the cache in rapid succession.
• “strided” vector triad code for DP data
do i=1,N,CACHE_SIZE_IN_BYTES/8
A(i) = B(i) + C(i) * D(i)
enddo
• To keep administrative overhead low and still reduce the danger of
conflict misses and cache thrashing, a set-associative cache is divided
into m direct-mapped caches equal in size, so-called ways
• The number of ways m is the number of different cache lines a
memory address can be mapped to (e.g., m = 2 for a two-way set-associative cache)
• On each memory access, the hardware merely has to determine
which way the data resides in or, in the case of a miss, to which of the
m possible cache lines it should be loaded.
• tradeoff between low latency and prevention of thrashing
• Innermost (L1) caches tend to be less set-associative than outer cache
levels.
• set-associativity varies between two- and 48-way.
• The effective cache size, i.e., the part of the cache that is actually
useful for exploiting spatial and temporal locality in an application
code could be quite small, depending on the number of data streams,
their strides and mutual offsets
Prefetch
• Exploiting spatial locality by the introduction of cache lines improves cache
efficiency a lot, but there is still the problem of latency on the first miss
do i=1,N
S = S + A(i)*A(i)
enddo
• Assuming a cache line length of four elements, three loads can be
satisfied from cache before another miss occurs.
• The long latency leads to long phases of inactivity on the memory
bus.
• Typical cache line lengths are between 64 and 128 bytes (8–16 DP words), so misses recur frequently
• Without latency hiding, streaming applications would suffer not only from
insufficient bandwidth but also from low memory bus utilization
• Assuming a typical commodity system with a memory latency of 50 ns
and a bandwidth of 10 GBytes/sec, a single 128-byte cache line
transfer takes 13 ns, so 80% of the potential bus bandwidth is unused.
Latency has thus an even more severe impact on performance than
bandwidth.
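The 80% figure follows directly from these numbers:
transfer time per line = 128 bytes / (10 GBytes/sec) = 12.8 ns ≈ 13 ns
bus utilization = 13 ns / (50 ns + 13 ns) ≈ 20%, so about 80% of the potential bandwidth is idle.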
• The latency problem can be solved in many cases
• Prefetching supplies the cache with data ahead of the actual
requirements of an application.
• The compiler can do this by interleaving special instructions with the
software pipelined instruction stream that “touch” cache lines early
enough to give the hardware time to load them into cache
asynchronously
• a hardware prefetcher that can detect regular access patterns and
tries to read ahead application data, keeping up the continuous data
stream and hence serving the same purpose as prefetch instructions
• The memory subsystem must be able to sustain a certain number of
outstanding prefetch operations, i.e., pending prefetch requests, or
else the memory pipeline will stall and latency cannot be hidden
completely.
• We can estimate the number of outstanding prefetches required for hiding the
latency completely: If Tℓ is the latency and B is the bandwidth, the transfer of a
whole line of length Lc (in bytes) takes a time of
• T = Tℓ+ Lc / B
• One prefetch operation must be initiated per cache line transfer, and the number
of cache lines that can be transferred during time T equals the number of outstanding
prefetches P required:
• P = T / (Lc/B) = 1+ (Tℓ / (Lc/B))
• As an example, for a cache line length of 128 bytes (16 DP words), B = 10
GBytes/sec and Tℓ =50 ns we get P≈5 outstanding prefetches. If this requirement
cannot be met, latency will not be hidden completely and the full memory
bandwidth will not be utilized.
• An application that executes so many floating-point operations on the cache line
data that they cannot be hidden behind the transfer will not be limited by
bandwidth and puts less strain on the memory subsystem
• Applications with heavy demands on bandwidth can overstrain the prefetch
mechanism.
• A second processor core using a shared path to memory can sometimes provide
for the missing prefetches, yielding a slight bandwidth boost
• Where main memory access is unavoidable, a good programming guideline is to try to
establish long continuous data streams
• The discussion above stresses the role of prefetching for hiding latency; the effects of
bandwidth limitations were ignored
• It should be clear that prefetching cannot enhance available memory
bandwidth, although the transfer time for a single cache line is
dominated by latency.
Outline
• Multi-core Processor
• Multi-threaded Processor
Multi Core Processor
• Moore’s Law is still valid, but
• Switching and leakage power of several hundred-million-transistor
chips is so large that further clock-rate increases are no longer feasible
• Advances and growing cache sizes alone will not be sufficient to keep
up the one-to-one correspondence of Moore’s Law with application
performance → multicore designs.
• For given semiconductor process technology, power dissipation of
modern CPUs is proportional to the third power of clock frequency fc
(actually it is linear in fc and quadratic in supply voltage Vcc, but a
decrease in fc allows for a proportional decrease in Vcc).
• Lowering fc and thus Vcc can therefore dramatically reduce power
dissipation.
• Assuming that a single core with clock frequency fc has a
performance of p and a power dissipation of W, some relative change
in performance εp = Δp/p will emerge for a relative clock change of
εf = Δfc/fc.
• All other things being equal, |εf| is an upper limit for |εp|, which in
turn will depend on the applications considered.
• Power dissipation at the changed clock frequency is
W + ΔW = (1 + εf)^3 W
• Reducing clock frequency opens the possibility to place more than
one CPU core on the same die; keeping the power envelope constant with m cores requires
(1 + εf)^3 m = 1 ⇒ εf = m^(−1/3) − 1
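Worked example from the formula above: going from one core to m = 2 cores within the same power envelope requires
εf = 2^(−1/3) − 1 ≈ −0.21
i.e., roughly a 21% lower clock frequency; assuming per-core performance scales with the clock, the aggregate peak still grows to about 2 × 0.79 ≈ 1.6 times the single-core value.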
• The cores on one die can either have separate caches or share certain levels.
• We will call a group of cores that share a certain cache level a cache group.
• For instance, a hexa-core chip may comprise six L1 groups (one core each), three dual-
core L2 groups, and one L3 group which encompasses the whole socket.
• Sharing a cache enables communication between cores without reverting to main
memory, reducing latency and improving bandwidth by about an order of magnitude.
• An adverse effect of sharing could be possible cache bandwidth bottlenecks. The
performance impact of shared and separate caches on applications is highly code- and
system-dependent.
• Most recent multicore designs feature an integrated memory controller
to which memory modules can be attached directly without separate
logic (“chipset”).
• This reduces main memory latency and allows the addition of fast
intersocket networks like HyperTransport or QuickPath . There may
exist fast data paths between caches to enable, e.g., efficient cache
coherence communication.
• A consequence of the multicore transition is the absolute necessity to put those resources
to efficient use by parallel programming
• A downside of multicore is the gradual reduction in main memory bandwidth and
cache size available per core.
• Even with larger caches, the performance of some algorithms is
always bound by main memory bandwidth, and multiple cores
sharing a common memory bus suffer from contention.
• the complex structure of shared and nonshared caches on current
multicore chips makes communication characteristics between
different cores highly nonisotropic.
• If there is a shared cache, two cores can exchange certain amounts of
information much faster; e.g., they can synchronize via a variable in
cache instead of having to exchange data over the memory bus
• depending on the communication characteristics and bandwidth
demands of running applications, it can be extremely important
where exactly multiple threads or processes are running in a
multicore.
Multi threaded Processor
• Modern processors attain high performance through the extensive
use of pipelining
• Factors inhibit the usage of pipelining
• Dependencies
• memory latencies,
• insufficient loop length
• unfortunate instruction mix
• branch misprediction
• Frequent pipeline bubbles mean that a large part of the execution resources
remains idle because of latencies and stalls
• The problem gets worse with longer pipelines
• Threading capabilities: Hyper-Threading or SMT (Simultaneous
Multithreading)
• SMT duplicates the architectural state, which comprises all data, status and control registers,
including stack and instruction pointers.
• Execution resources like arithmetic units, caches, queues, memory
interfaces etc. are not duplicated.
• Due to the multiple architectural states, the CPU appears to be
composed of several cores (logical processors) and can thus execute
multiple instruction streams, or threads, in parallel.
• fill bubbles in a pipeline due to stalls in one thread with instructions
from another
• If there are multiple pipelines that can run in parallel and one thread
leaves one or more of them in an idle state, another thread can use them
• SMT how fine-grained the switching between threads can be
performed on a pipeline
• Pipeline has to be flushed in order to support a new thread  no long stalls must be
bridged.
• SMT can enhance instruction throughput (instructions executed per cycle) if there is
potential to intersperse instructions from multiple threads within or across pipelines,
even with fine-grained two-way SMT.
• Well-optimized, floating-point-centric scientific code tends not to profit much from
SMT, but there are exceptions
• The number of outstanding memory references scales with the number of threads so
that full memory bandwidth can only be achieved if many threads are running
concurrently.
• The performance of a single thread is not improved, however, and
there may even be a small performance hit for a single instruction
stream if resources are hardwired to threads with SMT enabled.
• SMT can lead to an increase in capacity misses if the code is sensitive to cache
size, because multiple threads share many resources
• SMT may severely increase the cost of synchronization:
• If several threads on a physical core wait for some event to occur by
executing tight, “spin-waiting” loops, they strongly compete for the
shared execution units.
• This can lead to large synchronization latencies
Outline
• Vector processors
• Design principles
• Maximum performance estimates
• Programming for vector architectures
Vector processors
• Starting with the Cray 1, vector supercomputers were the most powerful machines
for scientific computing until RISC-based massively parallel
machines became available.
• Extreme demands on memory bandwidth and time to solution.
• They achieve a high ratio of real application performance to peak performance for suitable
“vectorizable” code
• They follow the SIMD (Single Instruction Multiple Data) paradigm, which demands that
a single machine instruction is automatically applied to a (presumably
large) number of arguments of the same type, i.e., a vector.
• Compared to modern cache-based microprocessors, vector computers
have much more massive parallelism built into their execution units
• SIMD architectures can exploit significant data-level parallelism for:
– matrix-oriented scientific computing
– media-oriented image and sound processors
• SIMD is more energy efficient than MIMD
– Only needs to fetch one instruction per data operation
– Makes SIMD attractive for personal mobile devices
• SIMD allows programmer to continue to think sequentially
Design principles
• Register-to-register machines:
Machine instructions operate on vector registers which
can hold a number of arguments, usually between 64 and 256 (double
precision)
• The width of a vector register is called the vector length Lv
• There is a pipeline for each arithmetic operation like addition,
multiplication, divide, etc., and each pipeline can deliver a certain
number of results per cycle.
• For MULT and ADD pipes, this number varies between two and 16
and one speaks of a multitrack pipeline
• Other operations like square root and divide are significantly more
complex, hence the pipeline throughput is much lower for them.
• A vector processor with single-track pipes can achieve a similar peak
performance per clock cycle as a superscalar cache-based
microprocessor
• To supply data to the vector registers, there are one or more load, store or
combined load/store pipes connecting directly to main memory
• NEC SX-9
• Example SIMD operation: A(1:N)=B(1:N)+C(1:N)
• On a cache-based microprocessor, the elements of A, B, and C are processed one by one:
for each index, two loads, one addition and one store operation
would have to be executed together with the required integer and
branch logic to perform the loop
• On a vector processor the loop maps to a short sequence of vector instructions:
• vload V1(1:N) = B(1:N)
• vload V2(1:N) = C(1:N)
• vadd V3(1:N) = V1(1:N) + V2(1:N)
• vstore A(1:N) = V3(1:N)
• V1, V2, and V3 denote vector registers
• The original arrays are traversed in chunks of the vector length
• An operation like vector addition does not have to wait until its
argument vector registers are completely filled but can commence
after some initial arguments are available.
• This feature is called chaining - different pipes (like MULT and ADD) to
operate concurrently
• Chaining was especially important in the pre-RISC era, when multi-issue superscalar processors with fast
instruction caches were unavailable.
• Feeding the arithmetic pipes with data requires a massively banked main
memory layout, because current memory chips require some cycles of
recovery time after each access: the bank busy time
• current vector computers provide thousands of memory banks
• Writing a program so that the compiler can generate effective SIMD
vector instructions is called vectorization
Maximum performance estimates
• The peak performance of a vector processor is given by the number
of tracks for the ADD and MULT pipelines and its clock frequency.
• Peak performance,
2 (ADD+MULT)×4 (tracks)×2 (GHz)=16 GFlops/sec
• Square root, divide and other operations are not considered
• Memory bandwidth: a single four-track LD/ST pipeline, used for either reading or
writing, delivers
4 (tracks)×2 (GHz)×8 (bytes)=64 GBytes/sec
• The memory interface of vector CPUs often runs at the same
frequency as the core, delivering more bandwidth in relation to peak
performance
• Nonvectorizable code cannot exploit these resources and performs poorly
• Performance prediction based on architectural properties and loop
code characteristics.
• vector triad - performs three loads, one store and two flops
(MULT+ADD).
• With only a single LD/ST pipe, loads and stores, and even loads to different
arrays, cannot overlap each other, but they can overlap arithmetic and
be chained to the arithmetic pipes.
• First a vector register must be loaded with data from array C. As the
LD/ST pipe starts filling a vector register with data from array D, the
MULT pipe can start performing arithmetic on C and D.
• As soon as data from B is available, the ADD pipe can compute the
final result, which is then stored to memory by the LD/ST pipe again.
• The single LD/ST pipe must move four words per iteration (three loads, one store)
while only two flops are performed, so the triad achieves 25% of peak performance
• This amounts to 4 GFlops/sec
Programming for vector architectures:
• A prerequisite for vectorization is the absence of true data dependencies across the
iterations of a loop
• As with software pipelining, forward references are allowed but backward
references inhibit vectorization
• Branches in inner loop kernels are also a problem
• Mask registers (essentially boolean registers with vector length) are
provided that allow selective execution of loop iterations.
do i = 1,N
if(y(i) .le. 0.d0) then
x(i) = s * y(i)
else
x(i) = y(i) * y(i)
endif
enddo
• A vector of boolean values is first generated from the branch
conditional using the logic pipeline
• This vector is then used to choose results from the if or else branch
• Both branches are executed for all loop indices which may be waste
of resources if expensive operations are involved
• For a single branch (no else part), especially in cases where expensive
operations like divide and square roots are involved, the gather/
scatter method is an efficient way to vectorize
do i = 1,N
if(y(i) .ge. 0.d0) then
x(i) = sqrt(y(i))
endif
enddo
• All needed elements of the required argument vectors are first
collected into vector registers (gather), then the vector operation is
executed on them and finally the results are stored back (scatter).
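Schematically, in the same pseudo vector-instruction notation used earlier (purely illustrative; the instruction names and the index list idx are assumptions):

vgather V1 = y(idx(1:M)) ! collect only the elements with y(i) >= 0
vsqrt V2(1:M) = sqrt(V1(1:M)) ! the expensive operation runs only on the needed elements
vscatter x(idx(1:M)) = V2(1:M) ! store the results back to their original positions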

More Related Content

PDF
Motivation for multithreaded architectures
PPTX
Advanced processor principles
PPTX
High performance computing
PPT
02_Computer Evolution and Performance.ppt
PDF
Challenges in Embedded Computing
PPT
Computer Evolution.ppt
PPTX
RISC Vs CISC Computer architecture and design
PPT
chap2_nnejjejehhehehhhhhhhhhehslides.ppt
Motivation for multithreaded architectures
Advanced processor principles
High performance computing
02_Computer Evolution and Performance.ppt
Challenges in Embedded Computing
Computer Evolution.ppt
RISC Vs CISC Computer architecture and design
chap2_nnejjejehhehehhhhhhhhhehslides.ppt

Similar to This is Unit 1 of High Performance Computing For SRM students (20)

PPT
Chap2 slides
PDF
chap2_slidesforparallelcomputingananthgarama
PDF
236UNIT5-COA.pdfvfrffccvnnnjuyyhgfeeesdd
PPT
Basics of micro controllers for biginners
PPTX
Microprocessor and Microcontroller Lecture.pptx
PPTX
Debate on RISC-CISC
PPTX
Chip Multithreading Systems Need a New Operating System Scheduler
PDF
7 processor
PDF
Multithreaded processors ppt
PPT
Parallel processing
PPTX
1.1.2 Processor and primary storage components.pptx
PPTX
1 microcontroller 8051 detailed explanation
PPTX
Microprocessor
PPTX
Uni Processor Architecture
PPT
Reduced instruction set computers
DOC
Aca module 1
PPTX
Processors selection
PPTX
Parallel Processors (SIMD)
PPTX
Parallel Processors (SIMD)
Chap2 slides
chap2_slidesforparallelcomputingananthgarama
236UNIT5-COA.pdfvfrffccvnnnjuyyhgfeeesdd
Basics of micro controllers for biginners
Microprocessor and Microcontroller Lecture.pptx
Debate on RISC-CISC
Chip Multithreading Systems Need a New Operating System Scheduler
7 processor
Multithreaded processors ppt
Parallel processing
1.1.2 Processor and primary storage components.pptx
1 microcontroller 8051 detailed explanation
Microprocessor
Uni Processor Architecture
Reduced instruction set computers
Aca module 1
Processors selection
Parallel Processors (SIMD)
Parallel Processors (SIMD)
Ad

Recently uploaded (20)

PPTX
UNIT-1 - COAL BASED THERMAL POWER PLANTS
PDF
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
PPTX
Welding lecture in detail for understanding
PDF
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
PPTX
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
PPTX
Construction Project Organization Group 2.pptx
PDF
Operating System & Kernel Study Guide-1 - converted.pdf
PPT
CRASH COURSE IN ALTERNATIVE PLUMBING CLASS
PDF
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
PPTX
Lecture Notes Electrical Wiring System Components
DOCX
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
PDF
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
PPTX
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
PPTX
Foundation to blockchain - A guide to Blockchain Tech
PDF
Well-logging-methods_new................
PDF
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
PPTX
OOP with Java - Java Introduction (Basics)
PPTX
Geodesy 1.pptx...............................................
PPT
Mechanical Engineering MATERIALS Selection
PPTX
CH1 Production IntroductoryConcepts.pptx
UNIT-1 - COAL BASED THERMAL POWER PLANTS
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
Welding lecture in detail for understanding
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
Construction Project Organization Group 2.pptx
Operating System & Kernel Study Guide-1 - converted.pdf
CRASH COURSE IN ALTERNATIVE PLUMBING CLASS
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
Lecture Notes Electrical Wiring System Components
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
Foundation to blockchain - A guide to Blockchain Tech
Well-logging-methods_new................
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
OOP with Java - Java Introduction (Basics)
Geodesy 1.pptx...............................................
Mechanical Engineering MATERIALS Selection
CH1 Production IntroductoryConcepts.pptx
Ad

This is Unit 1 of High Performance Computing For SRM students

  • 4. Outline • Modern processor • Stored-program computer architecture • General-purpose cache-based microprocessor architecture
  • 5. Modern Processor • Super Computing between 1975 and 1995. • HPC marketing companies like Cray, CDC, NEC, Fujitsu, or Thinking Machines – performance and price. • Single-chip general-purpose microprocessors – 1970’s • Till 1990’s – theoretical peak performance. • For cost-effective, off-the-shelf systems, scientific computing is not upto the requirement. • High application performance on the single CPU level as well as for highly parallel workloads
  • 6. Modern Processor • Standard cache-based microprocessors • Vector computers support • Supercomputer architecture Discussed in further topics
  • 7. What is available on your desktop? •Intel Core i7 4790 (Haswell) processor •4 cores at 4.0 GHz •8 double precision floating point operations per second (FLOP/s) – full fused multiply add instructions (FMAs) • 2 floating point units (FPUs) per core • 16 FLOP per cycle • FMA example: $0 = $0 x $2 + $1 •64 GFLOP/s per core theoretical peak •256 GFLOP/s full system theoretical peak 2/13/2024 Computational Methods 7
  • 8. Stored-program computer architecture • Turing in 1936 • First implemented in a real machine (EDVAC) in 1949 by Eckert and Mauchly • Block diagram for the stored-program digital computer
  • 9. • Instructions represented as numbers that are stored as data in memory. • Instructions are read and executed by a control unit • Separate arithmetic/logic unit is responsible for the actual computations and manipulates data stored in memory along with the instructions • I/O facilities enable communication with users. • Control and arithmetic units CPU - interfaces to memory and I/O • Programming a stored-program computer amounts to modifying instructions in memory • Ex. Compiler - Translates high-level language like C or Fortran into instructions that can be stored in memory and then executed by a computer.
  • 10. • Instructions and data must be continuously fed to the control and arithmetic units - Speed of the memory interface (Limitation) – Compute the performance - von Neumann bottleneck. • It is inherently sequential, processing a single instruction with single operand or a group of operands from memory - SISD (Single Instruction Single Data) • Modified and extended to support parallelism in many different flavors (SISD, SIMD, MISD, MIMD)
  • 12. • Arithmetic units for floating-point (FP) and integer (INT) operations • Administrative logic that helps to feed those units with operands • CPU registers - floating-point and integer - hold operands to be accessed by instructions with no significant delay. • All operands for arithmetic operations – CPU Register • Load (LD) and store (ST) units handle instructions that transfer data to and from registers • Instructions are sorted into several queues, waiting to be executed • caches hold data and instructions - Reuse • Chip area is usually occupied by caches • Branch prediction, reorder buffers, data shortcuts, transaction queues, etc., - Modern Processor
  • 14. Performance metrics and benchmarks • Peak performance (CPU Core) • CPU Core reaches to the certain limit with application based on factors. • Speed of a CPU. • Scientific Computing - FP – Double precision (MUL and ADD operation based on Flops/sec • Divide, square root, trigonometric functions are not encountered while executing the resource • High performance microprocessors are designed to deliver at most two or four double-precision floating-point results per clock cycle
  • 15. • Typical clock frequencies between 2 and 3 GHz, • Peak arithmetic performance between 4 and 12 GFlops/sec per core. • Bandwidth  GBytes/sec (To and from the caches and main memory) • GFlops/sec and GBytes/sec metrics • Cache-based microprocessor is very data-centric • Computation  manipulation of data items • Low-level benchmark is a program that tries to test some specific feature of the architecture like, e.g., peak performance
  • 16. Vector triad  nested loop, the inner level executing a multiply add operation on the elements of three vectors and storing the result. Measure the performance of data transfers between memory and arithmetic units of a processor
  • 17. • Inner level: three load streams for arrays B, C and D and one store stream for A are active. • Depending on N, loop might execute in a very small time and hard to measure. • The outer loop thus repeats the triad R times so that execution time becomes large enough to be accurately measurable. • The aim of the masked-out call to the dummy() subroutine is to prevent the compiler from doing an obvious optimization • Without the call, the compiler might discover that the inner loop does not depend at all on the outer loop index j and dropthe outer loop right away. • Call to dummy() fools the compiler into believing that the arrays may change between outer loop iterations.
  • 18. • MFlops/sec rate • Wallclock time  Elapsed time • Get a wallclock time stamp • Performance graphs for the vector triad obtained on different generations of cache-based microprocessors and a vector system.
  • 19. • Performance grows with N until some maximum is reached, followed by several sudden breakdowns • Performance stays constant for very large loops • Vector systems to standard CPUs in that they meet different domains of applicability • Low-level benchmarks are powerful tools to get information about the basic capabilities of a processor
  • 20. Outline • Transistors galore: Moore’s Law • Pipelining
  • 21. Transistors galore: Moore’s Law • Build computer chips, their “complexity” or general capability doubled 24 months Moore’s Law • Gordon Moore, co-founder of Intel Corp. • Number of components (transistors) on a chip • Minimal manufacturing cost per component • Increasing chip transistor counts and clock speeds have enabled processor
  • 22. 1965: G. Moore claimed #transistors on “microchip” doubles every 12-24 months Moore’s law Intel Sandy Bridge EP: 2.3 billion Nvidia Kepler: 7 billion Intel Broadwell: 7.2 billion Nvidia Pascal: 15 billion
  • 23. Concept • Pipelined functional units: • subdividing complex operations (like, e.g., floating point addition and multiplication) into simple components that can be executed using different functional units on the CPU • Increase instruction throughput, i.e., the number of instructions executed per clock cycle. • Instruction level parallelism (ILP) - throughput of one instruction per cycle
  • 24. • Superscalar architecture: • Direct” instruction-level parallelism by enabling an instruction throughput of more than one per cycle • Modern microprocessors are up to six-way superscalar • Data parallelism through SIMD instructions: • SIMD (Single Instruction Multiple Data) instructions issue identical operations on a whole array of integer or FP operands, usually in special registers • Improve arithmetic peak performance without the requirement for increased superscalarity
  • 25. • Out-of-order execution: • If arguments to instructions are not available in registers “on time,” e.g., because the memory subsystem is too slow to keep up with processor speed, out-of-order execution can avoid idle times (also called stalls) • By executing instructions that appear later in the instruction stream but have their parameters available • Improves instruction throughput and makes it easier for compilers to arrange machine code for optimal performance • Reorder buffer that stores instructions until they become eligible for execution.
  • 26. • Larger caches: • Small, fast, on-chip memories serve as temporary data storage for holding copies of data that is to be used again “soon,” or that is close to data that has recently been used. • Increasing gap between processor and memory speeds • Enlarging the cache size • Simplified instruction set: • CISC to the RISC paradigm • CISC (Complex Instruction Set Computer), a processor executes very complex, powerful instructions, requiring a large hardware effort for decoding but keeping programs small and compact.
  • 27. • Saved memory – scarce resource for a long time • RISC (Reduced Instruction Set Computer) features a very simple instruction set that can be executed very rapidly (few clock cycles per instruction; in the extreme case each instruction takes only a single cycle). • Clock rate of microprocessors could be increased in a way that would never have been possible with CISC. • it frees up transistors for other uses • Moore’s Law promises a steady growth in transistor count, but more complexity does not automatically translate into more efficiency:
  • 28. • average code will not be able to use them; • number of independent instructions in a sequential instruction stream is limited; • steady increase in clock frequencies is required to keep the single- core performance; • faster clock boosts power dissipation, making idling transistors even more useless • simplify processor designs • additional transistors for larger caches • Multicore processors, i.e., several CPU cores on a single die or socket
  • 29. Pipelining • If it takes m different steps to finish the product, m products are continually worked on in different stages of completion • Complex operations like loading and storing data or performing floating-point arithmetic cannot be executed in a single cycle without excessive hardware requirement • assembly line concept • fetch–decode–execute” pipeline, in which each stage can operate independently of the others • While an instruction is being executed, another one is being decoded and a third one is being fetched from instruction (L1I) cache
  • 30. • Limitation: These still complex tasks are usually broken down even further (higher clock rate) • Ex: Floating point multiplication: Simple subtasks - A(:)=B(:)*C(:)
  • 31. • wind-up phase • all units are continuously busy, generating one result per cycle - multiply pipe has a latency (or depth) • for a pipeline of depth m, executing N independent, subsequent operations takes N +m−1 steps • expected speedup - m cycles to generate a single result • Tseq/ Tpipe = mN / N +m−1 • Throughput is N / Tpipe = 1/ (1+ (m−1/N)) • p = 1/(1+ (m−1/Nc)) ⇒ Nc = ((m−1)p) / (1− p)
  • 32. • pipeline bubbles - several tens of cycles for square root or divide, often more than 100 for trigonometric functions – Stalls • Avoiding such functions is thus a primary goal of code optimization real” code involves more operations like, e.g., loads, stores, address calculations, instruction fetch and decode, etc.,
  • 33. • Each operand - way from memory to a register and each result must be written out • compiler to arrange instructions in such a way as to make efficient use of all the different pipelines • (In order architecture) out-of-order processors due to the large latencies for some operations • If operands are not delivered “on time” to execution units, all the complicated pipelining mechanisms are of no use • simple scaling loop do i=1,N A(i) = s * A(i)  enddo loop: load A(i) mult A(i) = A(i) * s store A(i) i = i + 1 branch -> loop
  • 34. • multiply operation can be pipelined, the pipeline will stall if the load operation on A(i) does not provide the data on time. Similarly, the store operation can only commence if the latency for mult has passed and a valid result is available. • Assuming a latency of four cycles for load, two cycles for mult and two cycles for store - extremely inefficient • required to interleave different loop iterations to bridge the latencies and avoid stalls • wind-up and wind-down code loop: load A(i+6) mult A(i+2) = A(i+2) * s store A(i) i = i + 1 branch -> loop
  • 35. • Interleaving of loop iterations in order to meet latency requirements is called software pipelining • loop-carried dependencies, i.e., if a loop iteration depends on the result of some other iteration, there are situations when neither the compiler nor the processor hardware can prevent pipeline stalls • Pseudo dependency • A(i+1); A(i) ; A(i-1)
  • 36. Superscalarity • Capable of executing more than one instruction or, more  more than one “result” per cycle. • Multiple instructions can be fetched and decoded concurrently • Address and other integer calculations are performed in multiple integer (add, mult, shift, mask) • Multiple floating-point pipelines can run in parallel; • one or two combined multiply-add pipes that perform a=b+c*d with a throughput of one each. • Caches are fast enough to sustain more than one load or store operation per cycle, and the number of available execution units for loads and stores reflects
  • 37. • Special form of parallel execution, and a variant of instruction level parallelism (ILP). • Out-of-order execution and compiler optimization must work together in order to fully exploit superscalarity • most advanced architectures it is extremely hard for compiler- generated code to achieve a throughput of more than 2–3 instructions per cycle.
  • 38. SIMD
  • 39. • First vector supercomputers in the 1970s • Fundamental design principle for the massively parallel Connection Machines in the 1980s and early 1990s • Cache-based processors have instruction set extensions for both integer and floating-point SIMD operations • They allow the concurrent execution of arithmetic operations on a “wide” register that can hold, e.g., two DP or four SP floating-point words • Fig. where two 128-bit registers hold four single-precision floating- point values each • A single instruction can initiate four additions at once
  • 40. • SIMD does not specify anything about the possible concurrency of those operations; the four additions could be truly parallel, if sufficient arithmetic units are available, or just be fed to a single pipeline • Latter strategy uses SIMD as a device to reduce superscalarity (and thus complexity) without sacrificing peak arithmetic performance, the former option boosts peak performance • Memory subsystem (or at least the cache) must be able to sustain sufficient bandwidth to keep all units busy.
  • 41. Outline • Memory Hierarchies • Cache • Cache Mapping • Prefetch
  • 42. Memory Hierarchies • Data can be stored in a computer system in many different ways • As CPUs have a set of registers, which can be accessed without delay. • one or more small but very fast caches holding copies of recently used data items • Main memory is much slower, but also much larger than cache • Finally, data can be stored on disk and copied to main memory as needed • vital to understand how data transfer works between the different levels in order to identify performance bottlenecks
  • 43. Cache • Caches are low-capacity, high-speed memories that are commonly integrated on the CPU die • Need for caches • Peak performance  several GFlops/sec per core • Memory bandwidth  the rate at which data can be transferred from memory to the CPU, is still stuck at a couple of GBytes/sec, which is entirely insufficient to feed all arithmetic units and keep them busy continuously • Initial waiting time called latency passes until data can actually flow
  • 44. • Memory latency  order of several hundred CPU cycles • Moore’s Law still guarantees a constant rate of improvement in chip complexity and performance • DRAM gap  increasing “distance” between CPU and memory in terms of latency and bandwidth • At least two levels of cache called L1 and L2 • L1 is normally split into two parts, one for instructions (“I-cache,” “L1I”) and one for data (“L1D”). Outer cache levels are normally unified, storing data as well as instructions • “closer” a cache is to the CPU’s registers, i.e., the higher its bandwidth and the lower its latency
  • 45. • CPU issues a read request (“load”) for transferring a data item to a register, first-level cache logic checks whether this item already resides in cache • Cache hit and the request can be satisfied immediately, with low latency • Cache miss, however, data must be fetched from outer cache levels or, in the worst case, from main memory • If all cache entries are occupied, a hardware-implemented algorithm evicts old items from cache and replaces them with new data • I-cache misses are rare events compared to D-cache misses
  • 46. • Caches can only have a positive effect on performance if the data access pattern of an application shows some locality of reference • Data items that have been loaded into a cache are to be used again “soon enough” to not have been evicted in the meantime. This is also called temporal locality • Estimate the performance gain • Let be the cache reuse ratio, i.e., the fraction of loads or stores that can be satisfied from cache because there was a recent load or store to the same address • Access time to main memory (latency and bandwidth) is denoted by Tm. • Access time is reduced to Tc = Tm/. For some finite , the average access time will thus be Tav =Tc+(1−)Tm • we calculate an access performance gain of
  • 47. • Supporting temporal locality alone is not sufficient • Many applications show streaming patterns, where large amounts of data are loaded into the CPU, modified, and written back without any potential for reuse “in time” • The reuse ratio is zero for streaming • Each new load is expensive, as an item has to be evicted from cache and replaced by the new one, incurring a large latency • In order to reduce the latency penalty for streaming, caches are organized into cache lines • The advantage of cache lines is that the latency penalty of a cache miss occurs only on the first miss on an item belonging to a line • Accesses to the remaining items of the line are cache hits, raising the cache hit ratio • Exploiting cache lines in this way relies on spatial locality: neighboring data items are used close together in time
  • 48. • Not only does each load incur a miss and a subsequent latency penalty, it also leads to the transfer of a whole cache line, polluting the memory bus with data that will probably never be used • Nevertheless, the advantages of using cache lines prevail, and very few processor manufacturers have provided means of bypassing the mechanism • In the presence of caches, if data to be written out already resides in cache, a write hit occurs • There are several possibilities for handling this case, but usually the outermost caches work with a write-back strategy: the cache line is modified in cache and written to memory as a whole when evicted
  • 49. • On a write miss, however, cache-memory consistency dictates that the cache line in question must first be transferred from memory to cache before an entry can be modified; this is called write allocate • It leads to the situation that a data write stream from CPU to memory uses the bus twice: once for all the cache line allocations and once for evicting modified lines (the vector triad benchmark moves three load streams and one store stream, so the extra read stream caused by write allocate increases its data transfer requirement by 25%) • This motivates techniques for avoiding write-allocate transactions:
  • 50. • Nontemporal stores: special store instructions that bypass all cache levels and write directly to memory • The cache does not get “polluted” by store streams that do not exhibit temporal locality anyway; in order to prevent excessive latencies, there is usually a small write-combine buffer that bundles a number of successive nontemporal stores • Cache line zero: special instructions “zero out” a cache line and mark it as modified without a prior read • The data is written to memory when evicted • In comparison to nontemporal stores, this technique uses up cache space for the store stream, but it does not slow down store operations in cache-bound situations • Cache line zero must be used with extreme care: all elements of a cache line are evicted to memory, even if only a part of them was actually modified.
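  A minimal C sketch of the nontemporal-store idea, using x86 SSE2 intrinsics as an assumed target (the destination must be 16-byte aligned; remainder handling for odd n is omitted):

  #include <emmintrin.h>                 /* SSE2: _mm_stream_pd bypasses the cache hierarchy */

  /* Fill an array with a constant using nontemporal stores (no write allocate).
     Assumption: a is 16-byte aligned and n is even. */
  void fill_nt(double *a, long n, double val) {
      __m128d v = _mm_set1_pd(val);      /* two copies of val in one 128-bit register */
      for (long i = 0; i + 1 < n; i += 2)
          _mm_stream_pd(&a[i], v);       /* store directly to memory, not to cache */
      _mm_sfence();                      /* drain the write-combine buffer */
  }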
  • 51. Cache mapping • In the most general case there is no restriction on which cache line can be associated with which memory location; such a cache is called fully associative • For each cache line the cache logic must store its location in the CPU’s address space, and each memory access must be checked against the list of all those addresses • Replacement typically follows a least recently used (LRU) strategy; cheaper alternatives are NRU (not recently used) or random replacement
  • 53. • The other extreme is a direct-mapped cache, which maps the full cache size repeatedly into memory • Memory locations that lie a multiple of the cache size apart are always mapped to the same cache line, and the cache line that corresponds to some address can be obtained very quickly by masking out the most significant bits • The downside of a direct-mapped cache is that it is prone to cache thrashing, which means that cache lines are loaded into and evicted from the cache in rapid succession • Example: a “strided” vector triad for DP data, in which successive accesses are exactly one cache size apart:
    do i = 1,N,CACHE_SIZE_IN_BYTES/8
      A(i) = B(i) + C(i) * D(i)
    enddo
  • 54. • To keep the administrative overhead low and still reduce the danger of conflict misses and cache thrashing, a set-associative cache is divided into m direct-mapped caches of equal size, so-called ways
  • 55. • The number of ways m is the number of different cache lines a memory address can be mapped to (e.g., m = 2 for a two-way set-associative cache) • On each memory access, the hardware merely has to determine which way the data resides in or, in the case of a miss, to which of the m possible cache lines it should be loaded • This is a tradeoff between low latency and prevention of thrashing • Innermost (L1) caches tend to be less set-associative than outer cache levels; in current processors, set-associativity varies between two- and 48-way • The effective cache size, i.e., the part of the cache that is actually useful for exploiting spatial and temporal locality in an application code, can be quite small, depending on the number of data streams, their strides and their mutual offsets
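  A small C sketch of how a set-associative cache selects the set for a given address; the cache parameters below are hypothetical and chosen only for illustration:

  #include <stdio.h>

  /* Hypothetical cache: 32 KiB, 8-way set-associative, 64-byte lines -> 64 sets */
  enum { CACHE_SIZE = 32 * 1024, LINE_SIZE = 64, WAYS = 8,
         NUM_SETS = CACHE_SIZE / (LINE_SIZE * WAYS) };

  /* The set index is the cache-line index modulo the number of sets */
  static unsigned cache_set(unsigned long addr) {
      return (unsigned)((addr / LINE_SIZE) % NUM_SETS);
  }

  int main(void) {
      /* Addresses that are NUM_SETS*LINE_SIZE (= 4 KiB here) apart share a set;
         more than WAYS such conflicting streams lead to thrashing. */
      for (int k = 0; k < 3; k++) {
          unsigned long addr = (unsigned long)k * NUM_SETS * LINE_SIZE;
          printf("address %#lx maps to set %u\n", addr, cache_set(addr));
      }
      return 0;
  }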
  • 56. Prefetch • Exploiting spatial locality by the introduction of cache lines improves cache efficiency a lot, but there is still the problem of latency on the first miss, e.g., in a simple sum of squares:
    do i = 1,N
      S = S + A(i)*A(i)
    enddo
  • Assuming a cache line length of four elements, three loads can be satisfied from cache before another miss occurs • The long latency leads to long phases of inactivity on the memory bus
  • 58. • Typical cache line lengths are between 64 and 128 bytes (8–16 DP words) • Even so, every new cache line still incurs the full memory latency, and streaming applications would suffer not only from insufficient bandwidth but also from low memory bus utilization • Assuming a typical commodity system with a memory latency of 50 ns and a bandwidth of 10 GBytes/sec, a single 128-byte cache line transfer takes about 13 ns, so roughly 80% of the potential bus bandwidth is unused; latency thus has an even more severe impact on performance than bandwidth • The latency problem can be solved in many cases by prefetching • Prefetching supplies the cache with data ahead of the actual requirements of an application • The compiler can do this by interleaving special prefetch instructions with the software-pipelined instruction stream; these “touch” cache lines early enough to give the hardware time to load them into cache asynchronously
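  A hedged C sketch of software prefetching, using the GCC/Clang builtin __builtin_prefetch; the prefetch distance and the 8-element (64-byte) line granularity are illustrative assumptions that would have to be tuned for a real machine, and compilers or hardware prefetchers often make explicit prefetching unnecessary:

  /* Sum of squares with explicit software prefetch. */
  #define PF_DIST 64                     /* prefetch distance in elements (assumed) */

  double sum_squares(const double *a, long n) {
      double s = 0.0;
      for (long i = 0; i < n; i++) {
          if ((i & 7) == 0 && i + PF_DIST < n)
              __builtin_prefetch(&a[i + PF_DIST], 0, 0);  /* read, low temporal locality */
          s += a[i] * a[i];
      }
      return s;
  }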
  • 59. • Most processors also have a hardware prefetcher that can detect regular access patterns and tries to read ahead application data, keeping up a continuous data stream and hence serving the same purpose as prefetch instructions • The memory subsystem must be able to sustain a certain number of outstanding prefetch operations, i.e., pending prefetch requests, or else the memory pipeline will stall and latency cannot be hidden completely • We can estimate the number of outstanding prefetches required for hiding the latency completely: if Tℓ is the latency and B is the bandwidth, the transfer of a whole cache line of length Lc (in bytes) takes a time of T = Tℓ + Lc/B • One prefetch operation must be initiated per cache line transfer, and the number of cache lines that can be transferred during time T equals the number of prefetches P that must be in flight:
  • 61. • P = T / (Lc/B) = 1 + Tℓ / (Lc/B) • As an example, for a cache line length of 128 bytes (16 DP words), B = 10 GBytes/sec and Tℓ = 50 ns, we get P ≈ 5 outstanding prefetches • If this requirement cannot be met, latency will not be hidden completely and the full memory bandwidth will not be utilized • An application that executes so many floating-point operations on the cache line data that they cannot be hidden behind the transfer will not be limited by bandwidth and puts less strain on the memory subsystem • Applications with heavy demands on bandwidth can, however, overstrain the prefetch mechanism • A second processor core using a shared path to memory can sometimes provide the missing prefetches, yielding a slight bandwidth boost • Whenever main memory access is unavoidable, a good programming guideline is to try to establish long continuous data streams
  • 62. • The discussion above stresses the role of prefetching for hiding latency, but the effects of bandwidth limitations were ignored • It should be clear that prefetching cannot enhance the available memory bandwidth, even though the transfer time for a single cache line is dominated by latency.
  • 63. Outline • Multi-core Processor • Multi-threaded Processor
  • 64. Multi Core Processor • Moore’s Law is still valid, but the switching and leakage power of several-hundred-million-transistor chips is so large that further increases in clock frequency are no longer feasible • Architectural advances and growing cache sizes alone will not be sufficient to keep up the one-to-one correspondence of Moore’s Law with application performance, hence multicore designs • For a given semiconductor process technology, the power dissipation of modern CPUs is proportional to the third power of the clock frequency fc (actually it is linear in fc and quadratic in the supply voltage Vcc, but a decrease in fc allows for a proportional decrease in Vcc).
  • 65. • Lowering fc, and thus Vcc, can therefore dramatically reduce power dissipation • Assuming that a single core with clock frequency fc has a performance of p and a power dissipation of W, some relative change in performance εp = Δp/p will emerge for a relative clock change of εf = Δfc/fc • All other things being equal, |εf| is an upper limit for |εp|, which in turn will depend on the applications considered • The power dissipation then becomes W + ΔW = (1+εf)^3 W • Reducing the clock frequency thus opens the possibility to place more than one CPU core on the same die within the same power envelope
  • 66. • If m cores are to run at reduced frequency within the power budget of the single original core, then (1+εf)^3 · m = 1, i.e., εf = m^(−1/3) − 1 • The cores on one die can either have separate caches or share certain levels • We will call a group of cores that share a certain cache level a cache group • For instance, a hexa-core chip may comprise six L1 groups (one core each), three dual-core L2 groups, and one L3 group which encompasses the whole socket • Sharing a cache enables communication between cores without reverting to main memory, reducing latency and improving bandwidth by about an order of magnitude • An adverse effect of sharing could be possible cache bandwidth bottlenecks • The performance impact of shared and separate caches on applications is highly code- and system-dependent.
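  A quick worked example of the last relation (generic numbers, not from the slides): for m = 2, εf = 2^(−1/3) − 1 ≈ −0.21, i.e., two cores clocked about 21% lower dissipate roughly the same power as one core at full clock; if an application scales to both cores, achievable performance grows by up to a factor of 2 × 0.79 ≈ 1.6.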
  • 67. • Most recent multicore designs feature an integrated memory controller to which memory modules can be attached directly, without separate logic (“chipset”) • This reduces main memory latency and allows the addition of fast intersocket networks like HyperTransport or QuickPath • There may also exist fast data paths between caches to enable, e.g., efficient cache coherence communication • One consequence of the multicore transition is the absolute necessity to put those resources to efficient use by parallel programming • Another downside of multicore is the gradual reduction in main memory bandwidth and cache size available per core • Even with larger caches, the performance of some algorithms is always bound by main memory bandwidth, and multiple cores sharing a common memory bus suffer from contention.
  • 68. • The complex structure of shared and nonshared caches on current multicore chips makes communication characteristics between different cores highly nonisotropic • If there is a shared cache, two cores can exchange certain amounts of information much faster; e.g., they can synchronize via a variable in cache instead of having to exchange data over the memory bus • Depending on the communication characteristics and bandwidth demands of running applications, it can therefore be extremely important where exactly multiple threads or processes are running within a multicore system.
  • 69. Multi threaded Processor • Modern processors attain high performance mainly through pipelining • Several factors inhibit the full use of the pipelines: • dependencies • memory latencies • insufficient loop length • unfortunate instruction mix • branch misprediction
  • 70. • Frequent pipeline bubbles mean that a large part of the execution resources remains idle during latencies and stalls, and the problem becomes worse with longer pipelines • Threading capabilities such as Hyper-Threading or SMT (Simultaneous Multithreading) try to fill these gaps
  • 71. • The architectural state comprises all data, status and control registers, including stack and instruction pointers • In an SMT processor this state is present multiple times, whereas execution resources like arithmetic units, caches, queues and memory interfaces are not duplicated • Due to the multiple architectural states, the CPU appears to be composed of several cores (“logical processors”) and can thus execute multiple instruction streams, or threads, in parallel • The idea is to fill bubbles in a pipeline caused by stalls in one thread with instructions from another • If there are multiple pipelines that can run in parallel and one thread leaves one or more of them idle, another thread can use them • Variants of SMT differ in how fine-grained the switching between threads can be performed on a pipeline
  • 72. • If the pipeline has to be flushed in order to support a new thread, only long stalls can be bridged profitably • SMT can enhance instruction throughput (instructions executed per cycle) if there is potential to intersperse instructions from multiple threads within or across pipelines, as with fine-grained two-way SMT • Well-optimized, floating-point-centric scientific code tends not to profit much from SMT, but there are exceptions • The number of outstanding memory references scales with the number of threads, so that full memory bandwidth can only be achieved if many threads are running concurrently.
  • 73. • The performance of a single thread is not improved, however, and there may even be a small performance hit for a single instruction stream if resources are hardwired to threads with SMT enabled • Since multiple threads share many resources, SMT can also lead to an increase in capacity misses if the code is sensitive to cache size • SMT may severely increase the cost of synchronization: if several threads on a physical core wait for some event to occur by executing tight, “spin-waiting” loops, they strongly compete for the shared execution units • This can lead to large synchronization latencies
  • 74. Outline • Vector processors • Design principles • Maximum performance estimates • Programming for vector architectures
  • 75. Vector processors • Vector processors such as the Cray 1 dominated supercomputing until RISC-based massively parallel machines became available • They target applications with extreme demands on memory bandwidth and time to solution • A high ratio of real application performance to peak performance is achieved for suitable “vectorizable” code • They follow the SIMD (Single Instruction Multiple Data) paradigm, which demands that a single machine instruction is automatically applied to a—presumably large—number of arguments of the same type, i.e., a vector • Compared to modern cache-based microprocessors, vector computers have much more massive parallelism built into their execution units
  • 76. • SIMD architectures can exploit significant data-level parallelism for: – matrix-oriented scientific computing – media-oriented image and sound processing • SIMD is more energy efficient than MIMD: it only needs to fetch one instruction per data operation, which makes SIMD attractive for personal mobile devices • SIMD allows the programmer to continue to think sequentially
  • 77. Design principles • Vector processors are register-to-register machines: machine instructions operate on vector registers which can hold a number of arguments, usually between 64 and 256 (double precision) • The width of a vector register is called the vector length Lv • There is a pipeline for each arithmetic operation like addition, multiplication, divide, etc., and each pipeline can deliver a certain number of results per cycle • For MULT and ADD pipes, this number varies between two and 16, and one speaks of a multitrack pipeline
  • 79. • Other operations like square root and divide are significantly more complex, hence the pipeline throughput is much lower for them • A vector processor with single-track pipes can achieve a similar peak performance per clock cycle as a superscalar cache-based microprocessor • To supply data to the vector registers, there are one or more load, store or combined load/store pipes connecting directly to main memory (e.g., on the NEC SX-9) • SIMD example: A(1:N) = B(1:N) + C(1:N) • On a cache-based microprocessor, the elements of A, B, and C would be processed in a loop; for each index, two loads, one addition and one store operation would have to be executed, together with the required integer and branch logic to perform the loop
  • 80. • On a vector processor the same computation requires only a few vector instructions:
    vload  V1(1:N) = B(1:N)
    vload  V2(1:N) = C(1:N)
    vadd   V3(1:N) = V1(1:N) + V2(1:N)
    vstore A(1:N) = V3(1:N)
  • V1, V2, and V3 denote vector registers • The original arrays are traversed in chunks of the vector length • An operation like vector addition does not have to wait until its argument vector registers are completely filled but can commence after some initial arguments are available • This feature is called chaining, and it allows different pipes (like MULT and ADD) to operate concurrently
  • 81. • This simplicity stems from the pre-RISC era, when multi-issue superscalar processors with fast instruction caches were unavailable • Feeding the arithmetic pipes with data requires a massively banked main memory layout, because memory chips need some cycles of recovery time after each access—the bank busy time • Current vector computers therefore provide thousands of memory banks • Writing a program so that the compiler can generate effective SIMD vector instructions is called vectorization
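  As a small illustration, here is a C loop that a vectorizing compiler can typically turn into SIMD/vector instructions; the restrict qualifiers encode the assumption that the arrays do not overlap, which is what makes the loop safely vectorizable:

  /* Unit-stride vector addition: no loop-carried dependencies, no aliasing */
  void vadd(double * restrict a, const double * restrict b,
            const double * restrict c, long n) {
      for (long i = 0; i < n; i++)
          a[i] = b[i] + c[i];            /* typically auto-vectorized by the compiler */
  }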
  • 82. Maximum performance estimates • The peak performance of a vector processor is given by the number of tracks for the ADD and MULT pipelines and its clock frequency • Example peak performance: 2 (ADD+MULT) × 4 (tracks) × 2 (GHz) = 16 GFlops/sec • Square root, divide and other operations are not considered in this peak figure • Memory bandwidth with a single four-track LD/ST pipeline, for reading or writing: 4 (tracks) × 2 (GHz) × 8 (bytes) = 64 GBytes/sec
  • 83. • The memory interface of vector CPUs often runs at the same frequency as the core, delivering more bandwidth in relation to peak performance than cache-based microprocessors • Nonvectorizable code, on the other hand, performs poorly on such machines • Performance can be predicted from architectural properties and loop code characteristics • Example: the vector triad performs three loads, one store and two flops (MULT+ADD) per iteration • With only a single LD/ST pipe, loads and stores—and even loads to different arrays—cannot overlap each other, but they can overlap arithmetic and be chained to the arithmetic pipes.
  • 84. • First a vector register must be loaded with data from array C. As the LD/ST pipe starts filling a vector register with data from array D, the MULT pipe can start performing arithmetic on C and D. • As soon as data from B is available, the ADD pipe can compute the final result, which is then stored to memory by the LD/ST pipe again.
  • 85. • Since each iteration needs four memory operations on the single LD/ST pipe but only two flops, the triad reaches 25% of peak performance, which amounts to 4 GFlops/sec • Programming for vector architectures: • The key prerequisite for vectorization is the lack of true data dependencies across the iterations of a loop • As with software pipelining, forward references are allowed but backward references inhibit vectorization (illustrated by the short sketch at the end of this section) • Branches in inner loop kernels are also a problem • Mask registers (essentially boolean registers of vector length) are provided that allow selective execution of loop iterations.
  do i = 1,N
    if (y(i) .le. 0.d0) then
      x(i) = s * y(i)
    else
      x(i) = y(i) * y(i)
    endif
  enddo
• A vector of boolean values is first generated from the branch conditional using the logic pipeline
  • 88. • This vector is then used to choose results from the if or else branch • Both branches are executed for all loop indices, which may be a waste of resources if expensive operations are involved • For a single branch (no else part), especially in cases where expensive operations like divide and square root are involved, the gather/scatter method is an efficient way to vectorize:
    do i = 1,N
      if (y(i) .ge. 0.d0) then
        x(i) = sqrt(y(i))
      endif
    enddo
  • All needed elements of the required argument vectors are first collected into vector registers (gather), then the vector operation is executed on them, and finally the results are stored back (scatter).
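  To illustrate the dependency criterion referred to earlier in this section, here is a brief C sketch (hypothetical loops, not taken from the slides):

  /* Backward reference: each iteration needs the result of the previous one.
     This true (loop-carried) dependency inhibits vectorization. */
  void recursive_scale(double *a, double s, long n) {
      for (long i = 1; i < n; i++)
          a[i] = s * a[i - 1];
  }

  /* Forward reference: the old value of a[i+1] is read before it is overwritten.
     This loop can be vectorized. */
  void shift_scale(double *a, double s, long n) {
      for (long i = 0; i < n - 1; i++)
          a[i] = s * a[i + 1];
  }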