1
Copyright © 2010, Elsevier Inc. All rights Reserved
Chapter 1
Why Parallel Computing?
An Introduction to Parallel Programming
Peter Pacheco
2
Copyright © 2010, Elsevier Inc. All rights Reserved
Roadmap
 Why we need ever-increasing performance.
 Why we’re building parallel systems.
 Why we need to write parallel programs.
 How do we write parallel programs?
 What we’ll be doing.
 Concurrent, parallel, distributed!
3
Changing times
Copyright © 2010, Elsevier Inc. All rights Reserved
 From 1986 to 2002, microprocessor
performance increased at an average rate
of about 50% per year.
 Since then, the rate has dropped to about
20% per year.
4
Changing times
 This change in the rate of performance
improvement has been associated with a dramatic
change in processor design.
 By 2005, manufacturers of microprocessors had
decided that the road to rapidly increasing
performance lay in the direction of parallelism.
 Rather than trying to continue to develop ever-
faster sequential processors, manufacturers
started putting multiple complete processors on
a single integrated circuit.
Copyright © 2010, Elsevier Inc. All rights Reserved
5
An intelligent solution
Copyright © 2010, Elsevier Inc. All rights Reserved
 Instead of designing and building faster
microprocessors, put multiple processors
on a single integrated circuit.
6
An intelligent solution
 This change has a very important consequence
for software developers:
 simply adding more processors will not improve the
performance of the vast majority of serial programs,
 that is, programs that were written to run on a single
processor.
 Such programs are unaware of the existence of
multiple processors, and the performance of
such a program on a system with multiple
processors will be the same as its performance
on a single processor of the multiprocessor
system.
Copyright © 2010, Elsevier Inc. All rights Reserved
7
Now it’s up to the programmers
 Adding more processors doesn’t help
much if programmers aren’t aware of
them…
 … or don’t know how to use them.
 Serial programs don’t benefit from this
approach (in most cases).
Copyright © 2010, Elsevier Inc. All rights Reserved
8
Questions
All of this raises a number of questions:
 1. Why do we care? Aren’t single processor systems fast
enough? After all, 20% per year is still a pretty significant
performance improvement.
 2. Why can’t microprocessor manufacturers continue to
develop much faster single processor systems? Why
build parallel systems? Why build systems with multiple
processors?
 3. Why can’t we write programs that will automatically
convert serial programs into parallel programs, that is,
programs that take advantage of the presence of multiple
processors?
Copyright © 2010, Elsevier Inc. All rights Reserved
9
Why we need ever-increasing
performance
 Computational power is increasing, but so
are our computation problems and needs.
 Problems we never dreamed of have been
solved because of past increases, such as
decoding the human genome.
 More complex problems are still waiting to
be solved.
Copyright © 2010, Elsevier Inc. All rights Reserved
10
Climate modeling
Copyright © 2010, Elsevier Inc. All rights Reserved
11
Climate modeling
 In order to better understand climate change, we
need far more accurate computer models,
 models that include interactions between the
atmosphere, the oceans, solid land, and the ice caps
at the poles.
 We also need to be able to make detailed
studies of how various interventions might affect
the global climate.
Copyright © 2010, Elsevier Inc. All rights Reserved
12
Protein folding
Copyright © 2010, Elsevier Inc. All rights Reserved
13
Protein folding.
 It’s believed that misfolded proteins may
be involved in diseases
 such as Huntington’s, Parkinson’s, and
Alzheimer’s,
 but our ability to study configurations of
complex molecules such as proteins is
severely limited by our current
computational power.
Copyright © 2010, Elsevier Inc. All rights Reserved
14
Drug discovery
Copyright © 2010, Elsevier Inc. All rights Reserved
15
Drug discovery.
 There are many ways in which increased computational
power can be used in research into new medical
treatments.
 For example, there are many drugs that are effective in
treating a relatively small fraction of those suffering from
some disease.
 It’s possible that we can devise alternative treatments by
careful analysis of the genomes of the individuals for
whom the known treatment is ineffective.
 This, however, will involve extensive computational
analysis of genomes.
Copyright © 2010, Elsevier Inc. All rights Reserved
16
Energy research
Copyright © 2010, Elsevier Inc. All rights Reserved
17
Energy research
 Increased computational power will make it
possible to program much more detailed
models of technologies
 such as wind turbines, solar cells, and
batteries.
 These programs may provide the
information needed to construct far more
efficient clean energy sources.
Copyright © 2010, Elsevier Inc. All rights Reserved
18
Data analysis
Copyright © 2010, Elsevier Inc. All rights Reserved
19
Data analysis
 We generate huge amounts of data.
 The quantity of data stored worldwide doubles every two
years [28], but the vast majority of it is largely useless
unless it’s analyzed.
 As an example, knowing the sequence of nucleotides in
human DNA is, by itself, of little use.
 Understanding how this sequence affects development
and how it can cause disease requires extensive
analysis.
 In addition to genomics, vast quantities of data are
generated by particle colliders such as the Large Hadron
Collider at CERN, medical imaging, astronomical
research, and Web search engines—to name a few.
Copyright © 2010, Elsevier Inc. All rights Reserved
20
Why we’re building parallel
systems
 Up to now, performance increases have
been attributable to increasing density of
transistors.
 But there are inherent problems.
Copyright © 2010, Elsevier Inc. All rights Reserved
21
Why we’re building parallel systems
 Much of the increase in single processor performance
has been driven by the ever-increasing density of
transistors on integrated circuits.
 As the size of transistors decreases, their speed can be
increased, and the overall speed of the integrated circuit
can be increased.
 However, as the speed of transistors increases, their
power consumption also increases.
 Most of this power is dissipated as heat, and when an
integrated circuit gets too hot, it becomes unreliable.
 Integrated circuits are reaching the limits of their ability to
dissipate heat [26].
Copyright © 2010, Elsevier Inc. All rights Reserved
22
A little physics lesson
 Smaller transistors = faster processors.
 Faster processors = increased power
consumption.
 Increased power consumption = increased
heat.
 Increased heat = unreliable processors.
Copyright © 2010, Elsevier Inc. All rights Reserved
23
Why do CPUs heat up?
 A computer's CPU works by either enabling
electric signals to pass through its microscopic
transistors or by blocking them.
 As electricity passes through the CPU or gets
blocked inside, it gets turned into heat energy.
 While a processor in a high-performance
workstation may run hot due to heavy use, an
overheating processor in an ordinary computer is
almost always a sign of a malfunction.
Copyright © 2010, Elsevier Inc. All rights Reserved
24
Solution
 Move away from single-core systems to
multicore processors.
 “core” = central processing unit (CPU)
Copyright © 2010, Elsevier Inc. All rights Reserved
 Introducing parallelism!!!
25
Parallelism
 How, then, can we exploit the continuing increase
in transistor density?
 The answer is parallelism.
 Rather than building ever-faster, more complex,
monolithic processors, the industry has decided to put
multiple, relatively simple, complete processors on a
single chip.
 Such integrated circuits are called multicore processors,
and core has become synonymous with central
processing unit, or CPU.
 In this setting a conventional processor with one CPU is
often called a single-core system.
Copyright © 2010, Elsevier Inc. All rights Reserved
26
Why we need to write parallel programs
 Most programs that have been written for conventional,
single-core systems cannot exploit the presence of
multiple cores.
 We can run multiple instances of a program on a
multicore system, but this is often of little help.
 For example, being able to run multiple instances of our
favorite game program isn’t really what we want—we
want the program to run faster with more realistic
graphics.
Copyright © 2010, Elsevier Inc. All rights Reserved
27
Why we need to write parallel programs
 In order to do this, we need to either
 rewrite our serial programs so that they’re parallel, so
that they can make use of multiple cores,
 or write translation programs, that is, programs that
will automatically convert serial programs into parallel
programs.
 The bad news is that researchers have had very
limited success writing programs that convert
serial programs in languages such as C and C++
into parallel programs.
Copyright © 2010, Elsevier Inc. All rights Reserved
28
Why we need to write parallel
programs
 Running multiple instances of a serial
program often isn’t very useful.
 Think of running multiple instances of your
favorite game.
 What you really want is for
it to run faster.
Copyright © 2010, Elsevier Inc. All rights Reserved
29
Approaches to the serial problem
 Rewrite serial programs so that they’re
parallel.
 Write translation programs that
automatically convert serial programs into
parallel programs.
 This is very difficult to do.
 Success has been limited.
Copyright © 2010, Elsevier Inc. All rights Reserved
30
More problems
 Some coding constructs can be
recognized by an automatic program
generator, and converted to a parallel
construct.
 However, it’s likely that the result will be a
very inefficient program.
 Sometimes the best parallel solution is to
step back and devise an entirely new
algorithm.
Copyright © 2010, Elsevier Inc. All rights Reserved
31
Example
 Compute n values and add them together.
 Serial solution:
Copyright © 2010, Elsevier Inc. All rights Reserved
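 A minimal sketch of the serial solution in C, assuming a hypothetical
Compute_next_value() function that produces each of the n values:

double sum = 0.0;
for (int i = 0; i < n; i++) {
    /* Compute_next_value() stands in for whatever computation
       produces the i-th value (reading it, computing it, etc.) */
    double x = Compute_next_value(/* ... */);
    sum += x;   /* accumulate into the single running total */
}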
32
Example (cont.)
 We have p cores, p much smaller than n.
 Each core performs a partial sum of
approximately n/p values.
Copyright © 2010, Elsevier Inc. All rights Reserved
Each core uses its own private variables
and executes this block of code (sketched
below) independently of the other cores.
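 A sketch of the block each core executes, assuming my_rank identifies
the core, p divides n evenly, and Compute_next_value() is the same
hypothetical function as in the serial version:

int my_n = n / p;                      /* values assigned to this core */
int my_first_i = my_rank * my_n;       /* start of this core's block   */
int my_last_i  = my_first_i + my_n;    /* one past the end             */
double my_sum = 0.0;
for (int my_i = my_first_i; my_i < my_last_i; my_i++) {
    double my_x = Compute_next_value(/* ... */);
    my_sum += my_x;   /* each core accumulates only its own partial sum */
}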
33
Example (cont.)
 Here the prefix my_ indicates that each
core is using its own private variables,
 and each core can execute this block of code
independently of the other cores.
Copyright © 2010, Elsevier Inc. All rights Reserved
34
Example (cont.)
 After each core completes execution of the
code, its private variable my_sum contains
the sum of the values computed by its calls
to Compute_next_value().
 E.g., with 8 cores and n = 24, the calls to
Compute_next_value() return:
Copyright © 2010, Elsevier Inc. All rights Reserved
1,4,3, 9,2,8, 5,1,1, 6,2,7, 2,5,0, 4,1,8, 6,5,1, 2,3,9
35
Example (cont.)
 Once all the cores are done computing
their private my_sum, they form a global
sum by sending their results to a designated
“master” core, which adds them to produce
the final result.
Copyright © 2010, Elsevier Inc. All rights Reserved
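 In the text this step is written as message-passing pseudocode along
the following lines; send and receive here are placeholders, not calls
from any particular library:

if (I'm the master core) {
    sum = my_sum;
    for each core other than myself {
        receive value from the core;
        sum += value;   /* master adds in every other core's partial sum */
    }
} else {
    send my_sum to the master core;
}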
36
Example (cont.)
Copyright © 2010, Elsevier Inc. All rights Reserved
my_sum
my_sum
37
Example (cont.)
Copyright © 2010, Elsevier Inc. All rights Reserved
Core     0   1   2   3   4   5   6   7
my_sum   8  19   7  15   7  13  12  14

Global sum:
8 + 19 + 7 + 15 + 7 + 13 + 12 + 14 = 95

Core     0   1   2   3   4   5   6   7
my_sum   8  19   7  15   7  13  12  14
sum     95   -   -   -   -   -   -   -
38
Copyright © 2010, Elsevier Inc. All rights Reserved
But wait!
There’s a much better way
to compute the global sum.
39
Better parallel algorithm
 But you can probably see a better way to
do this
 especially if the number of cores is large.
 Instead of making the master core do all
the work of computing the final sum,
 we can pair the cores so that while core 0
adds in the result of core 1, core 2 can add
in the result of core 3, core 4 can add in
the result of core 5 and so on.
Copyright © 2010, Elsevier Inc. All rights Reserved
40
Better parallel algorithm
 Don’t make the master core do all the
work.
 Share it among the other cores.
 Pair the cores so that core 0 adds its result
with core 1’s result.
 Core 2 adds its result with core 3’s result,
etc.
 Work with odd and even numbered pairs of
cores.
Copyright © 2010, Elsevier Inc. All rights Reserved
41
Better parallel algorithm (cont.)
 Repeat the process now with only the
even-ranked cores.
 Core 0 adds the result from core 2.
 Core 4 adds the result from core 6, etc.
 Now cores divisible by 4 repeat the
process, and so forth, until core 0 has the
final result.
Copyright © 2010, Elsevier Inc. All rights Reserved
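 A sketch of this tree-structured sum, assuming the number of cores p is
a power of two and my_rank identifies the core; send and receive are
again placeholders:

int divisor = 2;          /* size of the group combined in this round */
int core_difference = 1;  /* distance to this round's partner core    */
while (divisor <= p) {
    if (my_rank % divisor == 0) {
        /* receiver: add in the partner's partial sum and stay active */
        receive value from core (my_rank + core_difference);
        my_sum += value;
    } else {
        /* sender: pass the partial sum up the tree and drop out */
        send my_sum to core (my_rank - core_difference);
        break;
    }
    divisor *= 2;
    core_difference *= 2;
}
/* when the loop ends, core 0's my_sum holds the global sum */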
42
Multiple cores forming a global
sum
Copyright © 2010, Elsevier Inc. All rights Reserved
43
Analysis
 In the first example, the master core
performs 7 receives and 7 additions.
 In the second example, the master core
performs 3 receives and 3 additions.
 The improvement is more than a factor of 2!
Copyright © 2010, Elsevier Inc. All rights Reserved
44
Analysis (cont.)
 The difference is more dramatic with a
larger number of cores.
 If we have 1000 cores:
 The first example would require the master to
perform 999 receives and 999 additions.
 The second example would only require 10
receives and 10 additions.
 That’s an improvement of almost a factor
of 100!
Copyright © 2010, Elsevier Inc. All rights Reserved
45
Analysis
 It’s unlikely that a translation program
would “discover” the second global sum.
 Rather there would more likely be a
predefined efficient global sum that the
translation program would have access to.
 It could “recognize” the original serial loop
and replace it with a parallel global sum.
Copyright © 2010, Elsevier Inc. All rights Reserved
46
How do we write parallel
programs?
 Task parallelism
 Partition the various tasks carried out in solving
the problem among the cores.
 Data parallelism
 Partition the data used in solving the problem
among the cores.
 Each core carries out similar operations on its
part of the data.
Copyright © 2010, Elsevier Inc. All rights Reserved
47
Professor P
Copyright © 2010, Elsevier Inc. All rights Reserved
15 questions
300 exams
48
Professor P’s grading assistants
Copyright © 2010, Elsevier Inc. All rights Reserved
TA#1
TA#2 TA#3
49
Division of work –
data parallelism
Copyright © 2010, Elsevier Inc. All rights Reserved
TA#1
TA#2
TA#3
100 exams
100 exams
100 exams
50
Division of work –
task parallelism
Copyright © 2010, Elsevier Inc. All rights Reserved
TA#1
TA#2
TA#3
Questions 1 - 5
Questions 6 - 10
Questions 11 - 15
51
Division of work – data parallelism
 The first part of the global sum example would probably
be considered an example of data-parallelism.
 The data are the values computed by
Compute_next_value(), and each core carries out the
same operations on its assigned elements: it computes
the required values by calling Compute_next_value() and
adds them together.
Copyright © 2010, Elsevier Inc. All rights Reserved
52
Division of work – task parallelism
 The second part of the first global sum
example might be considered an example
of task-parallelism.
 There are two tasks: receiving and adding
the cores’ partial sums, which is carried
out by the master core, and giving the
partial sum to the master core, which is
carried out by the other cores.
Copyright © 2010, Elsevier Inc. All rights Reserved
53
Division of work –
task parallelism
Copyright © 2010, Elsevier Inc. All rights Reserved
Tasks
1) Receiving
2) Addition
54
Division of work
 When the cores can work independently, writing
a parallel program is much the same as writing a
serial program.
 Things get a good deal more complex when the
cores need to coordinate their work.
 In the second global sum example, although the
tree structure in the diagram is very easy to
understand, writing the actual code is relatively
complex.
 Unfortunately, it’s much more common for the
cores to need coordination.
Copyright © 2010, Elsevier Inc. All rights Reserved
55
Coordination
 Cores usually need to coordinate their work.
 Communication – one or more cores send
their current partial sums to another core.
 Load balancing – share the work evenly
among the cores so that one is not heavily
loaded.
 Synchronization – because each core works
at its own pace, make sure cores do not get
too far ahead of the rest.
Copyright © 2010, Elsevier Inc. All rights Reserved
56
What we’ll be doing
 Learning to write programs that are
explicitly parallel.
 Using the C language.
 Using four different extensions to C.
 Message-Passing Interface (MPI)
 POSIX Threads (Pthreads)
 OpenMP
 CUDA
Copyright © 2010, Elsevier Inc. All rights Reserved
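 As a preview, a minimal explicitly parallel version of the running sum
written with OpenMP; the loop uses i itself as a stand-in for the
hypothetical Compute_next_value():

#include <stdio.h>

int main(void) {
    int n = 24;
    double sum = 0.0;

    /* OpenMP splits the iterations among the cores and combines
       the per-thread partial sums into sum (a reduction). */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += i;   /* stand-in for Compute_next_value() */

    printf("sum = %f\n", sum);
    return 0;
}

Compile with an OpenMP-capable compiler (e.g., gcc -fopenmp); without
OpenMP support the pragma is ignored and the program simply runs serially.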
57
Type of parallel systems
 You may wonder why we’re learning four
different extensions to C instead of just
one.
 The answer has to do with both the
extensions and parallel systems.
 There are two main types of parallel
systems that we’ll be focusing on:
 shared-memory systems
 distributed-memory systems.
Copyright © 2010, Elsevier Inc. All rights Reserved
58
Type of parallel systems
 Shared-memory
 The cores can share access to the computer’s
memory.
 Coordinate the cores by having them examine
and update shared memory locations.
 Distributed-memory
 Each core has its own, private memory.
 The cores must communicate explicitly by
sending messages across a network.
Copyright © 2010, Elsevier Inc. All rights Reserved
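 For contrast, a distributed-memory sketch using MPI, in which each
process contributes a my_sum (here just its rank, purely for
illustration) and MPI_Reduce performs a global sum much like the
tree-structured example above:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
    int my_rank, p;
    double my_sum, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);  /* which process am I? */
    MPI_Comm_size(MPI_COMM_WORLD, &p);        /* how many processes? */

    my_sum = (double) my_rank;   /* stand-in for a real partial sum */

    /* combine every process's my_sum into sum on process 0 */
    MPI_Reduce(&my_sum, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (my_rank == 0)
        printf("global sum = %f\n", sum);

    MPI_Finalize();
    return 0;
}

Each process has its own private my_sum; all coordination happens
through explicit MPI calls rather than shared variables.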
59
Type of parallel systems
Copyright © 2010, Elsevier Inc. All rights Reserved
Shared-memory Distributed-memory
60
Terminology
 Concurrent computing – a program in
which multiple tasks can be in progress at
any instant.
 Parallel computing – a program in which
multiple tasks cooperate closely to solve a
problem.
 Distributed computing – a program that may
need to cooperate with other programs to
solve a problem.
Copyright © 2010, Elsevier Inc. All rights Reserved
61
Terminology
 So parallel and distributed programs are
concurrent, but a program such as a
multitasking operating system is also
concurrent, even when it is run on a
machine with only one core, since multiple
tasks can be in progress at any instant.
Copyright © 2010, Elsevier Inc. All rights Reserved
62
Terminology
 There isn’t a clear-cut distinction between
parallel and distributed programs, but a
parallel program usually runs multiple
tasks simultaneously on cores that are
physically close to each other and that
either share the same memory or are
connected by a very high-speed network.
Copyright © 2010, Elsevier Inc. All rights Reserved
63
Terminology
 On the other hand, distributed programs tend
to be more “loosely coupled.”
 The tasks may be executed by multiple
computers that are separated by large distances,
and the tasks themselves are often executed by
programs that were created independently.
 As examples, our two concurrent addition
programs would be considered parallel by most
authors, while a Web search program would be
considered distributed.
Copyright © 2010, Elsevier Inc. All rights Reserved
64
Terminology
 But beware, there isn’t general agreement
on these terms.
 For example, many authors consider
shared-memory programs to be “parallel”
and distributed-memory programs to be
“distributed.”
 As our title suggests, we’ll be interested in
parallel programs—programs in which
closely coupled tasks cooperate to solve a
problem.
Copyright © 2010, Elsevier Inc. All rights Reserved
65
Concluding Remarks (1)
 The laws of physics have brought us to the
doorstep of multicore technology.
 Serial programs typically don’t benefit from
multiple cores.
 Automatic parallel program generation
from serial program code isn’t the most
efficient approach to get high performance
from multicore computers.
Copyright © 2010, Elsevier Inc. All rights Reserved
66
Concluding Remarks (2)
 Learning to write parallel programs
involves learning how to coordinate the
cores.
 Parallel programs are usually very
complex and therefore require novel
programming techniques and development.
Copyright © 2010, Elsevier Inc. All rights Reserved