1
Copyright © 2010, Elsevier Inc. All rights Reserved
Chapter 1
Why Parallel Computing?
An Introduction to Parallel Programming
Peter Pacheco
2
Copyright © 2010, Elsevier Inc. All rights Reserved
Roadmap
 Why we need ever-increasing performance.
 Why we’re building parallel systems.
 Why we need to write parallel programs.
 How do we write parallel programs?
 What we’ll be doing.
 Concurrent, parallel, distributed!
3
Changing times
Copyright © 2010, Elsevier Inc. All rights Reserved
 From 1986 to 2002, microprocessor
performance increased at an average rate
of about 50% per year.
 Since then, the rate has dropped to about
20% per year.
4
Changing times
 This change in the rate of performance
improvement has been associated with a dramatic
change in processor design.
 By 2005, manufacturers of microprocessors had
decided that the road to rapidly increasing
performance lay in the direction of parallelism.
 Rather than trying to continue to develop ever-
faster sequential processors, manufacturers
started putting multiple complete processors on
a single integrated circuit.
Copyright © 2010, Elsevier Inc. All rights Reserved
5
An intelligent solution
Copyright © 2010, Elsevier Inc. All rights Reserved
 Instead of designing and building faster
microprocessors, put multiple processors
on a single integrated circuit.
6
An intelligent solution
 This change has a very important consequence
for software developers:
 simply adding more processors will not improve the
performance of the vast majority of serial programs,
 that is, programs that were written to run on a single
processor.
 Such programs are unaware of the existence of
multiple processors, and the performance of
such a program on a system with multiple
processors will be the same as its performance
on a single processor of the multiprocessor
system.
Copyright © 2010, Elsevier Inc. All rights Reserved
7
Now it’s up to the programmers
 Adding more processors doesn’t help
much if programmers aren’t aware of
them…
 … or don’t know how to use them.
 Serial programs don’t benefit from this
approach (in most cases).
Copyright © 2010, Elsevier Inc. All rights Reserved
8
Questions
All of this raises a number of questions:
 1. Why do we care? Aren’t single processor systems fast
enough? After all, 20% per year is still a pretty significant
performance improvement.
 2. Why can’t microprocessor manufacturers continue to
develop much faster single processor systems? Why
build parallel systems? Why build systems with multiple
processors?
 3. Why can’t we write programs that will automatically
convert serial programs into parallel programs, that is,
programs that take advantage of the presence of multiple
processors?
Copyright © 2010, Elsevier Inc. All rights Reserved
9
Why we need ever-increasing
performance
 Computational power is increasing, but so
are our computation problems and needs.
 Problems we never dreamed of have been
solved because of past increases, such as
decoding the human genome.
 More complex problems are still waiting to
be solved.
Copyright © 2010, Elsevier Inc. All rights Reserved
10
Climate modeling
Copyright © 2010, Elsevier Inc. All rights Reserved
11
Climate modeling
 In order to better understand climate change, we
need far more accurate computer models,
 models that include interactions between the
atmosphere, the oceans, solid land, and the ice caps
at the poles.
 We also need to be able to make detailed
studies of how various interventions might affect
the global climate.
Copyright © 2010, Elsevier Inc. All rights Reserved
12
Protein folding
Copyright © 2010, Elsevier Inc. All rights Reserved
13
Protein folding.
 It’s believed that misfolded proteins may
be involved in diseases
 such as Huntington’s, Parkinson’s, and
Alzheimer’s,
 but our ability to study configurations of
complex molecules such as proteins is
severely limited by our current
computational power.
Copyright © 2010, Elsevier Inc. All rights Reserved
14
Drug discovery
Copyright © 2010, Elsevier Inc. All rights Reserved
15
Drug discovery.
 There are many ways in which increased computational
power can be used in research into new medical
treatments.
 For example, there are many drugs that are effective in
treating a relatively small fraction of those suffering from
some disease.
 It’s possible that we can devise alternative treatments by
careful analysis of the genomes of the individuals for
whom the known treatment is ineffective.
 This, however, will involve extensive computational
analysis of genomes.
Copyright © 2010, Elsevier Inc. All rights Reserved
16
Energy research
Copyright © 2010, Elsevier Inc. All rights Reserved
17
Energy research
 Increased computational power will make it
possible to program much more detailed
models of technologies
 such as wind turbines, solar cells, and
batteries.
 These programs may provide the
information needed to construct far more
efficient clean energy sources.
Copyright © 2010, Elsevier Inc. All rights Reserved
18
Data analysis
Copyright © 2010, Elsevier Inc. All rights Reserved
19
Data analysis
 We generate huge amounts of data.
 The quantity of data stored worldwide doubles every two
years [28], but the vast majority of it is largely useless
unless it’s analyzed.
 As an example, knowing the sequence of nucleotides in
human DNA is, by itself, of little use.
 Understanding how this sequence affects development
and how it can cause disease requires extensive
analysis.
 In addition to genomics, vast quantities of data are
generated by particle colliders such as the Large Hadron
Collider at CERN, medical imaging, astronomical
research, and Web search engines—to name a few.
Copyright © 2010, Elsevier Inc. All rights Reserved
20
Why we’re building parallel
systems
 Up to now, performance increases have
been attributable to increasing density of
transistors.
 But there are inherent problems.
Copyright © 2010, Elsevier Inc. All rights Reserved
21
Why we’re building parallel systems
 Much of the increase in single processor performance
has been driven by the ever-increasing density of
transistors on integrated circuits.
 As the size of transistors decreases, their speed can be
increased, and the overall speed of the integrated circuit
can be increased.
 However, as the speed of transistors increases, their
power consumption also increases.
 Most of this power is dissipated as heat, and when an
integrated circuit gets too hot, it becomes unreliable.
 Integrated circuits are reaching the limits of their ability to
dissipate heat [26].
Copyright © 2010, Elsevier Inc. All rights Reserved
22
A little physics lesson
 Smaller transistors = faster processors.
 Faster processors = increased power
consumption.
 Increased power consumption = increased
heat.
 Increased heat = unreliable processors.
Copyright © 2010, Elsevier Inc. All rights Reserved
23
Why do CPUs heat up?
 A computer's CPU works by either enabling
electric signals to pass through its microscopic
transistors or by blocking them.
 As electricity passes through the CPU or gets
blocked inside, it gets turned into heat energy.
 While a processor in a high-performance
workstation may run hot due to heavy use, an
overheating processor in an ordinary computer is
almost always a sign of a malfunction.
Copyright © 2010, Elsevier Inc. All rights Reserved
24
Solution
 Move away from single-core systems to
multicore processors.
 “core” = central processing unit (CPU)
Copyright © 2010, Elsevier Inc. All rights Reserved
 Introducing parallelism!!!
25
Parallelism
 How, then, can we exploit the continuing increase
in transistor density?
 The answer is parallelism.
 Rather than building ever-faster, more complex,
monolithic processors, the industry has decided to put
multiple, relatively simple, complete processors on a
single chip.
 Such integrated circuits are called multicore processors,
and core has become synonymous with central
processing unit, or CPU.
 In this setting a conventional processor with one CPU is
often called a single-core system.
Copyright © 2010, Elsevier Inc. All rights Reserved
26
Why we need to write parallel programs
 Most programs that have been written for conventional,
single-core systems cannot exploit the presence of
multiple cores.
 We can run multiple instances of a program on a
multicore system, but this is often of little help.
 For example, being able to run multiple instances of our
favorite game program isn’t really what we want—we
want the program to run faster with more realistic
graphics.
Copyright © 2010, Elsevier Inc. All rights Reserved
27
Why we need to write parallel programs
 In order to do this, we need to either
 rewrite our serial programs so that they’re parallel, so
that they can make use of multiple cores,
 or write translation programs, that is, programs that
will automatically convert serial programs into parallel
programs.
 The bad news is that researchers have had very
limited success writing programs that convert
serial programs in languages such as C and C++
into parallel programs.
Copyright © 2010, Elsevier Inc. All rights Reserved
28
Why we need to write parallel
programs
 Running multiple instances of a serial
program often isn’t very useful.
 Think of running multiple instances of your
favorite game.
 What you really want is for
it to run faster.
Copyright © 2010, Elsevier Inc. All rights Reserved
29
Approaches to the serial problem
 Rewrite serial programs so that they’re
parallel.
 Write translation programs that
automatically convert serial programs into
parallel programs.
 This is very difficult to do.
 Success has been limited.
Copyright © 2010, Elsevier Inc. All rights Reserved
30
More problems
 Some coding constructs can be
recognized by an automatic program
generator, and converted to a parallel
construct.
 However, it’s likely that the result will be a
very inefficient program.
 Sometimes the best parallel solution is to
step back and devise an entirely new
algorithm.
Copyright © 2010, Elsevier Inc. All rights Reserved
31
Example
 Compute n values and add them together.
 Serial solution:
Copyright © 2010, Elsevier Inc. All rights Reserved
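 A minimal sketch of the serial solution in C, assuming a hypothetical
Compute_next_value() function that produces each of the n values:

double sum = 0.0;
for (int i = 0; i < n; i++) {
    /* Compute_next_value() stands in for whatever computation
       produces the i-th value (reading it, computing it, etc.) */
    double x = Compute_next_value(/* ... */);
    sum += x;   /* accumulate into the single running total */
}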
32
Example (cont.)
 We have p cores, p much smaller than n.
 Each core performs a partial sum of
approximately n/p values.
Copyright © 2010, Elsevier Inc. All rights Reserved
Each core uses its own private variables
and executes this block of code (sketched
below) independently of the other cores.
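 A sketch of the block each core executes, assuming my_rank identifies
the core, p divides n evenly, and Compute_next_value() is the same
hypothetical function as in the serial version:

int my_n = n / p;                      /* values assigned to this core */
int my_first_i = my_rank * my_n;       /* start of this core's block   */
int my_last_i  = my_first_i + my_n;    /* one past the end             */
double my_sum = 0.0;
for (int my_i = my_first_i; my_i < my_last_i; my_i++) {
    double my_x = Compute_next_value(/* ... */);
    my_sum += my_x;   /* each core accumulates only its own partial sum */
}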
33
Example (cont.)
 Here the prefix my_ indicates that each
core is using its own private variables,
 and each core can execute this block of code
independently of the other cores.
Copyright © 2010, Elsevier Inc. All rights Reserved
34
Example (cont.)
 After each core completes execution of the
code, its private variable my_sum contains
the sum of the values computed by its calls
to Compute_next_value().
 E.g., with 8 cores and n = 24, the calls to
Compute_next_value() return:
Copyright © 2010, Elsevier Inc. All rights Reserved
1,4,3, 9,2,8, 5,1,1, 6,2,7, 2,5,0, 4,1,8, 6,5,1, 2,3,9
35
Example (cont.)
 Once all the cores are done computing
their private my_sum, they form a global
sum by sending their results to a designated
“master” core, which adds them to produce
the final result.
Copyright © 2010, Elsevier Inc. All rights Reserved
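 In the text this step is written as message-passing pseudocode along
the following lines; send and receive here are placeholders, not calls
from any particular library:

if (I'm the master core) {
    sum = my_sum;
    for each core other than myself {
        receive value from the core;
        sum += value;   /* master adds in every other core's partial sum */
    }
} else {
    send my_sum to the master core;
}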
36
Example (cont.)
Copyright © 2010, Elsevier Inc. All rights Reserved
my_sum
my_sum
37
Example (cont.)
Copyright © 2010, Elsevier Inc. All rights Reserved
Core     0   1   2   3   4   5   6   7
my_sum   8  19   7  15   7  13  12  14

Global sum:
8 + 19 + 7 + 15 + 7 + 13 + 12 + 14 = 95

Core     0   1   2   3   4   5   6   7
my_sum   8  19   7  15   7  13  12  14
sum     95   -   -   -   -   -   -   -
38
Copyright © 2010, Elsevier Inc. All rights Reserved
But wait!
There’s a much better way
to compute the global sum.
39
Better parallel algorithm
 But you can probably see a better way to
do this
 especially if the number of cores is large.
 Instead of making the master core do all
the work of computing the final sum,
 we can pair the cores so that while core 0
adds in the result of core 1, core 2 can add
in the result of core 3, core 4 can add in
the result of core 5 and so on.
Copyright © 2010, Elsevier Inc. All rights Reserved
40
Better parallel algorithm
 Don’t make the master core do all the
work.
 Share it among the other cores.
 Pair the cores so that core 0 adds its result
with core 1’s result.
 Core 2 adds its result with core 3’s result,
etc.
 Work with odd and even numbered pairs of
cores.
Copyright © 2010, Elsevier Inc. All rights Reserved
41
Better parallel algorithm (cont.)
 Repeat the process now with only the
even-ranked cores.
 Core 0 adds the result from core 2.
 Core 4 adds the result from core 6, etc.
 Now cores divisible by 4 repeat the
process, and so forth, until core 0 has the
final result.
Copyright © 2010, Elsevier Inc. All rights Reserved
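 A sketch of this tree-structured sum, assuming the number of cores p is
a power of two and my_rank identifies the core; send and receive are
again placeholders:

int divisor = 2;          /* size of the group combined in this round */
int core_difference = 1;  /* distance to this round's partner core    */
while (divisor <= p) {
    if (my_rank % divisor == 0) {
        /* receiver: add in the partner's partial sum and stay active */
        receive value from core (my_rank + core_difference);
        my_sum += value;
    } else {
        /* sender: pass the partial sum up the tree and drop out */
        send my_sum to core (my_rank - core_difference);
        break;
    }
    divisor *= 2;
    core_difference *= 2;
}
/* when the loop ends, core 0's my_sum holds the global sum */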
42
Multiple cores forming a global
sum
Copyright © 2010, Elsevier Inc. All rights Reserved
43
Analysis
 In the first example, the master core
performs 7 receives and 7 additions.
 In the second example, the master core
performs 3 receives and 3 additions.
 The improvement is more than a factor of 2!
Copyright © 2010, Elsevier Inc. All rights Reserved
44
Analysis (cont.)
 The difference is more dramatic with a
larger number of cores.
 If we have 1000 cores:
 The first example would require the master to
perform 999 receives and 999 additions.
 The second example would only require 10
receives and 10 additions.
 That’s an improvement of almost a factor
of 100!
Copyright © 2010, Elsevier Inc. All rights Reserved
45
Analysis
 It’s unlikely that a translation program
would “discover” the second global sum.
 Rather there would more likely be a
predefined efficient global sum that the
translation program would have access to.
 It could “recognize” the original serial loop
and replace it with a parallel global sum.
Copyright © 2010, Elsevier Inc. All rights Reserved
46
How do we write parallel
programs?
 Task parallelism
 Partition the various tasks carried out in solving
the problem among the cores.
 Data parallelism
 Partition the data used in solving the problem
among the cores.
 Each core carries out similar operations on its
part of the data.
Copyright © 2010, Elsevier Inc. All rights Reserved
47
Professor P
Copyright © 2010, Elsevier Inc. All rights Reserved
15 questions
300 exams
48
Professor P’s grading assistants
Copyright © 2010, Elsevier Inc. All rights Reserved
TA#1
TA#2 TA#3
49
Division of work –
data parallelism
Copyright © 2010, Elsevier Inc. All rights Reserved
TA#1
TA#2
TA#3
100 exams
100 exams
100 exams
50
Division of work –
task parallelism
Copyright © 2010, Elsevier Inc. All rights Reserved
TA#1
TA#2
TA#3
Questions 1 - 5
Questions 6 - 10
Questions 11 - 15
51
Division of work – data parallelism
 The first part of the global sum example would probably
be considered an example of data-parallelism.
 The data are the values computed by
Compute_next_value(), and each core carries out the
same operations on its assigned elements: it computes
the required values by calling Compute_next_value() and
adds them together.
Copyright © 2010, Elsevier Inc. All rights Reserved
52
Division of work – task parallelism
 The second part of the first global sum
example might be considered an example
of task-parallelism.
 There are two tasks: receiving and adding
the cores’ partial sums, which is carried
out by the master core, and giving the
partial sum to the master core, which is
carried out by the other cores.
Copyright © 2010, Elsevier Inc. All rights Reserved
53
Division of work –
task parallelism
Copyright © 2010, Elsevier Inc. All rights Reserved
Tasks
1) Receiving
2) Addition
54
Division of work
 When the cores can work independently, writing
a parallel program is much the same as writing a
serial program.
 Things get a good deal more complex when the
cores need to coordinate their work.
 In the second global sum example, although the
tree structure in the diagram is very easy to
understand, writing the actual code is relatively
complex.
 Unfortunately, it’s much more common for the
cores to need coordination.
Copyright © 2010, Elsevier Inc. All rights Reserved
55
Coordination
 Cores usually need to coordinate their work.
 Communication – one or more cores send
their current partial sums to another core.
 Load balancing – share the work evenly
among the cores so that one is not heavily
loaded.
 Synchronization – because each core works
at its own pace, make sure cores do not get
too far ahead of the rest.
Copyright © 2010, Elsevier Inc. All rights Reserved
56
What we’ll be doing
 Learning to write programs that are
explicitly parallel.
 Using the C language.
 Using four different extensions to C.
 Message-Passing Interface (MPI)
 POSIX Threads (Pthreads)
 OpenMP
 CUDA
Copyright © 2010, Elsevier Inc. All rights Reserved
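 As a preview, a minimal explicitly parallel version of the running sum
written with OpenMP; the loop uses i itself as a stand-in for the
hypothetical Compute_next_value():

#include <stdio.h>

int main(void) {
    int n = 24;
    double sum = 0.0;

    /* OpenMP splits the iterations among the cores and combines
       the per-thread partial sums into sum (a reduction). */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += i;   /* stand-in for Compute_next_value() */

    printf("sum = %f\n", sum);
    return 0;
}

Compile with an OpenMP-capable compiler (e.g., gcc -fopenmp); without
OpenMP support the pragma is ignored and the program simply runs serially.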
57
Type of parallel systems
 You may wonder why we’re learning four
different extensions to C instead of just
one.
 The answer has to do with both the
extensions and parallel systems.
 There are two main types of parallel
systems that we’ll be focusing on:
 shared-memory systems
 distributed-memory systems.
Copyright © 2010, Elsevier Inc. All rights Reserved
58
Type of parallel systems
 Shared-memory
 The cores can share access to the computer’s
memory.
 Coordinate the cores by having them examine
and update shared memory locations.
 Distributed-memory
 Each core has its own, private memory.
 The cores must communicate explicitly by
sending messages across a network.
Copyright © 2010, Elsevier Inc. All rights Reserved
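 For contrast, a distributed-memory sketch using MPI, in which each
process contributes a my_sum (here just its rank, purely for
illustration) and MPI_Reduce performs a global sum much like the
tree-structured example above:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
    int my_rank, p;
    double my_sum, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);  /* which process am I? */
    MPI_Comm_size(MPI_COMM_WORLD, &p);        /* how many processes? */

    my_sum = (double) my_rank;   /* stand-in for a real partial sum */

    /* combine every process's my_sum into sum on process 0 */
    MPI_Reduce(&my_sum, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (my_rank == 0)
        printf("global sum = %f\n", sum);

    MPI_Finalize();
    return 0;
}

Each process has its own private my_sum; all coordination happens
through explicit MPI calls rather than shared variables.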
59
Type of parallel systems
Copyright © 2010, Elsevier Inc. All rights Reserved
Shared-memory Distributed-memory
60
Terminology
 Concurrent computing – a program in
which multiple tasks can be in progress at
any instant.
 Parallel computing – a program in which
multiple tasks cooperate closely to solve a
problem.
 Distributed computing – a program that may
need to cooperate with other programs to
solve a problem.
Copyright © 2010, Elsevier Inc. All rights Reserved
61
Terminology
 So parallel and distributed programs are
concurrent, but a program such as a
multitasking operating system is also
concurrent, even when it is run on a
machine with only one core, since multiple
tasks can be in progress at any instant.
Copyright © 2010, Elsevier Inc. All rights Reserved
62
Terminology
 There isn’t a clear-cut distinction between
parallel and distributed programs, but a
parallel program usually runs multiple
tasks simultaneously on cores that are
physically close to each other and that
either share the same memory or are
connected by a very high-speed network.
Copyright © 2010, Elsevier Inc. All rights Reserved
63
Terminology
 On the other hand, distributed programs tend
to be more “loosely coupled.”
 The tasks may be executed by multiple
computers that are separated by large distances,
and the tasks themselves are often executed by
programs that were created independently.
 As examples, our two concurrent addition
programs would be considered parallel by most
authors, while a Web search program would be
considered distributed.
Copyright © 2010, Elsevier Inc. All rights Reserved
64
Terminology
 But beware, there isn’t general agreement
on these terms.
 For example, many authors consider
shared-memory programs to be “parallel”
and distributed-memory programs to be
“distributed.”
 As our title suggests, we’ll be interested in
parallel programs—programs in which
closely coupled tasks cooperate to solve a
problem.
Copyright © 2010, Elsevier Inc. All rights Reserved
65
Concluding Remarks (1)
 The laws of physics have brought us to the
doorstep of multicore technology.
 Serial programs typically don’t benefit from
multiple cores.
 Automatic parallel program generation
from serial program code isn’t the most
efficient approach to get high performance
from multicore computers.
Copyright © 2010, Elsevier Inc. All rights Reserved
66
Concluding Remarks (2)
 Learning to write parallel programs
involves learning how to coordinate the
cores.
 Parallel programs are usually very
complex and therefore require novel
programming techniques and development.
Copyright © 2010, Elsevier Inc. All rights Reserved