Parallel Computing
Mohamed Zahran (aka Z)
mzahran@cs.nyu.edu
http://www.mzahran.com
CSCI-UA.0480-003
MPI - III
Many slides of this
lecture are adapted
and slightly modified from:
• Gerassimos Barlas
• Peter S. Pacheco
Collective vs. point-to-point
Data distributions
Copyright © 2010, Elsevier Inc.
All rights Reserved
Sequential version
Different partitions of a 12-component
vector among 3 processes
• Block: Assign blocks of consecutive components to each process.
• Cyclic: Assign components in a round robin fashion.
• Block-cyclic: Use a cyclic distribution of blocks of components.
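To make the three schemes concrete, here is a small illustrative sketch (my own, not from the slides) that computes which process owns component i of an n-component vector under each distribution; comm_sz is the number of processes and blk is the block size used by the block-cyclic scheme.

```c
/* Illustrative sketch: which process owns component i of an n-component
   vector under each distribution?  comm_sz = number of processes,
   blk = block size for the block-cyclic scheme. */
int block_owner(int i, int n, int comm_sz) {
    return i / (n / comm_sz);          /* assumes comm_sz evenly divides n */
}
int cyclic_owner(int i, int comm_sz) {
    return i % comm_sz;                /* round robin */
}
int block_cyclic_owner(int i, int comm_sz, int blk) {
    return (i / blk) % comm_sz;        /* cyclic distribution of blocks */
}
```

For the 12-component vector on 3 processes, block gives process 0 components 0-3, cyclic gives it 0, 3, 6, 9, and block-cyclic with blk = 2 gives it 0, 1, 6, 7.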
Parallel implementation of
vector addition
Copyright © 2010, Elsevier Inc.
All rights Reserved
How will you distribute parts of x[] and y[] to processes?
Scatter
• Read an entire vector on process 0
• MPI_Scatter sends the needed
components to each of the other
processes.
send_count: # data items going
to each process
Important:
• All arguments are important for the source process (process 0 in our example)
• For all other processes, only recv_buf_p, recv_count, recv_type, src_proc,
and comm are important
Reading and distributing a vector
Copyright © 2010, Elsevier Inc.
All rights Reserved
Note that process 0 itself
also receives data.
• send_buf_p
– is not used except by the sender.
– However, it must be defined or NULL on others to make the
code correct.
– Must have at least communicator size * send_count elements
• All processes must call MPI_Scatter, not only the sender.
• send_count is the number of data items sent to each process.
• recv_buf_p must have at least send_count elements
• MPI_Scatter uses a block distribution.
Example: the 9-element vector 0 1 2 3 4 5 6 7 8 on Process 0 is scattered in blocks of three:
Process 0 gets 0 1 2
Process 1 gets 3 4 5
Process 2 gets 6 7 8
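A minimal sketch of this pattern (my own code, not the slides'): process 0 reads the whole vector and MPI_Scatter hands each process its block, assuming comm_sz evenly divides n.

```c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(void) {
    int comm_sz, my_rank, n = 12;
    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    int local_n = n / comm_sz;            /* assumes comm_sz evenly divides n */
    double *a = NULL;
    double *local_a = malloc(local_n * sizeof(double));

    if (my_rank == 0) {                   /* only the source reads the vector */
        a = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++) a[i] = i;
    }

    /* send_count (local_n) is the number of items going to EACH process */
    MPI_Scatter(a, local_n, MPI_DOUBLE,
                local_a, local_n, MPI_DOUBLE,
                0, MPI_COMM_WORLD);

    printf("Process %d: first local element = %.1f\n", my_rank, local_a[0]);

    free(local_a);
    if (my_rank == 0) free(a);
    MPI_Finalize();
    return 0;
}
```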
Gather
• MPI_Gather collects all of the
components of the vector onto the
destination process (dest_proc), ordered by rank.
Important:
• All arguments are important for the destination process.
• For all other processes, only send_buf_p, send_count, send_type, dest_proc,
and comm are important
recv_count: number of elements
for any single receive
send_count: number of elements
in send_buf_p
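Going the other way, a hedged sketch of gathering the distributed blocks back onto process 0 (the function name and structure are my own):

```c
#include <stdlib.h>
#include <mpi.h>

/* Sketch: collect each process's block of local_n doubles on process 0,
   in rank order.  Returns the full vector on rank 0, NULL elsewhere. */
double *gather_on_root(double local_a[], int local_n,
                       int my_rank, int comm_sz, MPI_Comm comm) {
    double *a = NULL;
    if (my_rank == 0)                          /* only the destination needs a buffer */
        a = malloc(comm_sz * local_n * sizeof(double));

    MPI_Gather(local_a, local_n, MPI_DOUBLE,   /* send_count: elements in send_buf_p */
               a,       local_n, MPI_DOUBLE,   /* recv_count: for any single receive */
               0, comm);                       /* dest_proc */
    return a;
}
```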
Print a distributed vector (1)
Copyright © 2010, Elsevier Inc.
All rights Reserved
Print a distributed vector (2)
Copyright © 2010, Elsevier Inc.
All rights Reserved
Allgather
• Concatenates the contents of each
process’ send_buf_p and stores this in
each process’ recv_buf_p.
• As usual, recv_count is the amount of data
being received from each process.
Copyright © 2010, Elsevier Inc.
All rights Reserved
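As a sketch (illustrative, assuming a block distribution in which every process holds local_n components), the call looks like this; afterwards every process has the complete vector in x:

```c
#include <mpi.h>

/* Sketch: every process contributes its block of local_n doubles and every
   process receives the full concatenated vector in x. */
void allgather_vector(double local_x[], int local_n, double x[], MPI_Comm comm) {
    /* recv_count is the amount of data received from EACH process */
    MPI_Allgather(local_x, local_n, MPI_DOUBLE,
                  x,       local_n, MPI_DOUBLE,
                  comm);
}
```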
Matrix-vector multiplication
Copyright © 2010, Elsevier Inc.
All rights Reserved
The i-th component of y is the dot product of the i-th row of A with x.
Matrix-vector multiplication
Pseudo-code Serial Version
C style arrays
Copyright © 2010, Elsevier Inc.
All rights Reserved
A two-dimensional array in C is stored as a one-dimensional array in row-major order: row 0 first, then row 1, and so on.
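Using that layout, a serial sketch of y = A x might look like this (my own code, in the spirit of the slides' pseudo-code, assuming an m-by-n matrix so element (i, j) sits at A[i*n + j]):

```c
/* Serial matrix-vector multiply y = A*x, with the m x n matrix A stored
   row-major as a one-dimensional array of m*n doubles. */
void mat_vect_mult(double A[], double x[], double y[], int m, int n) {
    for (int i = 0; i < m; i++) {
        y[i] = 0.0;
        for (int j = 0; j < n; j++)
            y[i] += A[i * n + j] * x[j];   /* dot product of row i with x */
    }
}
```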
Serial matrix-vector multiplication
Let’s assume x[] is distributed among the different processes
An MPI matrix-vector
multiplication function (1)
Copyright © 2010, Elsevier Inc.
All rights Reserved
An MPI matrix-vector
multiplication function (2)
Copyright © 2010, Elsevier Inc.
All rights Reserved
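A rough reconstruction of how such a function might be organized (not the exact code on the slides): each process owns local_m rows of A and local_n components of x, an MPI_Allgather assembles the full x, and each process then computes its own rows of y.

```c
#include <stdlib.h>
#include <mpi.h>

/* Sketch: parallel y = A*x.  local_A holds this process's local_m rows
   (row-major, n columns), local_x its local_n components of x, and
   local_y receives its local_m components of y. */
void mpi_mat_vect_mult(double local_A[], double local_x[], double local_y[],
                       int local_m, int n, int local_n, MPI_Comm comm) {
    double *x = malloc(n * sizeof(double));

    /* every process needs all of x for its dot products */
    MPI_Allgather(local_x, local_n, MPI_DOUBLE,
                  x,       local_n, MPI_DOUBLE,
                  comm);

    for (int i = 0; i < local_m; i++) {
        local_y[i] = 0.0;
        for (int j = 0; j < n; j++)
            local_y[i] += local_A[i * n + j] * x[j];
    }
    free(x);
}
```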
Keep in mind …
• In distributed memory systems,
communication is more expensive than
computation.
• Distributing a fixed amount of data
among several messages is more
expensive than sending a single big
message.
Derived datatypes
• Used to represent any collection of data
items by storing both the types of the items
and their relative locations in memory.
• If a function that sends data knows this
information about a collection of data items,
it can collect the items from memory before
they are sent.
• A function that receives data can distribute
the items into their correct destinations in
memory when they’re received.
Copyright © 2010, Elsevier Inc.
All rights Reserved
Derived datatypes
• A sequence of basic MPI data types
together with a displacement for each
of the data types.
Copyright © 2010, Elsevier Inc.
All rights Reserved
Example: a and b are double; n is int. Each variable has an address in memory where it is stored, and a displacement measured from the beginning of the type (we assume the type starts with a).
MPI_Type_create_struct
• Builds a derived datatype that consists
of individual elements that have
different basic types.
Displacements are measured from the address of item 0 and have type MPI_Aint, an integer type that is big enough to store an address on the system. The first argument, count, gives the number of elements in the type.
Before you start using your new datatype, call MPI_Type_commit.
This allows the MPI implementation to
optimize its internal representation of
the datatype for use in communication
functions.
When you are finished with your new type, call MPI_Type_free.
This frees any additional storage used.
Copyright © 2010, Elsevier Inc.
All rights Reserved
Example (1)
Copyright © 2010, Elsevier Inc.
All rights Reserved
Example (2)
Copyright © 2010, Elsevier Inc.
All rights Reserved
Example (3)
Copyright © 2010, Elsevier Inc.
All rights Reserved
The receiving end can use
the received complex data
item as if it were a structure.
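A hedged reconstruction of the whole pattern for the two doubles a, b and the int n from the earlier slide (my own code, not necessarily the slides' exact example): build the type, commit it, use it for one broadcast, then free it.

```c
#include <mpi.h>

/* Sketch: build a derived type describing {a, b, n}, commit it, use it to
   broadcast all three variables in a single message, then free it. */
void bcast_input(double *a_p, double *b_p, int *n_p, MPI_Comm comm) {
    int          blocklengths[3] = {1, 1, 1};
    MPI_Datatype types[3] = {MPI_DOUBLE, MPI_DOUBLE, MPI_INT};
    MPI_Aint     displs[3], a_addr, b_addr, n_addr;
    MPI_Datatype input_mpi_t;

    /* displacements are measured from the address of the first item, a */
    MPI_Get_address(a_p, &a_addr);
    MPI_Get_address(b_p, &b_addr);
    MPI_Get_address(n_p, &n_addr);
    displs[0] = 0;
    displs[1] = b_addr - a_addr;
    displs[2] = n_addr - a_addr;

    MPI_Type_create_struct(3, blocklengths, displs, types, &input_mpi_t);
    MPI_Type_commit(&input_mpi_t);           /* must commit before first use */

    MPI_Bcast(a_p, 1, input_mpi_t, 0, comm); /* one message carries a, b, and n */

    MPI_Type_free(&input_mpi_t);             /* release any extra storage */
}
```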
MEASURING TIME IN MPI
We have seen in the past …
• time in Linux
• clock() inside your code
• Does MPI offer anything else?
Elapsed parallel time
• MPI_Wtime returns the number of seconds that
have elapsed since some time in the past.
• It reports elapsed (wall-clock) time for
the calling process.
How to Sync Processes?
MPI_Barrier
• Ensures that no process will return from
calling it until every process in the
communicator has started calling it.
Copyright © 2010, Elsevier Inc.
All rights Reserved
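The two are often combined into a timing pattern like the following sketch (my own, following the common idiom): a barrier so all processes start together, MPI_Wtime around the work, and a reduction so process 0 reports the slowest process's time, which is what determines the parallel run-time.

```c
#include <stdio.h>
#include <mpi.h>

/* Sketch: time work() on every process and report the maximum elapsed time. */
void timed_run(void (*work)(void), MPI_Comm comm) {
    int my_rank;
    double start, finish, local_elapsed, elapsed;
    MPI_Comm_rank(comm, &my_rank);

    MPI_Barrier(comm);                  /* everyone starts (roughly) together */
    start = MPI_Wtime();
    work();                             /* the code being timed */
    finish = MPI_Wtime();

    local_elapsed = finish - start;
    MPI_Reduce(&local_elapsed, &elapsed, 1, MPI_DOUBLE, MPI_MAX, 0, comm);
    if (my_rank == 0)
        printf("Elapsed time = %e seconds\n", elapsed);
}
```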
Let's see how we can analyze the
performance of an MPI program,
using the matrix-vector multiplication as our example.
Run-times of serial and parallel
matrix-vector multiplication (in seconds)
Copyright © 2010, Elsevier Inc.
All rights Reserved
Speedups of Parallel Matrix-
Vector Multiplication
Copyright © 2010, Elsevier Inc.
All rights Reserved
Efficiencies of Parallel Matrix-
Vector Multiplication
Copyright © 2010, Elsevier Inc.
All rights Reserved
Conclusions
• Reducing messages is a good
performance strategy!
– Collective vs point-to-point
• Distributing a fixed amount of data
among several messages is more
expensive than sending a single big
message.