Introduction to OpenMP
OpenMP : What is it?
OpenMP is an Application Program Interface (API) for
• explicit
• portable
• shared-memory parallel programming
• in C/C++ and Fortran.
OpenMP consists of
• compiler directives,
• runtime calls and
• environment variables.
It is supported by all major compilers on Unix and Windows platforms:
GNU, IBM, Oracle, Intel, PGI, Absoft, Lahey/Fujitsu, PathScale, HP, MS, Cray
OpenMP Programming Model
➢ Designed for multi-processor/core, shared
memory machines.
➢ OpenMP programs accomplish parallelism
exclusively through the use of threads.
➢ Programmer has full control over
parallelization.
➢ Consists of a set of #pragmas (Compiler
Instructions/ Directives) that control how the
program works.
OpenMP: Core Elements
 Directives & Pragmas
▪ Forking Threads (parallel region)
▪ Work Sharing
▪ Synchronization
▪ Data Environment
 User level runtime functions & Env. variables
Thread Creation/Fork-Join
All OpenMP programs begin as a single process: the master
thread.
The master thread executes sequentially until the
first parallel region construct is encountered.
FORK: the master thread then creates a team of
parallel threads.
The statements in the program that are enclosed by the
parallel region construct are then executed in parallel
among the various team threads.
JOIN: When the team threads complete the statements in
the parallel region construct, they synchronize and
terminate, leaving only the master thread.
Thread Creation/Fork-Join
Master thread spawns a team of threads as needed.
Parallelism is added incrementally until performance goals are met, i.e. the sequential program evolves into a parallel program.
OpenMP Run Time Variables
❖Modify/check/get info about the number of threads
omp_get_num_threads() //number of threads in use
omp_get_thread_num() //tells which thread you are
omp_get_max_threads() //max threads that can be used
❖Are we in a parallel region?
omp_in_parallel()
❖How many processors in the system?
omp_get_num_procs()
❖Explicit locks
omp_[set|unset]_lock()
And several more...
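As a quick illustration (not from the original slides), here is a hedged sketch of how the explicit lock calls fit together; the counter variable and the printed messages are illustrative only:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    omp_lock_t lock;
    int shared_counter = 0;                 /* illustrative shared data */

    omp_init_lock(&lock);                   /* create the lock before the parallel region */

    #pragma omp parallel
    {
        omp_set_lock(&lock);                /* only one thread at a time passes this point */
        shared_counter++;
        printf("Thread %d of %d incremented the counter\n",
               omp_get_thread_num(), omp_get_num_threads());
        omp_unset_lock(&lock);              /* release so the next thread can proceed */
    }

    omp_destroy_lock(&lock);
    printf("Final value: %d\n", shared_counter);
    return 0;
}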
OpenMP: Few Syntax Details
❖Most of the constructs in OpenMP are compiler directives or
pragmas
For C/C++ the pragmas take the form
#pragma omp construct [clause [clause]…]
For Fortran, the directives take one of the forms
C$OMP construct [clause [clause]…]
!$OMP construct [clause [clause]…]
*$OMP construct [clause [clause]…]
❖Header File or Fortran 90 module
#include <omp.h>
use omp_lib
Parallel Region and basic functions
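The original slide here is an image; the following is a minimal sketch (an assumed example, not the slide's own code) of a parallel region using the basic runtime functions introduced above:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    printf("Before the parallel region: only the master thread\n");

    #pragma omp parallel                        /* FORK: a team of threads is created here */
    {
        int id       = omp_get_thread_num();    /* this thread's ID within the team */
        int nthreads = omp_get_num_threads();   /* size of the team                 */
        printf("Hello from thread %d of %d\n", id, nthreads);
    }                                           /* JOIN: implicit barrier, team ends */

    printf("After the parallel region: back to the master thread\n");
    return 0;
}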
Compiling OpenMP code
❖Same code can run on single-core or multi-core machines
❖Compiler directives are picked up ONLY when the program is compiled in OpenMP mode.
❖Method depends on the compiler
G++
$ g++ -o foo foo.c -fopenmp
ICC
$ icc -o foo foo.c -fopenmp
Running OpenMP code
❖Controlling the number of threads at runtime
 The default number of threads = number of online processors on the machine.
 C shell : setenv OMP_NUM_THREADS number
 Bash shell: export OMP_NUM_THREADS=number
 Runtime OpenMP function omp_set_num_threads(4)
 Clause in #pragma for the parallel region, e.g. #pragma omp parallel num_threads(4)
❖Execution Timing #include <omp.h>
double stime = omp_get_wtime();
longfunction();
double etime = omp_get_wtime();
double total = etime - stime;
Thread Creation/Fork-Join
To create a 4-thread parallel region in which each thread calls pooh(ID, A) for ID = 0 to 3 (see the sketch below):
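A hedged sketch of that example follows; pooh(ID, A) is the routine named on the slide, and its body below is only a placeholder:

#include <omp.h>
#include <stdio.h>

#define N 1000                              /* assumed size of the array A */

void pooh(int ID, double *A)                /* placeholder body: the slide only names pooh(ID, A) */
{
    printf("Thread %d working on its share of A\n", ID);
}

int main(void)
{
    double A[N];

    omp_set_num_threads(4);                 /* request a team of 4 threads               */
    #pragma omp parallel                    /* FORK: each team thread runs this block    */
    {
        int ID = omp_get_thread_num();      /* ID = 0, 1, 2, 3                           */
        pooh(ID, A);                        /* every thread calls pooh with its own ID   */
    }                                       /* JOIN: wait for all threads, then continue */
    return 0;
}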
OpenMP: Core Elements
 Directives & Pragmas
▪ Forking Threads (parallel region)
▪ Work Sharing
▪ Synchronization
▪ Data Environment
 User level runtime functions & Env. variables
Data vs. Task Parallelism
Data parallelism
A large number of data elements, where each element (or possibly a subset of elements) needs to be processed to produce a result. When this processing can be done in parallel, we have data parallelism.
Task parallelism
A collection of tasks that need to be completed. If these tasks can be performed in parallel, you are faced with a task-parallel job.
OpenMP: Work Sharing
A work-sharing construct divides the execution of the enclosed code region among the different threads of the team.
Categories of work sharing in OpenMP:
• omp for
• omp sections
Threads are assigned
independent sets of iterations.
Threads must wait at the end
of the work sharing construct.
#pragma omp for
#pragma omp parallel for
Work Sharing: omp for
Schedule Clause
Data Sharing/Scope
Schedule Clause
How is the work divided among threads?
Directives for work distribution (a sketch follows below)
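The slide content here is an image; below is a hedged sketch of how the schedule clause is typically written (the loop bodies and the chunk size of 100 are assumptions, not the slide's example):

#define N 1000

void scale(double *a, double *b)
{
    /* static: iterations are split into fixed chunks of 100 and dealt out round-robin */
    #pragma omp parallel for schedule(static, 100)
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i];

    /* dynamic: each thread grabs the next chunk of 100 iterations when it becomes free,
       which helps when iterations take different amounts of time                       */
    #pragma omp parallel for schedule(dynamic, 100)
    for (int i = 0; i < N; i++)
        a[i] = a[i] / (b[i] + 1.0);
}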
OpenMP for Parallelization
for (int i = 2; i < 10; i++)
{
    x[i] = a * x[i-1] + b;
}
Can all loops be parallelized?
Loop iterations have to be independent.
Simple Test: If the results differ when the code is executed backwards, the loop cannot be parallelized!
Between 2 synchronization points, if at least 1 thread writes to a memory location that at least 1 other thread reads from => the result is non-deterministic.
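A short sketch (not from the slides, and assuming x, y, a and b are suitably declared) contrasting the recurrence above, which cannot be parallelized, with an independent loop that can:

/* NOT parallelizable: iteration i reads x[i-1], which iteration i-1 writes */
for (int i = 2; i < 10; i++)
    x[i] = a * x[i-1] + b;

/* Parallelizable: every iteration touches only its own x[i] and y[i] */
#pragma omp parallel for
for (int i = 2; i < 10; i++)
    y[i] = a * x[i] + b;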
Work Sharing: sections
The SECTIONS directive is a non-iterative work-sharing construct.
➢ It specifies that the enclosed section(s) of code are to be
divided among the threads in the team.
➢ Each SECTION is executed ONCE by a thread in the
team.
Work Sharing: sections
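The code on this slide is an image; a minimal sketch of the usual pattern follows (compute_x and compute_y are placeholder functions, not the slide's own example):

#pragma omp parallel sections
{
    #pragma omp section
    compute_x();            /* executed once, by some thread in the team             */

    #pragma omp section
    compute_y();            /* executed once, possibly by a different thread         */
}                           /* implicit barrier at the end of the sections construct */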
OpenMP: Core Elements
 Directives & Pragmas
▪ Forking Threads (parallel region)
▪ Work Sharing
▪ Synchronization
▪ Data Environment
 User level runtime functions & Env. variables
Synchronization Constructs
Synchronization is achieved by
1) Barriers (Task Dependencies)
Implicit : Sync points exist at the end of
parallel – necessary barrier – can't be removed
for – can be removed by using the nowait clause
sections – can be removed by using the nowait clause
single – can be removed by using the nowait clause
Explicit : Must be used when ordering is required
#pragma omp barrier
Each thread waits until all threads arrive at the barrier.
Synchronization: Barrier
Explicit Barrier: #pragma omp barrier
Implicit Barrier at the end of a parallel region
No Barrier: the nowait clause cancels barrier creation
(A sketch follows below.)
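A hedged sketch (assumed example, not the slide's code) of an explicit barrier and a nowait clause; n, a, b, work_a and independent_work are placeholders:

#pragma omp parallel
{
    int id = omp_get_thread_num();

    #pragma omp for nowait                  /* no barrier here: threads move on immediately */
    for (int i = 0; i < n; i++)
        a[i] = work_a(i);

    independent_work(id);                   /* safe only because it does not read a[]       */

    #pragma omp barrier                     /* explicit barrier: wait until a[] is complete */

    #pragma omp for                         /* implicit barrier at the end of this for      */
    for (int i = 0; i < n; i++)
        b[i] = a[i] + a[n - 1 - i];         /* needs all of a[], hence the barrier above    */
}                                           /* implicit barrier at the end of the parallel region */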
Data Dependencies
OpenMP assumes that there is NO data-
dependency across jobs running in parallel
When the omp parallel directive is placed around
a code block, it is the programmer’s
responsibility to make sure data dependency is
ruled out
Race Condition
Non-deterministic behaviour: two or more threads access a shared variable at the same time, and at least one of them writes to it.
Both threads A and B may be executing the same update simultaneously (a small sketch follows below).
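A small sketch (assumed example) that exhibits the race; the variable name and loop count are illustrative:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int counter = 0;                        /* shared by default                       */

    #pragma omp parallel for
    for (int i = 0; i < 100000; i++)
        counter++;                          /* read-modify-write races between threads */

    /* typically prints less than 100000, and a different value on each run */
    printf("counter = %d\n", counter);
    return 0;
}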
Synchronization Constructs
2) Mutual Exclusion (Data Dependencies)
Critical Sections : Protect access to shared & modifiable data, allowing ONLY ONE thread to enter at a given time
#pragma omp critical
#pragma omp atomic – special case of critical, less overhead
Locks – only one thread updates the protected data at a time
Synchronization Constructs
A section of code can only be
executed by one thread at a time
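A minimal sketch (assumed example, not the slide's code) of atomic vs. critical; n and evaluate are placeholders:

double best = -1.0e30;
long   hits = 0;

#pragma omp parallel for
for (int i = 0; i < n; i++) {
    double v = evaluate(i);                 /* placeholder for some per-iteration work  */

    #pragma omp atomic                      /* atomic: a single, cheap protected update */
    hits++;

    #pragma omp critical                    /* critical: protects a whole block of code */
    {
        if (v > best)
            best = v;
    }
}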
OpenMP: Core Elements
 Directives & Pragmas
▪ Forking Threads (parallel region)
▪ Work Sharing
▪ Synchronization
▪ Data Environment
 User level runtime functions & Env. variables
OpenMP: Data Scoping
Challenge in Shared Memory Parallelization => Managing Data Environment
Scoping
OpenMP Shared variable : Can be Read/Written by all Threads in the team.
OpenMP Private variable : Each Thread has its own local copy of this variable
int i;                         // shared (declared before the parallel region)
int j;
#pragma omp parallel private(j)
{
    int k;                     // private (declared inside the parallel region)
    i = ...                    // i is Shared
    j = ...                    // j is Private (listed in the private clause)
    k = ...                    // k is Private
}
Loop variables in an omp for are private;
Local variables in the parallel region are private.
Alter the default behaviour with the default clause:
#pragma omp parallel default(shared) private(x)
{ ... }
#pragma omp parallel default(private) shared(matrix)
{ ... }
OpenMP: private Clause
• Each thread gets its own copy of the private variable.
• Private copies are not initialized (they do not take the value of the original variable).
• The value that Thread1 stores in x is different from the value Thread2 stores in x (see the sketch below).
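A short sketch (assumed example) of the private clause in action:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int x = 42;                             /* the original variable */

    #pragma omp parallel private(x)
    {
        /* Each thread has its own x here, and it is NOT initialized to 42;
           it must be assigned before it is read.                           */
        x = omp_get_thread_num() * 10;
        printf("Thread %d stores x = %d\n", omp_get_thread_num(), x);
    }
    return 0;
}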
OpenMP Parallel Programming
➢ Start with a parallelizable algorithm
Loop level parallelism
➢ Implement Serially : Optimized Serial Program
➢ Test, Debug & Time to solution
➢ Annotate the code with parallelization and
Synchronization directives
➢ Remove Race Conditions, False Sharing***
➢ Test and Debug
➢ Measure speed-up
Problem: Count the number of times each ASCII character occurs in a page of text
Input: ASCII text, stored as an ARRAY of characters; Number of bins (128)
Output: Histogram with 128 buckets – one for each ASCII character
➢Start with a parallelizable algorithm
▪Loop level parallelism?
void compute_histogram_st(char *page, int page_size, int *histogram)
{
    for(int i = 0; i < page_size; i++){
        char read_character = page[i];
        histogram[read_character]++;
    }
}
Can this loop be
parallelized?
Annotate the code with parallelization and synchronization directives
void compute_histogram_st(char *page, int page_size, int *histogram)
{
    #pragma omp parallel for
    for(int i = 0; i < page_size; i++) {
        char read_character = page[i];
        histogram[read_character]++;
    }
}
omp parallel for
This will not work! Why?
histogram is Shared across the threads, read_character is a Private variable, and the update histogram[read_character]++ needs Mutual Exclusion (a Critical Section).
Problem: Count the number of times each ASCII character occurs in a page of text
Input: ASCII text, stored as an ARRAY of characters; Number of bins (128)
Output: Histogram with 128 buckets – one for each ASCII character
Could be slower than the serial code.
Overhead = Critical Section + Parallelization
void compute_histogram_st(char *page, int page_size, int *histogram)
{
    #pragma omp parallel for
    for(int i = 0; i < page_size; i++){
        char read_character = page[i];
        #pragma omp atomic
        histogram[read_character]++;
    }
}
void compute_histogram(char *page, int page_size, int *histogram, int num_bins)
{
    int num_threads = omp_get_max_threads();
    #pragma omp parallel
    {
        int local_histogram[num_bins];            /* one private copy per thread        */
        for(int b = 0; b < num_bins; b++)         /* a VLA cannot take an initializer,  */
            local_histogram[b] = 0;               /* so zero it explicitly              */
        #pragma omp for
        for(int i = 0; i < page_size; i++){
            char read_character = page[i];
            local_histogram[read_character]++;    /* each thread updates its local copy */
        }
        #pragma omp critical                      /* combine the thread-local copies    */
        for(int i = 0; i < num_bins; i++){        /* into the shared histogram, one     */
            histogram[i] += local_histogram[i];   /* thread at a time                   */
        }
    }
}
Each thread updates its own local_histogram (Thread0, Thread1, Thread2, ..., one copy per thread, covering bins 1, 2, 3, ..., num_bins); the thread-local copies are then combined into the shared histogram.
OpenMP: Reduction
One or more variables that are private to each thread are the subject of a reduction operation at the end of the parallel region.
#pragma omp for reduction(operator : var)
Operator: + , * , - , & , | , && , || , ^
Combines the multiple local copies of var from the threads into a single copy at the master.
sum = 0;
#pragma omp parallel for
for (int i = 0; i < 9; i++)
{
    sum += a[i];
}
OpenMP: Reduction
sum = 0;
#pragma omp parallel for shared(a) reduction(+: sum)
for (int i = 0; i < 9; i++)
{
    sum += a[i];
}
With 3 threads:
sum_loc1 = a[0] + a[1] + a[2]
sum_loc2 = a[3] + a[4] + a[5]
sum_loc3 = a[6] + a[7] + a[8]
sum = sum_loc1 + sum_loc2 + sum_loc3
Computing π by the method of Numerical Integration
static long num_steps = 100000;
double step;
void main ()
{
    int i; double x, pi, sum = 0.0;
    step = 1.0 / (double) num_steps;
    for (i = 0; i < num_steps; i++)
    {
        x = (i + 0.5) * step;
        sum = sum + 4.0 / (1.0 + x*x);
    }
    pi = step * sum;
}
Serial Code
Loop
static long num_steps = 100000;
double step;
void main ()
{
    int i; double x, pi, sum = 0.0;
    step = 1.0 / (double) num_steps;
    for (i = 0; i < num_steps; i++) {
        x = (i + 0.5) * step;
        sum = sum + 4.0 / (1.0 + x*x);
    }
    pi = step * sum;
}
Computing π by the method of Numerical Integration
#include <omp.h>
#define NUM_THREADS 4
static long num_steps = 100000;
double step;
void main ()
{
    int i; double x, pi, sum = 0.0;
    step = 1.0 / (double) num_steps;
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel for reduction(+:sum) private(x)
    for (i = 0; i < num_steps; i++) {
        x = (i + 0.5) * step;
        sum = sum + 4.0 / (1.0 + x*x);
    }
    pi = step * sum;
}
Serial Code vs. Parallel Code
Thank You