Analysis of Parallel Algorithms for Energy
Conservation in Scalable Multicore Architecture
Problem statement
An important concern for computing, as for society, is conserving energy. Moreover, energy
conservation is critical in mobile devices for practical reasons. We examine the relation between
parallel applications and their energy requirements on scalable multicore processors. We believe
this sort of analysis can give programmers intuition about the energy required by the parallel
algorithms they use, thus guiding the choice of algorithm, architecture, and number of cores for a
particular application. Researchers have studied the performance scalability of parallel algorithms
for some time [10]. However, the performance scalability of a parallel algorithm is not the same as
its energy scalability. The difference is due to two important factors. First, there is a nonlinear
relationship between power and the frequency at which the cores of a multicore processor operate;
in fact, the power consumed by a core is typically proportional to the cube of its frequency.
Second, executing a parallel algorithm typically involves communication (or shared-memory accesses)
as well as computation, and the power and performance characteristics of communication and
computation may differ. For example, in many algorithms communication time may be masked by
overlapping communication with computation (e.g., [1]); however, the power required for
communication is the same whether or not communication overlaps with computation.
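A one-line justification of the cubic relationship, based on the standard CMOS scaling argument (an assumption stated here, not taken from the slides): dynamic power scales as P \propto C V^2 f, and the supply voltage V scales roughly linearly with the operating frequency f, so

  P_{dyn} \propto f^{3}, \qquad E_{cycle} = P_{dyn}/f \propto f^{2}

This quadratic energy-per-cycle dependence is what makes spreading work across more cores running at a lower frequency attractive, provided communication and idle energy do not dominate.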
Objective
(a) To analyse energy characteristics of parallel algorithms executed on scalable multicore
processors.
(b) To study the sensitivity of the analysis to changes in parameters such as the ratio of power
required for computation versus power required for communication.
(c) To determine how many cores to use in order to minimize energy consumption.
Methodology
The methodology is to evaluate the energy scalability of parallel applications under iso-performance, i.e., under a fixed performance requirement.
Partitioning Step
Find the critical path of the parallel algorithm. The critical path is the longest path through the
task dependency graph of the parallel algorithm (where edges represent task serialization). Note
that the critical path length gives a lower bound on the execution time of the parallel algorithm.
For the example algorithm considered here (adding N numbers on M cores, introduced in the next
step), the critical path is easy to find: it is the execution path of the core that holds the sum of
all the numbers at the end. Half of the cores send the sums they have computed to the other half, so
that no core receives a sum from more than one core. Each receiving core then adds the received
value to the local sum it has computed. The same step is performed recursively until only one core
is left. At the end of the computation, one core stores the sum of all N numbers.
Communication and computation steps
Consider a simple parallel algorithm to add N numbers using M cores. Initially the N numbers are
distributed equally among the M cores; at the end of the computation, one of the cores stores their
sum. Without loss of generality, assume that the number of cores is a power of two. The algorithm
runs in log(M) rounds. Partitioning the critical path into communication and computation steps, we
can see that there are log(M) communication steps and (N/M - 1 + log(M)) computation steps.
Figure: Adding N numbers using 4 actors; the leftmost line represents the critical path. Although
this application is embarrassingly parallel, it represents a broad class of tree-based algorithms.
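The following sketch (Python, not part of the original slides; the function name tree_sum and the step counters are illustrative) simulates the recursive-halving reduction and counts the computation and communication steps on the critical path, matching the counts given above.

# Sketch (not from the original slides): recursive-halving sum of N numbers on M cores,
# counting the computation and communication steps that lie on the critical path.

def tree_sum(values, M):
    N = len(values)
    assert M & (M - 1) == 0, "assume the number of cores is a power of two"
    # Phase 1: each core sums its local block of N/M numbers (N/M - 1 additions each).
    local = [sum(values[i * (N // M):(i + 1) * (N // M)]) for i in range(M)]
    comp_steps = N // M - 1               # local additions on the critical path
    comm_steps = 0
    # Phase 2: log2(M) rounds; in each round half of the active cores send their
    # partial sum to the other half, and each receiver adds it to its local sum.
    active = M
    while active > 1:
        half = active // 2
        for i in range(half):
            local[i] += local[i + half]   # receiver adds the received partial sum
        active = half
        comm_steps += 1                   # one receive on the critical path
        comp_steps += 1                   # one addition on the critical path
    return local[0], comp_steps, comm_steps

total, comp, comm = tree_sum(list(range(1, 17)), 4)   # N = 16, M = 4
print(total, comp, comm)   # 136, 5 computation steps (16/4 - 1 + log2(4)), 2 communication steps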
Agglomeration step
Scale the computation steps of the critical path so that the parallel performance matches the
performance requirement. Setting the computation time of the critical path equal to the difference
between the required completion time T and the communication time of the critical path, we obtain
the reduced frequency X' at which all M cores should run to complete in time T.
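(The formula itself is an image in the original slides and was lost in extraction. A plausible reconstruction, equating the critical-path time at the reduced frequency X' with the required time T, is:)

  T = \frac{\beta\,(N/M - 1 + \log M)}{X'} + \frac{K_c \log M}{F}
  \quad\Longrightarrow\quad
  X' = \frac{\beta\,(N/M - 1 + \log M)}{T - K_c \log M / F}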
where β represents the number of cycles required per addition. In order to achieve energy savings,
we require 0 < X' < F; this restriction provides a lower bound on the input size as a function of M and Kc.
Evaluate the message complexity (total number of messages processed) of the parallel algorithm.
The example algorithms discussed before show that the message complexity of some parallel
algorithms may depend only on the number of cores, while for others it depends on both the input
size and the number of cores used. It is easy to see that the number of message transfers for this
parallel algorithm when running on M cores is (M - 1). In this algorithm, the message complexity
depends only on M and not on the input size N.
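(A short check, not in the original slides: for the tree-reduction example the count follows from summing the messages sent over the log(M) rounds.)

  \sum_{i=1}^{\log_2 M} \frac{M}{2^{i}} = M - 1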
Evaluate the total idle time across all the cores, assuming they run at the new frequency X'.
Scaling the critical path of the parallel algorithm may lead to an increase in idle time along
other paths at other cores. The total idle time is:
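(The idle-time formula is another image lost in extraction. One plausible reconstruction for the tree-reduction example, assuming that in round i the cores not performing the addition idle for one addition time and the cores neither sending nor receiving idle for one message time, is:)

  T_{idle} = \sum_{i=1}^{\log_2 M} \left(M - \frac{M}{2^{i}}\right)\frac{\beta}{X'}
           + \sum_{i=1}^{\log_2 M} \left(M - \frac{M}{2^{i-1}}\right)\frac{K_c}{F}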
where the first term represents the total idle time spent by idle cores while other cores are busy
computing, and the second term represents the total idle time spent by idle cores while other cores
are engaged in message communication.
Step 6: Frame an expression for the energy consumption of the parallel algorithm using the energy
model. The energy expression is the sum of the energy consumed by computation (Ecomp),
communication (Ecomm), and idling under static power (Eidle).
Ecomp is lower if the cores run at a lower frequency, while Eidle may increase because the busy
cores take longer to finish. Ecomm may increase as more cores are used, since the computation is
more distributed. Observe that (N - 1) is the total number of computation steps. The energy
consumed for computation, communication, and idling while the algorithm runs on M cores at the
reduced frequency X' is:
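(The energy formula is also an image lost in extraction. A hedged reconstruction, assuming a dynamic-power model P = α·f³ so that the energy per cycle at frequency X' is α·X'², together with the normalization described in the mapping step below, is:)

  E = E_{comp} + E_{comm} + E_{idle}, \qquad
  E_{comp} = \alpha\,\beta\,(N-1)\,X'^{2}, \qquad
  E_{comm} = (M-1)\,E_m, \qquad
  E_{idle} = P_s\,T_{idle}

Here α is an assumed technology-dependent constant; the original slides may fold it into the normalization.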
Mapping step
Analyse the energy expression to obtain the number of cores required for minimum energy consumption
as a function of the input size. The energy expression depends on many variables: N (input size),
M (number of cores), β (number of cycles per addition), Kc (number of cycles executed at the
maximum frequency during a single message communication), Em (energy consumed for a single message
communication between cores), Ps (static power), and the maximum frequency F of a core. In most
architectures the number of cycles per addition is just one, so we assume β = 1. Set the idle
energy consumed per cycle, Ps/F, to 1, where the cycle is at the maximum frequency F, and express
all energy values with respect to this normalized energy value.
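A minimal numerical sketch of the mapping step (Python; the parameter values Kc, Em, Ps, α, and the choice of T are illustrative assumptions, and the energy function is the hedged reconstruction above, not the slides' exact expression): sweep M over powers of two and pick the value that minimizes the energy.

# Sketch (assumed parameter values): find the number of cores M that minimizes the
# reconstructed energy expression for the tree-reduction algorithm.

import math

def energy(N, M, T, F=1.0, beta=1.0, Kc=10.0, Em=5.0, Ps=1.0, alpha=1.0):
    """Energy for adding N numbers on M cores under completion-time constraint T."""
    logM = math.log2(M)
    comm_time = Kc * logM / F
    comp_cycles = beta * (N / M - 1 + logM)           # critical-path computation cycles
    if T <= comm_time:
        return math.inf                               # time constraint cannot be met
    X = comp_cycles / (T - comm_time)                 # reduced frequency X'
    if X <= 0 or X > F:
        return math.inf                               # no energy saving possible
    # Idle time accumulated over the log2(M) reduction rounds (see formula above).
    idle = sum((M - M / 2**i) * beta / X + (M - M / 2**(i - 1)) * Kc / F
               for i in range(1, int(logM) + 1))
    E_comp = alpha * beta * (N - 1) * X**2            # dynamic energy, P = alpha * f^3
    E_comm = (M - 1) * Em                             # one message per tree edge
    E_idle = Ps * idle                                # static power while idling
    return E_comp + E_comm + E_idle

N = 2**16
T = 1.0 * (N - 1)                                     # e.g. the sequential time at F = 1
best = min((2**k for k in range(1, 13)), key=lambda M: energy(N, M, T))
print("optimal number of cores:", best)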
Figure 1 shows that, for any input size N, energy initially decreases with increasing M and later
increases with increasing M. The energy for computation decreases as more cores run at reduced
frequencies, while the energy for communication increases with the number of cores. We can see
that increasing the input size increases the optimal number of cores required for minimum energy
consumption. We now consider the sensitivity of this analysis with respect to the ratio k. Figure 2
plots the optimal number of cores required for minimum energy consumption as the input size and k
are varied.
Figure 2 shows that for a fixed input size, the optimal number of cores required for minimum energy
consumption decreases with increasing k. Moreover, with increasing input size the trend remains the
same (approximating a decaying exponential curve).
