Operand Value Based Modeling of Dynamic
Energy Consumption of Soft Processors In FPGA
Zaid Al-Khatib, Samar Abdi
Presented at the Applied Reconfigurable Computing Conference
Ruhr University, Bochum
15 April 2015
# 2
Soft Processors in FPGAs
Compared to using function-specific hardware:
• Advantages
– High programmability in the FPGA fabric; can execute complex SW in
a small footprint.
– Short development time.
– Easy to reuse libraries.
• Drawbacks
– Can be very slow.
– May consume more energy.
# 3
Soft Processors in FPGAs
Drawbacks Mitigation Approach
1. Analyze the software execution for time / energy
consumption.
2. Identify the functions that consume the most time /
energy.
3. Examine SW optimizations, or implement the
function using HW accelerators.
4. Repeat until design meets requirements.
# 4
Energy Consumption Analysis
Measure or Estimate?
• For ASIC processors, physical measurement is
possible; for a soft processor in an FPGA it is not.
• A measurement would capture the energy consumed by the entire
FPGA chip, not just the resources implementing the soft
processor.
[Bazzaz, M. et al., IEEE Trans. on Instrumentation and Measurement, 2013]
# 5
Estimating the Energy Consumption
Model Abstraction Levels
Processor power models, ordered by abstraction level (trading accuracy for speed):
1 Transistor Level
2 Gate Level
3 RT Level
4 Pipeline state aware
5 Instruction Level
6 Analytical, instruction-class based
7 Function Level Macro Model
8 Mode Based
[Bansal, N. et al., VLSI Design, 2005]
# 6
Instruction Level Models
• Two Types of Instruction Level Models:
• First Order Model
Average energy for each instruction
Instruction          First Order Estimate (nJ)
lwi r4, r19, 8       1.5
lwi r3, r19, 4       1.5
mul r3, r4, r3       1.25
swi r3, r19, 12      1.4
Total                5.65
# 7
Instruction Level Models
• Two Types of Instruction Level Models:
• First Order Model
Average energy for each instruction
• Second Order Model
Inter Instruction Energy Effect
E(load, load) < E(load, mul)
Instruction          First Order Estimate (nJ)   Second Order Estimate (nJ)
lwi r4, r19, 8       1.5                         1.5
lwi r3, r19, 4       1.5                         0.8
mul r3, r4, r3       1.25                        1.25
swi r3, r19, 12      1.4                         1.4
Total                5.65                        4.95
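The two estimates in the table above can be reproduced with a short sketch. It assumes the second order model can be written as the first order average plus a correction for each (previous, current) instruction pair. Only the load-after-load correction (0.8 - 1.5 nJ) is taken from the example; the enum, the table names and the zero placeholders are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum op { OP_LWI, OP_MUL, OP_SWI, OP_COUNT };

/* Average per-instruction energies (nJ) from the slide's example. */
static const double avg_nj[OP_COUNT] = { 1.5, 1.25, 1.4 };

/* Inter-instruction correction (nJ), indexed [previous][current].
 * Only load-after-load comes from the example; every other entry is a
 * zero placeholder, not a measured value. */
static const double pair_adj_nj[OP_COUNT][OP_COUNT] = {
    [OP_LWI][OP_LWI] = -0.7,
};

/* First order: sum of the average energy of each instruction. */
double first_order_nj(const enum op *seq, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        assert((int)seq[i] >= 0 && (int)seq[i] < OP_COUNT);
        total += avg_nj[seq[i]];
    }
    return total;
}

/* Second order: first order plus a correction for each adjacent pair. */
double second_order_nj(const enum op *seq, size_t n)
{
    double total = first_order_nj(seq, n);
    for (size_t i = 1; i < n; i++)
        total += pair_adj_nj[seq[i - 1]][seq[i]];
    return total;
}
```

Called on the sequence lwi, lwi, mul, swi, the two functions give the totals of the First Order and Second Order columns.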
# 8
Motivation for a New Instruction
Level Model
• When tested on modeling the energy consumed by a MicroBlaze soft
processor in a Virtex-5 FPGA, Instruction Level Models failed
because of:
– Poorly designed instruction characterization techniques
They assume the average power of an instruction is equal to the power of executing it in an
infinite loop.
e.g. E(add) = E(add in an infinite loop) – E(empty infinite loop)
– No accounting for operand values
They assume E(add 0, 0) = E(add 0x7fffffff, 0x7fffffff)
# 9
New Instruction Energy Estimation Method
Location Based Instruction Energy Profiling
Reference Application:
$L2
lwi r4, r19, 8
lwi r3, r19, 4
mul r3, r4, r3
swi r3, r19, 12
...
bri $L2
The instruction under test is inserted at each location of the loop in turn, e.g. before the first lwi:
$L2
add r6, r7, r8
lwi r4, r19, 8
lwi r3, r19, 4
mul r3, r4, r3
swi r3, r19, 12
...
bri $L2
and likewise after the first lwi, after the second lwi, and after mul.
[Plot: addk instruction energy (nJ, 0 to 0.4) vs. location of inserted instruction in benchmarking loop (lwi, lwi, mul, swi)]
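A minimal sketch of how such a location profile could be computed, assuming the energy attributed to the inserted instruction at a location is the per-iteration energy of the modified loop minus that of the reference loop. The slides do not spell out this subtraction, and the function and parameter names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Build a location-based energy profile for one inserted instruction.
 * reference_loop_nj:  per-iteration energy of the unmodified loop (nJ)
 * modified_loop_nj[i]: per-iteration energy with the instruction
 *                      inserted at location i (nJ)
 * profile_nj[i]:      output, energy attributed to the instruction at
 *                      location i (assumed to be the difference) */
void location_profile_nj(double reference_loop_nj,
                         const double *modified_loop_nj,
                         size_t n_locations,
                         double *profile_nj)
{
    assert(modified_loop_nj != NULL && profile_nj != NULL);
    for (size_t i = 0; i < n_locations; i++)
        profile_nj[i] = modified_loop_nj[i] - reference_loop_nj;
}
```

A profile entry can come out negative when the insertion lowers the loop's energy, which would be consistent with the plot axes in the deck extending below zero.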
# 10
Energy Profiles of Instructions
[Plots: instruction energy (nJ, -0.2 to 0.8) vs. location of inserted instruction in benchmarking loop; panels: muli, lwi, srl, addk]
# 11
Instruction Classes
[Plots: instruction energy (nJ, -0.2 to 0.8) vs. location of inserted instruction in benchmarking loop; panels: lwi, srl, addk]
Benchmarking loop locations (x-axis): lwi lwi mul swi lwi imm swi lwi lwi idiv swi lwi ori swi lwi addk sra sra sra sra addk srl srl srl srl srl addk andi rsubk swi lwi addik swi lwi addik cmp blti
• Three instruction classes
– Arithmetic and Logic
– Memory Load and Store
– Shift Operations
# 12
Instruction Base Energy
• Add instruction base energy from the Location Based Energy Profile.
• Accounts for the inter-instruction energy effect.
Original Loop:
…
mul r3,r4,r3
swi r3,r19,12
lwi r3,r19,12
xori r3,r3,589994
...
Loop with inserted addk instruction:
…
mul r3,r4,r3
swi r3,r19,12
addk r6,r7,r8
lwi r3,r19,12
xori r3,r3,589994
...
[Plot: addk instruction energy (nJ, -0.2 to 1) vs. location of inserted instruction in benchmarking loop]
Base energy after instruction from class (nJ):
Instruction   Arithmetic & Logic   Memory   Shift
add           0.1147               0.4882   0.1608
# 13
Instruction Base Energy
• Load word instruction base energy from the Location Based Energy Profile.
[Plots: instruction energy (nJ, -0.2 to 1) vs. location of inserted instruction in benchmarking loop; panels: addk, lwi]
Base energy after instruction from class (nJ):
Instruction   Arithmetic & Logic   Memory   Shift
add           0.1147               0.4882   0.1608
Load word     0.7680               0.3536   0.9858
# 14
Operand Value Effect
• Instruction energy range, depending on operand value:
...
lwi r4, r19, 8
add r6, r7, r8
lwi r3, r19, 4
...
• Minimum Profile: r7 = r8 = 0
• Maximum Profile: r7 = r8 = 0x7fffffff
• Energy Variance of an instruction: the maximum additional energy consumed as a result of
non-zero operand values
• Energy Variance = Max profile – Min profile
[Plot: addk minimum profile, addk maximum profile and addk Energy Variance; instruction energy (nJ, -0.4 to 1.6) vs. location of inserted instruction in benchmarking loop]
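The Energy Variance definition above can be sketched as a per-location subtraction of the two profiles. The sketch also returns the largest difference, on the assumption that this is how a single "maximum Energy Variance" figure per instruction would be obtained; the function and parameter names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Per-location Energy Variance = maximum profile - minimum profile.
 * Writes the per-location values to ev_nj and returns the largest one.
 * Both profiles must cover the same n_locations (> 0) locations. */
double energy_variance_nj(const double *min_profile_nj,
                          const double *max_profile_nj,
                          size_t n_locations,
                          double *ev_nj)
{
    assert(min_profile_nj != NULL && max_profile_nj != NULL);
    assert(ev_nj != NULL && n_locations > 0);

    double largest = max_profile_nj[0] - min_profile_nj[0];
    for (size_t i = 0; i < n_locations; i++) {
        ev_nj[i] = max_profile_nj[i] - min_profile_nj[i];
        if (ev_nj[i] > largest)
            largest = ev_nj[i];
    }
    return largest;
}
```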
# 15
Operand Value – Energy Impact
• The same benchmark is run with four input arrays:
– Values of input array contain: a single 1 and 31x 0’s
– Values of input array contain: 2x 1’s and 30x 0’s
– Values of input array contain: 3x 1’s and 29x 0’s
– Values of input array contain: 31x 1’s and a single 0
• Increased energy consumption by approx. 20%
#define size 10
int main(){
    int temp, arr_in[size]=
        {1024, 4194304, 67108864, 2048, 128, 256, 2, 8388608, 32,
         268435456};
    /* endless benchmarking loop */
    while(1){
        for (int i=0; i<size; i++){
            temp=arr_in[i];
            temp*=2;
            temp++;}}
    return 0;}
The other three variants differ only in the initializer of arr_in:
{33554433, 67109888, 524416, 4196352, 134217736, 671088640,
 1073750016, 8388612, 20971520, 67141632}
{1600, 1073809408, 268435496, 36872, 8413184, 135176, 11010048,
 33560576, 138, 301990400}
{2147418111, 2147481599, 2147482623, 2145386495, 2147467263,
 2147352575, 2147483643, 1879048191, 2147475455, 2147221503}
[Plot: dynamic energy consumed (nJ, 190 to 235) vs. number of ones in each input array value (0 to 30)]
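The x-axis quantity, the number of ones in an operand value, can be computed with a standard bit-counting loop. This is an illustrative sketch, not the tool's own annotation code; `ones_count` and `mean_ones` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the set bits in a 32-bit operand value (Kernighan's method). */
unsigned ones_count(uint32_t value)
{
    unsigned count = 0;
    while (value != 0) {
        value &= value - 1; /* clears the lowest set bit */
        count++;
    }
    return count;
}

/* Average number of set bits per element of an input array. */
double mean_ones(const int32_t *values, size_t n)
{
    assert(values != NULL && n > 0);
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++)
        total += ones_count((uint32_t)values[i]);
    return (double)total / (double)n;
}
```

Applying `mean_ones` to each benchmark's `arr_in` gives the horizontal position of that run in the plot.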
# 16
Operand Value – Energy Impact
• Impact of operand density:
– Energy is linearly dependent on operand value density
[Plot: dynamic energy consumed (nJ, 190 to 235) vs. number of ones in each input array value (0 to 30); 193 nJ = ∑ Base Energy of Instructions]
$E(i) = E_{base} + k \cdot EV(i)$
k: fraction of Energy Variance
$k = m \cdot OD(i) + b$
Linear function of Operand Density OD.
# 17
Operand Value Based Model
• Energy of an instruction
– Instruction Energy = Base energy + Operand Impact
$E(i) = E_{base}(i, j) + (m \cdot OD(i) + b) \cdot EV(i)$
• Model Parameters:
– The linear parameters (m and b)
– For each instruction:
  • Three values of Base Energy E_base (one for each class)
  • Maximum Energy Variance per instruction
Base energy after instruction from class (nJ), and maximum Energy Variance:
Instruction   Arithmetic & Logic   Memory   Shift    Max. Energy Variance
add           0.1147               0.4882   0.1608   1.0034
rsubk         0.3461               1.0352   0.7762   0.7872
mul           0.1233               0.4819   0.4019   0.9795
idiv          0.1850               0.5401   0.4419   0.7602
and           0.0892               0.5306   0.4213   0.6977
xori          0.3257               0.6345   0.5921   0.6977
cmp           0.1821               0.7108   0.5727   1.0456
nop           0.1343               0.4808   0.1959   0
lwi           0.7680               0.3536   0.9858   0.5310
swi           0.8159               0.4108   0.9761   0.2208
srl           0.1628               0.5550   0.1124   1.0782
sra           0.1571               0.5836   0.1899   1.0373
Operand Value Impact - Linear Fit Parameters
m = 0.016
b = -0.061
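A minimal sketch of evaluating the model for one instruction, using two rows of the table and the fit parameters above. The struct layout and names are hypothetical, and the operand density is passed in as a plain number: the slides do not state its unit, so treating it as a count of one bits is an assumption.

```c
#include <assert.h>
#include <stddef.h>

enum insn_class { CLASS_ARITH_LOGIC, CLASS_MEMORY, CLASS_SHIFT, CLASS_COUNT };

struct insn_params {
    const char *name;
    double base_nj[CLASS_COUNT]; /* base energy after each class (nJ) */
    double max_ev_nj;            /* maximum Energy Variance */
};

/* Two rows copied from the parameter table. */
static const struct insn_params ADD_PARAMS =
    { "add", { 0.1147, 0.4882, 0.1608 }, 1.0034 };
static const struct insn_params LWI_PARAMS =
    { "lwi", { 0.7680, 0.3536, 0.9858 }, 0.5310 };

/* Linear fit parameters from the slide. */
static const double FIT_M = 0.016;
static const double FIT_B = -0.061;

/* E(i) = E_base(i, j) + (m * OD(i) + b) * EV(i), where j is the class
 * of the preceding instruction. The unit of od is an assumption. */
double ovbm_energy_nj(const struct insn_params *p,
                      enum insn_class prev_class, double od)
{
    assert(p != NULL);
    assert((int)prev_class >= 0 && (int)prev_class < CLASS_COUNT);
    double k = FIT_M * od + FIT_B; /* fraction of the Energy Variance */
    return p->base_nj[prev_class] + k * p->max_ev_nj;
}
```

For example, `ovbm_energy_nj(&ADD_PARAMS, CLASS_MEMORY, od)` evaluates an add that follows a load or store. The sketch does not clamp k, so small densities give a slightly negative operand term, exactly as the fitted line does.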
# 18
Estimation Tool
[Flow diagram elements: Application (C / C++), Annotated Executable, Target Device, List of Basic Blocks, Execution Trace [Basic Block Sequence], Ones Densities, Processor Energy Model, Energy Report; grouped into Phase I and Phase II]
• Phase 1 – Generate model inputs:
– Instructions in Basic Blocks
– Annotated application for:
  • Execution trace
  • Densities of operand values
• Phase 2 – Estimate Energy
– Estimate energy of each instruction,
and each basic block
– Estimate total energy consumed and
generate energy report
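Phase 2 can be sketched as a walk over the execution trace that adds up per-block energies. The sketch makes a simplifying assumption: each basic block has one fixed energy, whereas the tool described here also feeds in operand densities, so a block's energy may differ between executions. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Total energy of an execution trace.
 * block_energy[b]: estimated energy of one execution of basic block b
 *                  (assumed constant here; any consistent unit)
 * trace[i]:        ID of the i-th executed basic block */
double trace_energy(const double *block_energy, size_t n_blocks,
                    const size_t *trace, size_t trace_len)
{
    assert(block_energy != NULL && trace != NULL);
    double total = 0.0;
    for (size_t i = 0; i < trace_len; i++) {
        assert(trace[i] < n_blocks); /* reject unknown block IDs */
        total += block_energy[trace[i]];
    }
    return total;
}
```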
# 19
Estimated Energy Report
• Dhrystone Energy Report
– Consists of 91 basic blocks
– In total, 333 basic blocks executed
• Basic Block 43 consumes 41% of the energy
• Focus optimization on Basic Blocks 43 and 44
[Plots: estimated dynamic energy (μJ, 0 to 30) and (mJ, 0.0 to 1.5) vs. execution trace of basic block IDs]
Contribution of each Basic Block to total energy: BB#43 41%, BB#44 14%, BB#50 6%, BB#49 5%
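The per-block contributions in such a report are each block's accumulated energy divided by the total. A minimal sketch, assuming the per-block totals have already been accumulated over the trace (hypothetical names):

```c
#include <assert.h>
#include <stddef.h>

/* Convert accumulated per-block energies into percentage shares.
 * block_total[b]:   energy accumulated by basic block b over the run
 * share_percent[b]: output, block b's share of the total in percent */
void block_shares_percent(const double *block_total, size_t n_blocks,
                          double *share_percent)
{
    assert(block_total != NULL && share_percent != NULL);

    double sum = 0.0;
    for (size_t i = 0; i < n_blocks; i++)
        sum += block_total[i];
    assert(sum > 0.0); /* a share is undefined for an empty report */

    for (size_t i = 0; i < n_blocks; i++)
        share_percent[i] = 100.0 * block_total[i] / sum;
}
```

Sorting the shares in descending order gives the list of blocks to optimize first.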
# 20
Estimation Accuracy
• Tested the tool with a diverse group of benchmarks
• An accurate estimate from the Xilinx XPower Analyzer (XPA) is used as the reference
Application      Time (µs)   Power (mW)   Energy (mJ)
Dhrystone        39.35       33.35        1.31
Quicksort        164.20      33.78        5.55
ReadBMPBlock     251.61      39.96        10.05
DCT              166.68      30.84        5.14
Quantize         58.20       25.52        1.49
Zigzag           25.33       30.98        0.78
Huffman Encode   471.95      40.70        19.21
JPEG             973.77      37.66        36.67
# 21
Instruction Level Models Accuracy
• State of the art instruction level models
– Large errors
                          First order Model     Second order Model
Application               E (mJ)    Err         E (mJ)    Err
Dhrystone                 3.6       171%        3.3       155%
QuickSort                 15.80     185%        12.63     128%
ReadBMP                   24.6      145%        21.7      116%
DCT                       18.2      253%        18.2      253%
Quantize                  6.4       329%        4.0       169%
Zigzag                    2.3       195%        2.3       194%
Huffman Enc.              50.7      164%        47.7      148%
JPEG                      102.2     179%        93.8      156%
Average error                       216%                  156%
Std. Deviation of error             51.6%                 35.0%
# 22
Instruction Level Models Accuracy
• State of the art instruction level models
– Large errors
– Can be calibrated using the error of the first benchmark estimate
                          First order Model                       Second order Model
Application               E (mJ)   Err     E* (mJ)   Err          E (mJ)   Err     E* (mJ)   Err
Dhrystone                 3.6      171%    1.31      0.0%**       3.3      155%    1.31      0.0%**
QuickSort                 15.80    185%    5.07      -8.7%        12.63    128%    4.95      -10.7%
ReadBMP                   24.6     145%    7.90      -21.4%       21.7     116%    8.50      -15%
DCT                       18.2     253%    5.82      13.2%        18.2     253%    7.12      38.5%
Quantize                  6.4      329%    2.04      38%          4.0      169%    1.57      5.4%
Zigzag                    2.3      195%    0.74      -5.3%        2.3      194%    0.90      15.3%
Huffman Enc.              50.7     164%    16.3      -15.4%       47.7     148%    18.7      -2.7%
JPEG                      102.2    179%    32.8      -10.7%       93.8     156%    36.8      0.3%
Average error                      216%              12.6%                 156%              9.5%
Std. Deviation of error            51.6%             10.6%                 35.0%             10.4%
# 23
Instruction Level Models Accuracy
• State of the art instruction level models
– Even with calibration, OVBM is more than twice as accurate
                          First order Model     Second order Model    OVBM
Application               E* (mJ)   Err         E* (mJ)   Err         E (mJ)   Err
Dhrystone                 1.31      0.0%**      1.31      0.0%**      1.30     -0.7%
QuickSort                 5.07      -8.7%       4.95      -10.7%      5.37     -3.2%
ReadBMP                   7.90      -21.4%      8.50      -15%        8.82     -12%
DCT                       5.82      13.2%       7.12      38.5%       4.96     -3.5%
Quantize                  2.04      38%         1.57      5.4%        1.47     -0.9%
Zigzag                    0.74      -5.3%       0.90      15.3%       0.78     -0.6%
Huffman Enc.              16.3      -15.4%      18.7      -2.7%       17.64    -8.2%
JPEG                      32.8      -10.7%      36.8      0.3%        33.67    -8.2%
Average error                       12.6%                 9.5%                 4.2%
Std. Deviation of error             10.6%                 10.4%                3.5%
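The Err columns are consistent with a signed relative error against the XPA reference energy; for QuickSort under OVBM, (5.37 - 5.55) / 5.55 is about -3.2%. A sketch of that calculation follows, with hypothetical function names. Averaging the absolute values is an assumption about how the "Average error" row was formed.

```c
#include <assert.h>
#include <stddef.h>

/* Signed relative error of an estimate against a reference, in percent. */
double relative_error_percent(double estimate, double reference)
{
    assert(reference != 0.0);
    return 100.0 * (estimate - reference) / reference;
}

/* Mean of the absolute relative errors over n benchmarks, in percent. */
double mean_abs_error_percent(const double *estimate,
                              const double *reference, size_t n)
{
    assert(estimate != NULL && reference != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        double err = relative_error_percent(estimate[i], reference[i]);
        sum += (err < 0.0) ? -err : err;
    }
    return sum / (double)n;
}
```

Because the table values are rounded, recomputing the errors from them can differ from the printed figures in the last digit.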
# 24
Estimation Speed
                   OVBM Tool (Seconds)
Application        Host    Target   Total     XPA (Hours)
Dhrystone          0.03    7.49     7.53      1.2
Quicksort          0.01    23.08    23.09     2.5
ReadBMPBlock       0.21    5.88     6.08      3.4
DCT                0.03    10.85    10.88     2.5
Quantize           0.01    8.40     8.41      1.4
Zigzag             0.01    4.41     4.42      1.1
Huffman Encode     0.07    65.04    65.11     5.7
JPEG               0.28    104.24   104.52    10.6
• OVBM tool is 2 to 3 orders of magnitude faster than the accurate XPA tool (roughly 300x to 2000x on these benchmarks)
• Speed of OVBM depends on the speed of the Target Device
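The speedup for each benchmark follows from the table by converting the XPA time to seconds and dividing by the OVBM total. A small sketch (the function name is hypothetical):

```c
#include <assert.h>

/* Speedup of the OVBM tool over XPA for one benchmark.
 * xpa_hours:    XPA run time in hours
 * ovbm_seconds: OVBM total (host + target) run time in seconds */
double ovbm_speedup(double xpa_hours, double ovbm_seconds)
{
    assert(xpa_hours >= 0.0 && ovbm_seconds > 0.0);
    return (xpa_hours * 3600.0) / ovbm_seconds;
}
```

For Dhrystone this gives 4320 s / 7.53 s, a factor of a little under 600.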
# 25
Limitations
• The generated model is specific to a single
implementation and processor configuration.
• The source code of the application is required in order to
annotate it and trace operand value metrics.