AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
Krzysztof Rojek, CTO, byteLAKE
krojek@byteLAKE.com
6TH INTERNATIONAL EULAG USERS WORKSHOP
MAY 29, 2018, 14:00-14:45
 Area: Adaptation of real-life scientific codes to the most advanced computing architectures.
 Challenge: Device architectures are constantly changing and vary widely. Our codes need to be highly portable and flexible.
 Goal: Take HPC to "Industry 4.0" by implementing smart techniques that optimize the codes for performance and energy consumption.
2
 Piz Daint (ranked 3rd on the TOP500 list):
 GPU: NVIDIA Tesla P100 (Pascal)
 1 GPU per node
 Single-GPU design
 5320 nodes (up to 36 used in this work)
 Calculation speed: float is 2x faster than double
 MICLAB:
 GPU: NVIDIA Tesla K80 (Kepler)
 2 GPUs per node
 Dual-GPU design
 2 nodes (remaining nodes with Intel Xeon Phi)
 Calculation speed: float is 3x faster than double
3
 Size of data transfer between nodes: 2x smaller using float than double
 No sudo access – a problem when the code relies on DVFS
Expectation: Mixed precision arithmetic allows us to reduce the energy consumption and execution time, and it can be used on real HPC platforms (without special access).
 Stencil-based algorithm for numerical simulation of geophysical fluid flows on micro-to-planetary scales:
 7 stencils (compressed into 4 kernels) – each depends on one or more others (343 flops per element)
 Iterative algorithm – a single iteration represents one time step
 11 matrices:
 x, xP – scalar quantity (e.g., temperature); input/output matrices between time steps
 v1, v2, v3, v1P, v2P, v3P – velocity vectors in the i, j, and k directions
 h – density matrix
 cp, cn – temporary, intermediate matrices
4
[Diagram: kernel dependencies within one time step.
Kernel 0: inputs (x, v1, v2, v3, h), output xP
Kernel 1: inputs (v1, v2, v3, h, xP), outputs v1P, v2P, v3P
Kernel 2: inputs (x, h, xP, v1P, v2P, v3P), outputs cp, cn
Kernel 3: inputs (x, h, v1P, v2P, v3P, cp, cn), output xP (the x for the next time step)]
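For orientation, below is a minimal host-side sketch (CUDA C++) of how one time step might launch the four kernels in this dependency order; the kernel names, the empty argument lists, and the placement of the halo exchange are illustrative assumptions, not the original code.

```cpp
#include <cuda_runtime.h>

// Hypothetical stand-ins for the four MPDATA kernels; the real argument lists
// (the 11 matrices) are omitted, only the data dependencies are noted.
__global__ void kernel0() { /* reads x, v1, v2, v3, h; writes xP */ }
__global__ void kernel1() { /* reads v1, v2, v3, h, xP; writes v1P, v2P, v3P */ }
__global__ void kernel2() { /* reads x, h, xP, v1P, v2P, v3P; writes cp, cn */ }
__global__ void kernel3() { /* reads x, h, v1P, v2P, v3P, cp, cn; writes xP */ }

// One time step: the kernels must run in dependency order within the step.
void timeStep(dim3 grid, dim3 block, cudaStream_t s) {
    kernel0<<<grid, block, 0, s>>>();
    kernel1<<<grid, block, 0, s>>>();
    kernel2<<<grid, block, 0, s>>>();
    kernel3<<<grid, block, 0, s>>>();
    // halo exchange between subdomains would follow here (see the later slides)
}
```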
 Idea: Provide a highly parametrized code in order to easily map the algorithm onto GPUs
 Mapping: Select the right values of the code parameters (a configuration) with respect to the desired criterion (energy consumption)
 How to: We build the search space of possible configurations and prune it using our machine learning module (MLM)
 MLM: Still an ongoing task; here we propose to apply a modified random forest algorithm.
5
 We can use different numbers of:
 Streams (SPG)
 Nodes (NDS)
 With different topologies (see the enumeration sketch after the diagram below):
 Topology of streams (TGP)
 Topology of nodes (TDS)
6
[Diagram: example stream/node topologies.
Stream/Node: 1, Topology: 1 (1x1)
Streams/Nodes: 2, Topologies: 2 (1x2, 2x1)
Streams/Nodes: 4, Topologies: 3 (1x4, 2x2, 4x1)]
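As an illustration, a small C++ sketch that enumerates the 2D topologies available for a given number of streams or nodes; it reproduces the counts shown above (1 gives 1 topology, 2 gives 2, 4 gives 3). The function name is ours, not taken from the original code.

```cpp
#include <cstdio>
#include <utility>
#include <vector>

// All rows x columns factorizations of `count` are valid topologies.
std::vector<std::pair<int, int>> topologies(int count) {
    std::vector<std::pair<int, int>> result;
    for (int rows = 1; rows <= count; ++rows)
        if (count % rows == 0)
            result.emplace_back(rows, count / rows);
    return result;
}

int main() {
    for (int n : {1, 2, 4}) {
        std::printf("%d streams/nodes:", n);
        for (auto [r, c] : topologies(n)) std::printf(" %dx%d", r, c);
        std::printf("\n");  // prints: 1x1; 1x2 2x1; 1x4 2x2 4x1
    }
}
```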
7
[Diagram: the two schemes compared side by side; in each, Kernel 0 to Kernel 3 are followed by a data transfer, and the schemes differ in where the domain is synchronized.]
Distributed subdomains (see the sketch below):
 Each stream works on its own distributed subdomain
 Computations are independent within a single time step
 Halo exchange is required after each time step
Shared subdomain:
 All streams share the same subdomain
 Computations depend on the neighboring streams
 Halo exchange is not required within a single node
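A minimal CUDA C++ sketch of the distributed-subdomain scheme, assuming a hypothetical stepKernel and one subdomain per stream; the halo exchange after the time step is indicated only by a comment.

```cpp
#include <cuda_runtime.h>

// Hypothetical kernel: applies one time step to a single subdomain.
__global__ void stepKernel(float* subdomain, int nx, int ny) { /* ... */ }

// Each stream processes its own subdomain; the work within a time step is
// independent, so the launches below can overlap.
void runTimeStep(float** subdomains, cudaStream_t* streams, int numStreams,
                 int nx, int ny) {
    dim3 block(32, 4);
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
    for (int s = 0; s < numStreams; ++s)
        stepKernel<<<grid, block, 0, streams[s]>>>(subdomains[s], nx, ny);
    cudaDeviceSynchronize();   // all subdomains have finished the step
    // halo exchange between neighboring subdomains happens here
}
```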
 By selecting the right halo-exchange strategy we can favor either more parallelism or fewer operations
 We can use different strategies within a node and between nodes
8
Halo exchange strategies:
 cudaMemcpy with buffers: GPU direct or no GPU direct
 No buffers: copy performed by kernels (a single kernel or two kernels)
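A minimal CUDA C++ sketch of the buffered, GPU-direct variant: the halo is packed into a buffer on the source GPU and copied peer to peer with cudaMemcpyPeerAsync (peer access is assumed to be enabled). The packHalo kernel and the simplified indexing are illustrative assumptions; without GPU direct, the copy would instead be staged through host memory with two cudaMemcpyAsync calls.

```cpp
#include <cuda_runtime.h>

// Hypothetical packing kernel: gathers the halo region into a contiguous
// buffer (here simplified to copying the first haloElems values).
__global__ void packHalo(const float* field, float* buffer, int haloElems) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < haloElems) buffer[i] = field[i];
}

// Buffered halo exchange from srcDev to dstDev using a peer-to-peer copy.
// Assumes cudaDeviceEnablePeerAccess() was called for the device pair.
void exchangeHalo(const float* srcField, float* srcBuf, int srcDev,
                  float* dstBuf, int dstDev, int haloElems, cudaStream_t s) {
    cudaSetDevice(srcDev);
    packHalo<<<(haloElems + 255) / 256, 256, 0, s>>>(srcField, srcBuf, haloElems);
    cudaMemcpyPeerAsync(dstBuf, dstDev, srcBuf, srcDev,
                        haloElems * sizeof(float), s);
    // an unpack kernel on dstDev would scatter the buffer back into the field
}
```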
 We also take into consideration some basic parameters:
 CUDA block sizes for each of the 4 kernels
 CUDA blocks are of size X times Y (see the enumeration sketch below), where:
 X*Y mod 32 = 0
 X >= Y
 X mod 16 = 0
 X*Y <= 1024
 X <= M and Y <= N, where NxMxL is the size of the grid
 Data alignment and padding within a range from 1 to 4096 B
 Alignment in: {1, 2, 4, 8, …, 4096}
9
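A small C++ sketch that enumerates the CUDA block shapes allowed by the rules above; maxX and maxY stand for the grid dimensions M and N, and the function name is ours.

```cpp
#include <cstdio>
#include <utility>
#include <vector>

// Collect all (X, Y) block shapes that satisfy the constraints listed above.
std::vector<std::pair<int, int>> validBlockShapes(int maxX, int maxY) {
    std::vector<std::pair<int, int>> shapes;
    for (int x = 16; x <= maxX; x += 16) {            // X mod 16 == 0, X <= M
        for (int y = 1; y <= maxY && y <= x; ++y) {   // Y <= N and X >= Y
            if (x * y > 1024) break;                  // at most 1024 threads per block
            if ((x * y) % 32 != 0) continue;          // whole number of warps
            shapes.emplace_back(x, y);
        }
    }
    return shapes;
}

int main() {
    for (auto [x, y] : validBlockShapes(512, 512))
        std::printf("%d x %d\n", x, y);
}
```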
 Assumption: We believe we can find a really good configuration by testing about 5000 configurations from the search space (testing more than this is too expensive)
 We consider two possible approaches:
 Positive: Find good solutions and eliminate groups that seem to be worse than ours
 Risk: When we find a branch with a good solution, we may eliminate other branches (also quite good) that only seem worse. In fact, we may eliminate the branch containing the best solution.
 Negative: Find bad solutions and eliminate them
 Risk: After eliminating branches with bad solutions, the worst solution may still remain in the search space (but so does the best one).
 Fact: We test random branches (we may not select the best or the worst one); we are searching for a suboptimal solution (a simplified sampling sketch follows below).
10
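As a deliberately simplified illustration of the sampling budget only (the branch pruning performed by the machine learning module is not shown), here is a C++ sketch that draws about 5000 random configurations and keeps the best one found; Config, randomConfig, and evaluateEnergy are hypothetical placeholders.

```cpp
#include <cstdlib>
#include <limits>

// Hypothetical configuration; the ranges below are placeholders.
struct Config { int streams, nodes, blockX, blockY, align; };

Config randomConfig() {
    return { 1 + std::rand() % 4, 1 + std::rand() % 36,
             16 * (1 + std::rand() % 4), 1 + std::rand() % 16,
             1 << (std::rand() % 13) };
}

// Placeholder: the real tuner would run a few time steps and measure energy.
double evaluateEnergy(const Config&) { return 0.0; }

// Test ~5000 random configurations and keep the best (suboptimal) one found.
Config searchSuboptimal(int budget = 5000) {
    Config best{};
    double bestEnergy = std::numeric_limits<double>::max();
    for (int i = 0; i < budget; ++i) {
        Config c = randomConfig();
        double e = evaluateEnergy(c);
        if (e < bestEnergy) { bestEnergy = e; best = c; }
    }
    return best;
}
```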
 Precision: DOUBLE
 Diameter: 28.0
 L2 norm: 0.0746
 Diffusion error: 1.7503
 Phase error: 0.7576
11
Halo exchange (xP or x)
Precision | Diameter | L2 norm | Diff. err. | Phase err.
DOUBLE    | 28.0     | 0.0746  | 1.7503     | 0.7576
FLOAT     | 28.0     | 0.1301  | 2.2439     | 7.5919
12
 Goal: Reduce the energy consumption
 Condition: Keep the accuracy at a high level (1% loss is acceptable)
 Assumptions:
 The proposed method is intended for iterative algorithms
 Dynamic approach, self-adaptable to a particular simulation
 Self-adaptation is based on a short training stage (the first 11 time steps)
13
Training stage (repeated for each of the 11 matrices; a minimal sketch follows below):
1. Change the i-th matrix from DP to SP
2. Execute a single time step
3. Measure energy and accuracy
4. Restore the i-th matrix to DP
The traditional approach, based on a static selection of precision arithmetic, is less flexible and may be too restrictive for some simulations.
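A minimal C++ sketch of this training loop; the callbacks are hypothetical hooks into the real solver and power measurement, and ΔE/ΔA are taken relative to a baseline double-precision time step.

```cpp
#include <functional>
#include <vector>

struct TrainingSample { double deltaE, deltaA; };  // energy saved, accuracy lost

// One training pass: switch each matrix to single precision in turn, run one
// time step, record ΔE and ΔA against the double-precision baseline, restore.
std::vector<TrainingSample> trainingStage(
    int numMatrices, double baseEnergy, double baseAccuracy,
    const std::function<void(int)>& setFloat,    // 1. change matrix i from DP to SP
    const std::function<void(int)>& setDouble,   // 4. restore matrix i to DP
    const std::function<void()>& runTimeStep,    // 2. execute a single time step
    const std::function<double()>& energy,       // 3. measure energy of that step
    const std::function<double()>& accuracy) {   //    ...and its accuracy
    std::vector<TrainingSample> samples;
    for (int i = 0; i < numMatrices; ++i) {
        setFloat(i);
        runTimeStep();
        samples.push_back({ baseEnergy - energy(),
                            accuracy() - baseAccuracy });
        setDouble(i);
    }
    return samples;
}
```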
14
[Charts: energy consumption [%] and accuracy loss [%] over a single solid body rotation (0/4 to 4/4), comparing the DOUBLE baseline with the run where the precision of the i-th matrix is changed from DP to SP. The energy saving ΔE should be maximized; the accuracy loss ΔA should be minimized.]
15
 Assumptions:
 ΔE – should be maximized
 ΔA – should be minimized
 Conclusion:
 R = ΔE/ΔA – the higher, the better
 Method (a minimal sketch follows below):
 Calculate R = ΔE/ΔA for each matrix and sort the matrices in decreasing order of R
 Set the matrices with the highest R from double to float
 This step is repeated as long as the accuracy loss stays below 1%
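A minimal C++ sketch of this selection step, building on the ΔE/ΔA values gathered during the training stage; the matrix identifiers and the setFloat hook are again hypothetical.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

struct Candidate { int matrixId; double deltaE, deltaA; };

// Rank the matrices by R = dE/dA (the higher, the better) and switch them to
// float one by one while the accumulated accuracy loss stays below 1%.
void selectFloatGroup(std::vector<Candidate> ranked,
                      const std::function<void(int)>& setFloat,
                      double maxAccuracyLoss = 0.01) {
    std::sort(ranked.begin(), ranked.end(),
              [](const Candidate& a, const Candidate& b) {
                  return a.deltaE / a.deltaA > b.deltaE / b.deltaA;
              });
    double accuracyLoss = 0.0;
    for (const auto& c : ranked) {
        if (accuracyLoss + c.deltaA >= maxAccuracyLoss) break;
        setFloat(c.matrixId);          // this matrix joins the float group
        accuracyLoss += c.deltaA;
    }
}
```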
Precision | Diameter | L2 norm | Diff. err. | Phase err.
DOUBLE    | 28.0     | 0.0746  | 1.7503     | 0.7576
MIXED     | 28.0     | 0.0749  | 1.7504     | 0.7576

Float group: x, xP, v3, h, v1P, v3P
Double group: v1, v2, v2P, cp, cn
16
17
 The proposed method was also validated for the other tests (double vs. mixed precision)
 The difference between the L2 norms for double and mixed precision is 0.00001
 The phase is 44.2135 in both cases
 Test: 512x512x512 – 3909 time steps
18
Precision | Nodes | Time   | Speedup | Energy | Energy reduction [%]
Double    | 1     | 335.48 | -       | 44     | -
Mixed     | 1     | 255.02 | 1.32    | 35     | 19.79
Double    | 32    | 27.52  | -       | 71     | -
Mixed     | 24    | 21.65  | 1.27    | 48     | 32.63
Conclusion: Energy consumption is reduced by 33%
[Chart: Gflop/s vs. number of nodes (1 to 36) for double and mixed precision; the best-performance configurations are marked.]
 Test: 512x512x512 – 3909 time steps
19
Precision | Configuration | Time   | Speedup | Energy | Energy reduction [%]
Double    | 1/2           | 533.65 | -       | 80     | -
Mixed     | 1/2           | 352.18 | 1.51    | 53     | 34.00
Double    | 4/8           | 144.83 | -       | 87     | -
Mixed     | 4/8           | 96.04  | 1.51    | 57     | 33.66
Conclusion: Energy consumption is reduced by 33%
[Chart: Gflop/s vs. number of GPUs (1 to 4, 2 GPUs per node) for double and mixed precision; the best-performance configurations are marked.]
 The developed implementation of MPDATA is very flexible and portable
 The proposed method allows us to automate the code adaptation even for a very large number of possible configurations
 Mixed precision arithmetic allows us to reduce the energy consumption and execution time
 It can be used on real HPC platforms without special access to the machine
 It affects the computation speed, data transfer, and scalability of the application
 The proposed method reduces the energy consumption by 33% without loss in accuracy
 It also improves the performance by a factor of 1.27 on Piz Daint and 1.51 on MICLAB relative to double precision arithmetic
20
byteLAKE (www.byteLAKE.com)
We build Artificial Intelligence software and integrate it into products.
We port and optimize algorithms for parallel CPU+GPU architectures.
We design and optimize algorithms for HPC supercomputers.
We are specialists in: Machine Learning, Deep Learning, Computer Vision, High Performance Computing, Heterogeneous Computing, Edge Computing.
Our mission: help industries transform for the era of Artificial Intelligence.
We combine business and academia. Our team consists of experts schooled by Fortune 500 corporations as well as PhD researchers.
22
Areas of Our Expertise
Computer Vision
Deep Learning
Machine Learning
Learning Optimization
AI for Edge Devices
HPC
Expertise Consultancy
Proof of Concept