ppOpen-AT:
Yet Another Directive-based AT Language
Takahiro Katagiri,
Supercomputing Research Division,
Information Technology Center,
The University of Tokyo
29 September to 4 October 2013, Dagstuhl Seminar 13401
Automatic Application Tuning for HPC Architectures
Session: Infrastructures, 10:30-11:00, Tuesday, October 1, 2013.
Collaborators:
Satoshi Ohshima, Masaharu Matsumoto
(Information Technology Center, The University of Tokyo)
QUESTIONS FOR
AT ON SUPERCOMPUTERS
IN OPERATION
Performance Portability (PP)
 Keeping high performance across multiple computing
environments.
◦ Not only multiple CPUs, but also multiple compilers.
◦ Run-time information, such as loop length and the
number of threads, is important.
 Auto-tuning (AT) is one of the candidate technologies for
establishing PP across multiple computing environments.
Questions
 Are open AT infrastructures, including numerical
libraries with AT, available for supercomputers in
operation?
 We should consider:
◦ Is a run-time code generator for AT available on
login nodes with low overhead,
and on dedicated batch-job systems?
 We need to account for different vendors, such as Fujitsu, NEC,
Hitachi, Cray, etc.
◦ Are the required software stacks available on
these systems?
 Scripting languages, such as Python, Perl, etc.
 On some Japanese supercomputers, only very limited scripting
languages are supported.
 Dedicated compilers, such as CAPS, etc.
Questions (Cont'd)
 We should consider:
◦ Do AT systems require special daemons
or OS kernel modifications?
 Additional daemons are not permitted, to avoid
high load on supercomputer login nodes.
 OS kernel modifications are not permitted,
in order to keep the vendor support contract.
 It is therefore desirable that
all AT execution be performed at the user level.
RELATED PROJECT
ppOpen-HPC (1/3)
• Open-source infrastructure for development and
execution of large-scale scientific applications on
post-peta-scale supercomputers with automatic tuning (AT)
• "pp": post-peta-scale
• Five-year project (FY2011-2015) (since April 2011)
• P.I.: Kengo Nakajima (ITC, The University of Tokyo)
• Part of "Development of System Software Technologies for
Post-Peta Scale High Performance Computing" funded by
JST/CREST (Japan Science and Technology Agency, Core
Research for Evolutional Science and Technology)
• 4.5 M$ for 5 years.
• Team of 6 institutes, >30 people (5 PDs) from
various fields: Co-Design
• ITC/U.Tokyo, AORI/U.Tokyo, ERI/U.Tokyo, FS/U.Tokyo
• Kyoto U., JAMSTEC
ppOpen-HPC (2/3)
• Source code developed on a PC with a single
processor is linked with these libraries, and the generated
parallel code is optimized for post-peta-scale systems.
• Users do not have to worry about optimization, tuning,
parallelization, etc.
• CUDA, OpenGL, etc. are hidden.
• Part of the MPI code is also hidden.
• OpenMP and OpenACC could be hidden.
– ppOpen-HPC consists of various types of optimized
libraries, which cover various types of procedures for
scientific computations.
• FEM, FDM, FVM, BEM, DEM
ppOpen-HPC covers … (figure slide)
PPOPEN-AT
BASICS
ppOpen-AT System (workflow figure)
① Before release time, the library developer embeds ppOpen-AT
directives, encoding user knowledge, into the ppOpen-APPL/* code.
② The ppOpen-AT auto-tuner automatically generates code:
candidate implementations 1, 2, 3, …, n.
③ The library user calls the library on the target computers.
④ At run time, the execution time of each candidate is measured.
⑤ The best candidate is selected.
⑥ The auto-tuned kernel is executed.
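As a rough illustration of steps ④ and ⑤, the following is a minimal hand-written sketch of run-time candidate selection: each generated variant is timed once and the fastest one is kept. The routine name update_stress_candidate and the driver itself are hypothetical illustrations under these assumptions, not the actual ppOpen-AT interface.

! Sketch only (hypothetical names, not the ppOpen-AT API):
! time each auto-generated kernel candidate and keep the fastest.
SUBROUTINE select_fastest_candidate(best_id)
  USE omp_lib
  IMPLICIT NONE
  INTEGER, INTENT(OUT) :: best_id
  INTEGER :: id
  DOUBLE PRECISION :: t0, t, tbest
  tbest = HUGE(1.0D0)
  DO id = 1, 7                          ! candidates #1 to #7
    t0 = omp_get_wtime()
    CALL update_stress_candidate(id)    ! hypothetical dispatcher over the generated variants
    t  = omp_get_wtime() - t0
    IF (t < tbest) THEN
      tbest   = t
      best_id = id
    END IF
  END DO
END SUBROUTINE select_fastest_candidate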
EARLY EXPERIENCE WITH AN
EXPLICIT METHOD
(FINITE DIFFERENCE
METHOD)
Target Application
Seism_3D:
Simulation for seismic wave analysis.
 Developed by Professor Furumura
at the University of Tokyo.
◦ The code has been re-constructed as
ppOpen-APPL/FDM.
 Finite Difference Method (FDM)
 3D simulation
◦ 3D arrays are allocated.
 Data type: single precision (real*4)
An Example of a Seism_3D Simulation
 The 2000 Western Tottori earthquake in Japan. ([1], p.14)
 A region of 820 km x 410 km x 128 km is discretized with a 0.4 km grid spacing.
 NX x NY x NZ = 2050 x 1025 x 320 (≈ 6.4 : 3.2 : 1).
[1] T. Furumura, "Large-scale Parallel FDM Simulation for Seismic Waves and Strong Shaking", Supercomputing News,
Information Technology Center, The University of Tokyo, Vol. 11, Special Edition 1, 2009. In Japanese.
Figure: Seismic wave propagation in the 2000 Western Tottori earthquake in Japan.
(a) Measured waves; (b) Simulation results. (Reference: [1], p.13)
The Heaviest Loop (10%-20% of Total Time)
DO K = 1, NZ
DO J = 1, NY
DO I = 1, NX
RL = LAM (I,J,K)
RM = RIG (I,J,K)
RM2 = RM + RM
RLTHETA = (DXVX(I,J,K)+DYVY(I,J,K)+DZVZ(I,J,K))*RL
QG = ABSX(I)*ABSY(J)*ABSZ(K)*Q(I,J,K)
SXX (I,J,K) = ( SXX (I,J,K)+ (RLTHETA + RM2*DXVX(I,J,K))*DT )*QG
SYY (I,J,K) = ( SYY (I,J,K)+ (RLTHETA + RM2*DYVY(I,J,K))*DT )*QG
SZZ (I,J,K) = ( SZZ (I,J,K) + (RLTHETA + RM2*DZVZ(I,J,K))*DT )*QG
RMAXY = 4.0/(1.0/RIG(I,J,K) + 1.0/RIG(I+1,J,K) + 1.0/RIG(I,J+1,K) + 1.0/RIG(I+1,J+1,K))
RMAXZ = 4.0/(1.0/RIG(I,J,K) + 1.0/RIG(I+1,J,K) + 1.0/RIG(I,J,K+1) + 1.0/RIG(I+1,J,K+1))
RMAYZ = 4.0/(1.0/RIG(I,J,K) + 1.0/RIG(I,J+1,K) + 1.0/RIG(I,J,K+1) + 1.0/RIG(I,J+1,K+1))
SXY (I,J,K) = ( SXY (I,J,K) + (RMAXY*(DXVY(I,J,K)+DYVX(I,J,K)))*DT) * QG
SXZ (I,J,K) = ( SXZ (I,J,K) + (RMAXZ*(DXVZ(I,J,K)+DZVX(I,J,K)))*DT) * QG
SYZ (I,J,K) = ( SYZ (I,J,K) + (RMAYZ*(DYVZ(I,J,K)+DZVY(I,J,K)))*DT) * QG
END DO
END DO
END DO
Flow dependencies: QG, computed in the first half of the loop body, is reused
by the SXY/SXZ/SYZ updates in the second half, so a naive loop split must
re-compute it (this is what the SplitPointCopyDef directive below expresses).
New ppOpen-AT Directives
- Loop Split & Fusion with Data-Flow Dependence
!oat$ install LoopFusionSplit region start
!$omp parallel do private(k,j,i,STMP1,STMP2,STMP3,STMP4,RL,RM,RM2,RMAXY,RMAXZ,RMAYZ,RLTHETA,QG)
DO K = 1, NZ
DO J = 1, NY
DO I = 1, NX
RL = LAM (I,J,K); RM = RIG (I,J,K); RM2 = RM + RM
RLTHETA = (DXVX(I,J,K)+DYVY(I,J,K)+DZVZ(I,J,K))*RL
!oat$ SplitPointCopyDef region start
QG = ABSX(I)*ABSY(J)*ABSZ(K)*Q(I,J,K)
!oat$ SplitPointCopyDef region end
SXX (I,J,K) = ( SXX (I,J,K) + (RLTHETA + RM2*DXVX(I,J,K))*DT )*QG
SYY (I,J,K) = ( SYY (I,J,K) + (RLTHETA + RM2*DYVY(I,J,K))*DT )*QG
SZZ (I,J,K) = ( SZZ (I,J,K) + (RLTHETA + RM2*DZVZ(I,J,K))*DT )*QG
!oat$ SplitPoint (K, J, I)
STMP1 = 1.0/RIG(I,J,K); STMP2 = 1.0/RIG(I+1,J,K); STMP4 = 1.0/RIG(I,J,K+1)
STMP3 = STMP1 + STMP2
RMAXY = 4.0/(STMP3 + 1.0/RIG(I,J+1,K) + 1.0/RIG(I+1,J+1,K))
RMAXZ = 4.0/(STMP3 + STMP4 + 1.0/RIG(I+1,J,K+1))
RMAYZ = 4.0/(STMP3 + STMP4 + 1.0/RIG(I,J+1,K+1))
!oat$ SplitPointCopyInsert
SXY (I,J,K) = ( SXY (I,J,K) + (RMAXY*(DXVY(I,J,K)+DYVX(I,J,K)))*DT )*QG
SXZ (I,J,K) = ( SXZ (I,J,K) + (RMAXZ*(DXVZ(I,J,K)+DZVX(I,J,K)))*DT )*QG
SYZ (I,J,K) = ( SYZ (I,J,K) + (RMAYZ*(DYVZ(I,J,K)+DZVY(I,J,K)))*DT )*QG
END DO; END DO; END DO
!$omp end parallel do
!oat$ install LoopFusionSplit region end
The SplitPointCopyDef region defines the re-calculation (here, QG);
SplitPointCopyInsert marks where that re-calculation is inserted into the
second loop after splitting; SplitPoint (K, J, I) marks the candidate loop
split points.
Candidates of Auto-generated Codes
 #1 [Baseline]: Original 3-nested loop
 #2 [Split]: Loop splitting at the I-loop
 #3 [Split]: Loop splitting at the J-loop
 #4 [Split]: Loop splitting at the K-loop
(separated into two 3-nested loops; see the sketch after this list)
 #5 [Split&Fusion]: Loop fusion applied to #2
(2-nested loop)
 #6 [Fusion]: Loop fusion applied to #1
(loop collapse)
 #7 [Fusion]: Loop fusion applied to #1
(2-nested loop)
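To make the transformation concrete, the following is a minimal hand-written sketch of candidate #4 (split at the K loop): the loop body is cut at the SplitPoint, and the QG statement from the SplitPointCopyDef region is copied into the second loop at the SplitPointCopyInsert position. It illustrates the idea only; it is not the literal output of the ppOpen-AT code generator.

! First half: normal-stress updates (SXX, SYY, SZZ).
!$omp parallel do private(k,j,i,RL,RM,RM2,RLTHETA,QG)
DO K = 1, NZ
 DO J = 1, NY
  DO I = 1, NX
   RL = LAM(I,J,K); RM = RIG(I,J,K); RM2 = RM + RM
   RLTHETA = (DXVX(I,J,K)+DYVY(I,J,K)+DZVZ(I,J,K))*RL
   QG = ABSX(I)*ABSY(J)*ABSZ(K)*Q(I,J,K)
   SXX(I,J,K) = ( SXX(I,J,K) + (RLTHETA + RM2*DXVX(I,J,K))*DT )*QG
   SYY(I,J,K) = ( SYY(I,J,K) + (RLTHETA + RM2*DYVY(I,J,K))*DT )*QG
   SZZ(I,J,K) = ( SZZ(I,J,K) + (RLTHETA + RM2*DZVZ(I,J,K))*DT )*QG
  END DO
 END DO
END DO
!$omp end parallel do
! Second half: shear-stress updates (SXY, SXZ, SYZ).
!$omp parallel do private(k,j,i,STMP1,STMP2,STMP3,STMP4,RMAXY,RMAXZ,RMAYZ,QG)
DO K = 1, NZ
 DO J = 1, NY
  DO I = 1, NX
   ! QG is re-computed here: the SplitPointCopyDef region is copied in,
   ! because QG carries a flow dependence across the split point.
   QG = ABSX(I)*ABSY(J)*ABSZ(K)*Q(I,J,K)
   STMP1 = 1.0/RIG(I,J,K); STMP2 = 1.0/RIG(I+1,J,K); STMP4 = 1.0/RIG(I,J,K+1)
   STMP3 = STMP1 + STMP2
   RMAXY = 4.0/(STMP3 + 1.0/RIG(I,J+1,K) + 1.0/RIG(I+1,J+1,K))
   RMAXZ = 4.0/(STMP3 + STMP4 + 1.0/RIG(I+1,J,K+1))
   RMAYZ = 4.0/(STMP3 + STMP4 + 1.0/RIG(I,J+1,K+1))
   SXY(I,J,K) = ( SXY(I,J,K) + (RMAXY*(DXVY(I,J,K)+DYVX(I,J,K)))*DT )*QG
   SXZ(I,J,K) = ( SXZ(I,J,K) + (RMAXZ*(DXVZ(I,J,K)+DZVX(I,J,K)))*DT )*QG
   SYZ(I,J,K) = ( SYZ(I,J,K) + (RMAYZ*(DYVZ(I,J,K)+DZVY(I,J,K)))*DT )*QG
  END DO
 END DO
END DO
!$omp end parallel do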
Overview
1. Background and ppOpen-HPC
Project
2. ppOpen-AT Basics
3. Adaptation to an FDM
Application
4. Performance Evaluation
5. Conclusion
PERFORMANCE EVALUATION
WITH
PPOPEN-APPL/FDM
IN ALPHA VERSION
Takahiro Katagiri, Satoshi Ito, Satoshi Ohshima,
"Early Experiences for Adaptation of Auto-tuning by ppOpen-AT to an Explicit Method",
Special Session: Auto-Tuning for Multicore and GPU (ATMG)
(in conjunction with IEEE MCSoC-13), National Institute of Informatics,
Tokyo, Japan, September 26-28, 2013.
Test Environments
1. FX10 (Fujitsu PRIMEHPC FX10)
◦ SPARC64 IXfx (1.848 GHz), 16 cores, maximum 16 threads.
◦ Fujitsu Fortran Compiler, Version 1.2.1.
◦ Options: -Kfast, -openmp.
2. T2K (AMD Quad-core Opteron (Barcelona))
◦ AMD Opteron 8356 (2.3 GHz), 16 cores (4 sockets), maximum 16 threads.
◦ Intel Fortran Compiler, Version 11.0.
◦ Options: -fast -openmp -mcmodel=medium.
3. Sandy Bridge (Intel Sandy Bridge)
◦ Xeon E5-2687W (3.1 GHz, Turbo Boost off), 8 physical cores / 16 threads per socket,
2 sockets, maximum 32 threads.
◦ Intel Fortran Compiler, Version 12.1.
◦ Options: -fast -openmp -mcmodel=medium.
4. SR16K (Hitachi SR16000/M1)
◦ IBM POWER7 (3.83 GHz), 32 cores (4 sockets), maximum 64 threads (SMT).
◦ HITACHI Optimization Fortran, Version 03-01-/B.
◦ Options: -opt=ss -omp.
AT Effect: Very Small and Small Sizes
[Figure: execution time in seconds versus number of threads for implementations #1-#7 on four platforms:
(A) FX10 (Very Small, #repeat = 100,000); (B) T2K (Very Small, #repeat = 100,000);
(C) Sandy Bridge (Small, #repeat = 1,000); (D) SR16K (Small, #repeat = 1,000).]
The best implementations differ by platform (#2 and #5; #4, #5, and #7; #2, #3, #4, and #5; #2, #4, and #5).
#5 and #7 became the best as the number of threads was increased.
AT Effect: Large Size
[Figure: execution time in seconds versus number of threads for implementations #1-#7 on four platforms,
each with #repeat = 10: (A) FX10; (B) T2K; (C) Sandy Bridge; (D) SR16K.]
The best implementations again differ by platform (#2, #3, and #5; #2 and #7; #5; #4):
no single fixed implementation was the best on all platforms.
AT Effect for Hybrid MPI-OpenMP Execution
[Figure: speedup relative to pure MPI execution for various hybrid MPI-OpenMP configurations
(PXTY: X processes, Y threads per process), on the FX10, kernel: update_stress.
Without AT, hybrid MPI-OpenMP execution shows no merit over pure MPI;
with AT, hybrid execution gains over pure MPI.]
By adapting the loop transformations from the AT, we obtained:
 A maximum 1.5x speedup for pure MPI execution (without threads).
 A maximum 2.5x speedup over pure MPI with hybrid MPI-OpenMP execution.
ANSWERS
AND
PLANS FOR THE FUTURE
Current Answers for AT Systems
A minimal software-stack
requirement is important for using
AT facilities on supercomputers in
operation.
Since there is no standardization of
AT functions, AT must currently be
implemented with fully user-level
execution.
Future Direction
 Standardization of AT functions for
supercomputers is an important future direction,
covering:
◦ Performance monitors.
◦ Code generators, especially dynamic code generators.
◦ Job schedulers, such as batch-job systems.
◦ Compiler optimizations, including directives and compiler
options.
◦ Definitions of AT targets, such as execution speed, memory
usage, or power consumption.
◦ etc.
 Building a standardization strategy for AT functions
together with vendors is important.
◦ The Message Passing Interface (MPI) standardization by the MPI
Forum is one success story for such standardization.
◦ Why not create a standard and a forum for AT?