HCQC - HPC compiler quality checker
Masaki Arai
masaki.arai@linaro.org
LEG HPC-SIG
arai.masaki@jp.fujitsu.com
FUJITSU LABORATORIES LTD.
Background and Purpose
• The quality of the kernel part is important in HPC
applications (number crunching on supercomputers).
• Make it easy to check the quality of compiler
optimizations and acquire data to improve them
HCQC: HPC compiler quality checker
Subject of Quality Check
• Configuration file defines the subject of quality check.
• Main items:
– Compiler
– Compiler version
– Optimization flags
{
  "DISTRIBUTION" : "OpenSUSE Tumbleweed",
  "ARCH" : "aarch64",
  "CPU" : "AMD Opteron A1100 Cortex A57",
  "LANGUAGE" : "C",
  "COMPILER" : "GCC",
  "COMMAND" : "/usr/bin/gcc",
  "VERSION" : "7.1.1",
  "OPT_FLAGS" : ["-O2"],
  "ASM_FLAGS" : ["-S", "-fverbose-asm"],
  "FLAG_DB" : [["?DEBUG_FLAG", "-g"],
               ["?C99_STANDARD", "-std=c99"]]
}
Example of configuration file
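As a rough illustration of how such a configuration could be consumed, the sketch below loads the JSON shown on the slide and assembles a command line that would emit assembly. The helper names `resolve_flags` and `asm_command`, the source file name `kernel.c`, and the reading of FLAG_DB as a table from symbolic `?NAME` placeholders to real compiler flags are assumptions made for this example; they are not taken from the HCQC sources.

```python
import json

# Configuration text copied from the slide (quotes normalised to plain ASCII).
CONFIG_TEXT = """
{
  "DISTRIBUTION": "OpenSUSE Tumbleweed",
  "ARCH": "aarch64",
  "CPU": "AMD Opteron A1100 Cortex A57",
  "LANGUAGE": "C",
  "COMPILER": "GCC",
  "COMMAND": "/usr/bin/gcc",
  "VERSION": "7.1.1",
  "OPT_FLAGS": ["-O2"],
  "ASM_FLAGS": ["-S", "-fverbose-asm"],
  "FLAG_DB": [["?DEBUG_FLAG", "-g"],
              ["?C99_STANDARD", "-std=c99"]]
}
"""


def resolve_flags(symbolic_flags, flag_db):
    """Replace '?NAME' placeholders using FLAG_DB; pass other flags through.

    Assumption: FLAG_DB maps compiler-independent names to real flags.
    """
    table = dict(flag_db)
    return [table.get(flag, flag) for flag in symbolic_flags]


def asm_command(config, source, extra_flags=()):
    """Build (but do not run) a command line that emits assembly."""
    return ([config["COMMAND"]]
            + config["OPT_FLAGS"]
            + config["ASM_FLAGS"]
            + resolve_flags(extra_flags, config["FLAG_DB"])
            + [source])


config = json.loads(CONFIG_TEXT)
command = asm_command(config, "kernel.c", ["?C99_STANDARD"])  # hypothetical source file
print(" ".join(command))
```

Keeping the compiler, its version, and its flags in one file means that switching the subject of the check (for example from GCC to Clang/LLVM) only requires a different configuration file, not a different test program.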
Metrics for Quality Evaluation
• HCQC has the following metrics:
– op: number of mnemonics in the assembly code
– kind: kinds of mnemonics in the assembly code (memory, branch, other)
– regalloc: quality of register allocation (number of spill-in/spill-out instructions)
– height: height of the instruction dependence graph
– ilp: instruction-level parallelism achieved by the instruction scheduler
– vectorize: vectorization/SIMDization status
– swpl: initiation interval (II) achieved by software pipelining
These data are basically static data at compile time.
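To make the `op` and `kind` metrics concrete, here is a minimal sketch that counts mnemonics in an assembly fragment and sorts them into memory, branch, and other. The fragment is the LLVM loop body `.LBB0_2` from the mVMC-mini example in these slides. The `MEMORY` and `BRANCH` tables are hypothetical and deliberately incomplete, and the functions are not the metric programs shipped with HCQC.

```python
from collections import Counter

# Loop body of .LBB0_2 from the mVMC-mini example in these slides.
ASM = """
.LBB0_2:
    ldr w1, [x5, x16, lsl #2]
    sdiv w2, w16, w13
    lsl x3, x16, #3
    add x16, x16, #1
    madd w1, w17, w2, w1
    sbfiz x1, x1, #3, #32
    ldr x2, [x18, x1]
    cmp x11, x16
    str x2, [x10, x3]
    ldr x1, [x0, x1]
    str x1, [x9, x3]
    b.ne .LBB0_2
"""

# Hypothetical, deliberately incomplete AArch64 classification tables.
MEMORY = {"ldr", "ldp", "ldur", "str", "stp", "stur"}
BRANCH = {"b", "bl", "br", "blr", "ret", "cbz", "cbnz", "tbz", "tbnz"}


def mnemonics(asm_text):
    """Yield the mnemonic of each instruction line, skipping labels and directives."""
    for line in asm_text.splitlines():
        line = line.split("//")[0].strip()      # drop trailing comments
        if not line or line.endswith(":") or line.startswith("."):
            continue
        yield line.split()[0]


def classify(mnemonic):
    """Map a mnemonic to 'memory', 'branch', or 'other'."""
    base = mnemonic.split(".")[0]               # "b.ne" -> "b"
    if base in MEMORY:
        return "memory"
    if base in BRANCH:
        return "branch"
    return "other"


op = Counter(mnemonics(ASM))                             # the `op` metric
kind = Counter(classify(m) for m in mnemonics(ASM))      # the `kind` metric
print(op["lsl"], dict(kind))
```

Only the first token of a line counts as a mnemonic, so the `lsl #2` inside the addressing mode of the first `ldr` is not counted. The single `lsl` found for this block agrees with the `lsl` column reported for `.LBB0_2` in the comparison slide.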
Investigation Result
ARCH : aarch64
CPU : AMD Opteron A1100 Cortex A57
LANGUAGE : C
COMPILER : ClangLLVM
COMMAND : /usr/bin/clang
VERSION : 4.0.1
OPT_FLAGS : -O2
TEST_PROGRAM : sample
KERNEL_FUNCTION_NAME : kernel
DATE : 2017/11/07

(- marks an empty cell)
                               kind                regalloc           ilp  swpl vectorize
BLOCK     CFG            DEPTH memory branch other spill in spill out IPC  II   kind     mem arith other
BB0       cond LBB0_11   0     3      1      3     0        0         0.5  -    -        0   0     0
BB1       -              0     0      0      4     0        0         0.5  -    -        0   0     0
LBB0_2    -              1     3      0      5     0        1         0.7  -    -        0   0     0
LBB0_3    cond LBB0_5    2     2      1      1     2        0         0.9  -    SLP      0   1     0
BB4       LBB0_7 LBB0_9  2     1      2      5     0        0         0.9  -    SLP      0   1     1
LBB0_5    cond LBB0_9    2     3      1      5     0        0         1.3  -    SLP      0   1     1
LBB0_7    -              2     3      0      5     0        3         1.7  -    SLP      0   1     2
LBB0_8    cond LBB0_8    3     7      1      5     0        0         2.5  5    LOOP,SLP 2   2     2
LBB0_9    cond LBB0_3    2     2      1      3     0        2         1.3  -    SLP      0   1     1
BB10      cond LBB0_2    1     1      1      2     0        0         0.8  -    -        0   0     0
LBB0_11   -              0     0      0      1     0        0         0.2  -    -        0   0     0
*SUMMARY* -              -     25     8      39    2        6         -    -    -        2   7     7
Quality Evaluation by Comparison
• A single investigation result has little meaning on its own.
• Typical comparison examples:
– GCC vs. LLVM (on AArch64)
– LLVM 4.0.0 vs. LLVM 5.0.0 (on AArch64)
– LLVM with -O2 vs. LLVM with -O3 (on AArch64)
– LLVM on AArch64 vs. LLVM on x86_64: missing optimizations on AArch64
– LLVM on AArch64 vs. ICC on x86_64: optimization hints for SVE from AVX code
Example of Comparison(1)
• HimenoBMT-dynamic (regalloc)
– GCC is better than Clang/LLVM: it generates fewer spill instructions in total (34 in / 25 out vs. 37 in / 31 out).

(- marks an empty cell)

GCC
BLOCK     CFG        DEPTH spill in spill out
jacobi    cond .L18  0     0        7
-         -          0     0        5
.L3       cond .L16  1     1        0
-         -          1     3        2
.L8       cond .L28  2     1        0
-         -          2     4        6
.L11      cond .L7   3     0        1
-         -          3     10       0
.L4       cond .L4   4     0        0
.L7       cond .L11  3     2        0
.L5       cond .L8   2     5        1
-         -          1     5        0
.L9       cond .L13  2     0        0
-         -          2     0        0
.L17      cond .L15  3     0        0
-         -          3     0        0
.L12      cond .L12  4     0        0
.L15      cond .L17  3     0        0
.L13      cond .L9   2     0        0
.L16      cond .L3   1     2        1
-         end        0     0        0
.L28      goto .L5   2     1        2
.L18      end        0     0        0
*SUMMARY* -          -     34       25

LLVM
BLOCK     CFG            DEPTH spill in spill out
jacobi    cond .LBB0_32  0     0        12
-         -              0     0        6
.LBB0_2   cond .LBB0_31  1     1        0
-         -              1     0        2
.LBB0_4   cond .LBB0_11  2     0        1
-         -              2     0        1
.LBB0_6   cond .LBB0_10  4     0        0
-         -              3     13       7
.LBB0_8   cond .LBB0_8   4     2        0
-         cond .LBB0_6   3     3        0
-         goto .LBB0_11  2     0        0
.LBB0_10  cond .LBB0_6   3     0        0
.LBB0_11  cond .LBB0_4   2     3        2
-         cond .LBB0_30  1     1        0
-         -              1     3        0
.LBB0_14  cond .LBB0_29  2     0        0
-         goto .LBB0_19  2     0        0
.LBB0_16  -              3     0        0
.LBB0_17  cond .LBB0_17  4     0        0
-         cond .LBB0_28  3     0        0
-         goto .LBB0_26  3     0        0
.LBB0_19  cond .LBB0_28  3     0        0
-         cond .LBB0_22  3     1        0
-         goto .LBB0_26  3     0        0
.LBB0_22  cond .LBB0_25  3     0        0
-         cond .LBB0_16  3     0        0
-         cond .LBB0_16  3     0        0
.LBB0_25  -              3     0        0
.LBB0_26  -              3     0        0
.LBB0_27  cond .LBB0_27  4     0        0
.LBB0_28  cond .LBB0_19  3     0        0
.LBB0_29  cond .LBB0_14  2     1        0
-         goto .LBB0_31  1     0        0
.LBB0_30  -              1     1        0
.LBB0_31  cond .LBB0_2   1     0        0
-         goto .LBB0_33  0     0        0
.LBB0_32  -              0     0        0
.LBB0_33  end            0     8        0
*SUMMARY* -              -     37       31
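A comparison of this kind comes down to subtracting the *SUMMARY* rows of two reports. The sketch below does that for the spill totals of the HimenoBMT-dynamic example above. The dictionary layout and the `compare` helper are assumptions for illustration and do not reflect how HCQC stores or compares its reports.

```python
# *SUMMARY* rows from the HimenoBMT-dynamic (regalloc) comparison above.
summary = {
    "GCC":  {"spill in": 34, "spill out": 25},
    "LLVM": {"spill in": 37, "spill out": 31},
}


def compare(baseline, candidate):
    """Return candidate minus baseline for every metric column."""
    return {name: candidate[name] - baseline[name] for name in baseline}


delta = compare(summary["GCC"], summary["LLVM"])
for name, diff in delta.items():
    print(f"{name}: LLVM has {diff:+d} relative to GCC")
```

A positive difference means that the candidate compiler emitted more spill instructions than the baseline for the same kernel.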
Example of Comparison(2)
• mVMC-mini-calculateNewPfMTwo_child (op)
– GCC generates code with better addressing modes: LLVM emits two separate lsl instructions, while the lsl column for GCC is empty.

(- marks an empty cell)

LLVM
BLOCK                     CFG            DEPTH lsl
calculateNewPfMTwo_child  cond .LBB0_3   0     0
-                         -              0     0
.LBB0_2                   cond .LBB0_2   1     1
.LBB0_3                   cond .LBB0_11  0     0
-                         -              0     0
.LBB0_5                   cond .LBB0_5   1     0
-                         cond .LBB0_12  0     0
-                         -              0     1
.LBB0_8                   -              1     0
.LBB0_9                   cond .LBB0_9   2     0
-                         cond .LBB0_8   1     0
-                         goto .LBB0_13  0     0
.LBB0_11                  -              0     0
.LBB0_12                  -              0     0
.LBB0_13                  end            0     0
*SUMMARY*                 -              -     2

GCC
BLOCK                     CFG        DEPTH lsl
calculateNewPfMTwo_child  cond .L2   0     -
-                         -          0     -
.L3                       cond .L3   1     -
.L2                       cond .L8   0     -
-                         -          0     -
.L5                       cond .L5   1     -
-                         -          0     -
.L7                       -          1     -
.L6                       cond .L6   2     -
-                         cond .L7   1     -
-                         -          0     -
.L4                       end        0     -
.L8                       goto .L4   0     -
*SUMMARY*                 -          -     -

LLVM code for .LBB0_2:
.LBB0_2:
        ldr w1, [x5, x16, lsl #2]
        sdiv w2, w16, w13
        lsl x3, x16, #3
        add x16, x16, #1
        madd w1, w17, w2, w1
        sbfiz x1, x1, #3, #32
        ldr x2, [x18, x1]
        cmp x11, x16
        str x2, [x10, x3]
        ldr x1, [x0, x1]
        str x1, [x9, x3]
        b.ne .LBB0_2 // x3 is dead
Example of Comparison(3)
• HimenoBMT-static (height)
– Clang/LLVM is not good at register usage.

LLVM
BLOCK    CFG           DEPTH #  height
.LBB1_8  cond .LBB1_8  4     71 25

GCC
BLOCK    CFG           DEPTH #  height
.L19     cond .L19     4     62 17

# / height: LLVM 71/25 = 2.84, GCC 62/17 = 3.647
.LBB1_8:
add x17, x4, x14
mov v6.16b, v17.16b
ldr s17, [x17, #8844]
add x16, x3, x14
mov v7.16b, v16.16b
ldr s16, [x16, #8844]
add x17, x25, x14
ldr s21, [x17, x7]
fmul s0, s17, s0
ldr s17, [x17, x21]
add x16, x5, x14
mov v5.16b, v18.16b
ldr s18, [x16, #8844]
……
LLVM
The 'height' metric has not been committed to GitHub yet.
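Since the `height` metric is not public, the following is only a guess at what such a metric could compute: the length of the longest chain of read-after-write dependences inside one basic block. The register sets for the six instructions taken from the LLVM listing above were annotated by hand, `s17` and `v17` are treated as the same register, and instruction latencies are ignored. The real metric may differ on all of these points.

```python
def graph_height(instructions):
    """Longest dependence chain, counted in instructions.

    `instructions` is a list of (defs, uses) pairs of register-name sets in
    program order. Only read-after-write dependences are followed.
    """
    last_writer = {}   # register -> index of the latest instruction defining it
    level = []         # level[i] = length of the longest chain ending at i
    for index, (defs, uses) in enumerate(instructions):
        preds = [last_writer[reg] for reg in uses if reg in last_writer]
        level.append(1 + max((level[p] for p in preds), default=0))
        for reg in defs:
            last_writer[reg] = index
    return max(level, default=0)


# First six instructions of .LBB1_8, annotated by hand (assumption).
block = [
    ({"x17"}, {"x4", "x14"}),   # add x17, x4, x14
    ({"v6"},  {"v17"}),         # mov v6.16b, v17.16b
    ({"v17"}, {"x17"}),         # ldr s17, [x17, #8844]
    ({"x16"}, {"x3", "x14"}),   # add x16, x3, x14
    ({"v7"},  {"v16"}),         # mov v7.16b, v16.16b
    ({"v16"}, {"x16"}),         # ldr s16, [x16, #8844]
]

height = graph_height(block)
print(len(block), height, len(block) / height)
```

Dividing the instruction count by the height gives the average number of instructions per dependence level. If the `#` column is the instruction count of the block, this is the ratio the slide reports as 71/25 and 62/17.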
Workflow of HCQC
① Compile one test program
② Run the executable file and verify its result by
comparing output and answer data
③ Generate the assembly code file
④ Make the control flow graph of the kernel part from
the assembly code
⑤ Get result data using metric programs
⑥ Make the report file from data
% hcqc config test metric+
% hcqc-report config test metric+
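Step ④ of the workflow can be pictured with a small sketch that splits assembly text into basic blocks and labels each block the way the CFG column of the reports does (`cond`, `goto`, `end`, or nothing for fall-through). This is not the `cfg.py` used by HCQC. The `CONDITIONAL` table, the `SAMPLE` kernel, and the function name are hypothetical, assembler directives are not filtered out, and the DEPTH column (loop nesting) is not computed.

```python
CONDITIONAL = {"cbz", "cbnz", "tbz", "tbnz"}   # hypothetical, incomplete table

# Made-up AArch64 kernel used only to exercise the sketch.
SAMPLE = """
kernel:
    cbz w0, .L3
    mov w1, #0
.L2:
    add w1, w1, #1
    cmp w1, w0
    b.ne .L2
.L3:
    ret
"""


def split_blocks(asm_text):
    """Split assembly into basic blocks; return a list of (label, cfg) pairs.

    `cfg` is "cond <target>", "goto <target>", "end", or "" for fall-through.
    """
    blocks = []
    label, open_block = "", False
    for raw in asm_text.splitlines():
        line = raw.split("//")[0].strip()
        if not line:
            continue
        if line.endswith(":"):
            if open_block:                    # previous block falls through
                blocks.append((label, ""))
            label, open_block = line[:-1], True
            continue
        open_block = True
        words = line.replace(",", " ").split()
        mnemonic, target = words[0], words[-1]
        if mnemonic == "ret":
            cfg = "end"
        elif mnemonic == "b":
            cfg = "goto " + target
        elif mnemonic.startswith("b.") or mnemonic in CONDITIONAL:
            cfg = "cond " + target
        else:
            continue                          # ordinary instruction
        blocks.append((label, cfg))
        label, open_block = "", False         # next block starts unlabelled
    if open_block:
        blocks.append((label, ""))
    return blocks


for label, cfg in split_blocks(SAMPLE):
    print(f"{label or '-':8} {cfg or '-'}")
```

A block that follows a branch without a label of its own gets an empty name, which matches the unlabelled rows seen in the reports.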
Workflow of HCQC
[Figure: HCQC workflow diagram. Inputs are the Configuration File (for example COMMAND:/usr/bin/clang, OPT_FLAGS:-O2) and the Test Program with its Info File. ① The test program is compiled into an Executable File. ② The executable is run on in.data, and out.data is compared with result.data (diff check). ③ The Assembly Code File is generated. ④ cfg.py derives CFG+DEPTH from the assembly code. ⑤ Each Metric Program produces Result Data. ⑥ hcqc-report combines the result data into the Report file (csv); the sample report shown is the one from the "Investigation Result" slide, with an additional line DISTRIBUTION : OpenSUSE Tumbleweed. Annotations in the diagram: "JSON format file", "Debug Information?".]
Test Programs for HCQC
• Generated from programs that caused problems for Fujitsu's
production compilers in the past
• Kernel parts are extracted and modified for use under HCQC
– Extract hot spots
– If the program is written in Fortran, convert it to C
(for comparison between GCC and Clang/LLVM)
– Prepare the data needed to run and check those kernel parts
Test Programs for HCQC
• All original benchmarks are publicly available.
• I/O data for HCQC is being prepared.
– Some data files are very large.
– An error tolerance is required when comparing results across
different architectures or optimization levels.
benchmark name      kernel name
HimenoBMT-dynamic   jacobi
HimenoBMT-static    jacobi
hpcg-3.0            ComputeSYMGS_ref
ccs-qcd             bicgstab_hmc
ccs-qcd             clover
ffb-mini            CALAX3
ffb-mini            FLD3X2
ffb-mini            GRAD3X
ffvc-mini           poi_residual
ffvc-mini           psor2sma_core
mVMC-mini           calculateNewPfMTwo_child
mVMC-mini           updateMAllTwo_child
mVMC-mini           updateMAll_child
ngsa-mini           bwt_match_exact_alt
ngsa-mini           bwt_match_gap
nicam-dc-mini       vi_path2
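The slide notes that an error tolerance is required when checking results across architectures and optimization levels. One way to do that, sketched here under assumptions, is to compare the output and the answer data token by token and allow a small relative error on numeric tokens. The function name, the default tolerances, and the whitespace-separated text format are choices made for this example, not the check that HCQC performs.

```python
import math


def outputs_match(output_text, answer_text, rel_tol=1e-6, abs_tol=1e-12):
    """Compare two whitespace-separated dumps token by token.

    Tokens that both parse as floats are compared with a tolerance;
    everything else must match exactly.
    """
    out_tokens, ans_tokens = output_text.split(), answer_text.split()
    if len(out_tokens) != len(ans_tokens):
        return False
    for out, ans in zip(out_tokens, ans_tokens):
        try:
            same = math.isclose(float(out), float(ans),
                                rel_tol=rel_tol, abs_tol=abs_tol)
        except ValueError:
            same = out == ans                  # non-numeric token
        if not same:
            return False
    return True


print(outputs_match("gosa 1.0000001", "gosa 1.0000002"))   # within tolerance
print(outputs_match("gosa 1.0", "gosa 1.1"))               # outside tolerance
```

A plain `diff` would reject the first pair even though the values differ only in the last printed digit, which is the kind of difference that floating-point reordering at another optimization level can produce.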
Future Work
• Add support for SVE (if available in GCC or LLVM)
• Implement metric programs:
– vectorization (vectorize)
– software pipelining (swpl)
– instruction-level parallelism (ilp)
• Add features for comparing with x86_64 (SVE vs. AVX)
• Add tools for automatic and intelligent comparison
URL https://github.com/Linaro/hcqc
Thank you very much!
Any comments or suggestions are welcome.
