1
Using GCC Auto-Vectorizer
Ira Rosen <ira.rosen@linaro.org>
Michael Hope <michael.hope@linaro.org>
r1
bzr branch lp:~michaelh1/+junk/using-the-vectorizer
2
Using GCC Vectorizer
● Vectorization is enabled by the flag -ftree-vectorize and by
default at -O3:
● gcc -O2 -ftree-vectorize myloop.c
● or gcc -O3 myloop.c
● To enable NEON:
-mfpu=neon -mfloat-abi=softfp or -mfloat-abi=hard
● Information on which loops got vectorized, and which
didn’t and why:
● -fdump-tree-vect(-details)
– dumps information into myloop.c.##t.vect
● -ftree-vectorizer-verbose=[X]
– dumps to stderr
● More information:
http://gcc.gnu.org/projects/tree-ssa/vectorization.html
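As a minimal sketch of the kind of loop these flags act on, assume myloop.c (the file name used above) contains the hypothetical function scale_add below. The function name and its parameters are illustrative and not taken from the slides. Compiling it with "gcc -O3 -c myloop.c" plus one of the reporting flags above should show whether the loop was vectorized. Recent GCC releases report the same kind of information through the -fopt-info family of flags.

```c
/* Hypothetical example: y[i] = a * x[i] + y[i].
   The loop is countable, has no control flow, and touches
   consecutive elements, so it is a candidate for vectorization. */
void scale_add(float *__restrict__ y, const float *__restrict__ x,
               float a, int n)
{
    int i;
    for (i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Whether the loop is actually vectorized depends on the target, the GCC version and the cost model, so the compiler report is the thing to check.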
3
Other useful flags
● -ffast-math - if operating on floats in a reduction
computation (to allow the vectorizer to change the
order of the computation)
● -funsafe-loop-optimizations - if using "unsigned int"
loop counters (lets the compiler assume that they do not overflow)
● -ftree-loop-if-convert-stores - more aggressive
if-conversion
● --param min-vect-loop-bound=[X] - if you have loops with a
short trip count
● -fno-tree-vect-loop-version - if worried about code size
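A short sketch of the reduction case that -ffast-math is about. Floating-point addition is not associative, so summing in vector-sized groups can give a slightly different result from the strict left-to-right sum, and the compiler needs permission to reorder. The function name sum_floats is hypothetical.

```c
/* Hypothetical example of a float reduction.
   Without permission to reorder the additions (e.g. -ffast-math),
   the vectorizer has to keep the strict left-to-right order. */
float sum_floats(const float *a, int n)
{
    float sum = 0.0f;
    int i;
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```

Integer reductions do not have this problem, because integer addition can be reordered without changing the result.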
4
What's vectorizable
● Innermost loops
● countable
● no control flow
● independent data accesses
● contiguous data accesses
for (k = 0; k < m; k ++)
  for (j = 0; j < m; j ++)
    for (i = 0; i < n; i ++)
      a[k][j][i] = b[k][j][i] * c[k][j][i];
Example of a loop that is not vectorizable:
while (a[i] != 8)       /* uncountable */
{
  if (a[i] != 0)        /* control flow */
    a[i] = a[i-1];      /* loop-carried dependence */
  b[i+stride] = 0;      /* access with unknown stride */
}
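To make the "independent data accesses" requirement concrete, here is a hedged sketch contrasting two hypothetical functions (neither name comes from the slides). In the first, every iteration reads the value the previous iteration wrote. In the second, iterations do not depend on each other.

```c
/* Loop-carried dependence: a[i] needs the a[i-1] computed in the
   previous iteration, so the iterations cannot simply run in parallel. */
void prefix_sum(int *a, int n)
{
    int i;
    for (i = 1; i < n; i++)
        a[i] += a[i - 1];
}

/* Independent iterations: each a[i] depends only on its own old
   value, so several elements can be processed at once. */
void add_constant(int *a, int n, int c)
{
    int i;
    for (i = 0; i < n; i++)
        a[i] += c;
}
```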
5
Special features
● vectorization of outer loops
● vectorization of straight-line code
● if-conversion
● multiple data-types and type conversions
● recognition of special idioms (e.g. dot-product,
widening operations)
● strided memory accesses
● cost model
● runtime aliasing and alignment tests
● auto-detection of vector size
Examples:
http://gcc.gnu.org/projects/tree-ssa/vectorization.html
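Two of the features listed above can be sketched with small loops. The first is a dot product over 16-bit inputs accumulated in 32 bits, which is the shape the dot-product and widening idioms target. The second contains a condition that can be turned into a select by if-conversion. Both function names are hypothetical, and whether a given compiler version recognizes them is an assumption to check with the vectorizer report.

```c
#include <stdint.h>

/* Dot-product idiom: narrow inputs, wider accumulator. */
int32_t dot_product(const int16_t *__restrict__ a,
                    const int16_t *__restrict__ b, int n)
{
    int32_t sum = 0;
    int i;
    for (i = 0; i < n; i++)
        sum += (int32_t)a[i] * b[i];
    return sum;
}

/* If-conversion candidate: the condition only selects a value,
   so it does not have to remain a branch. */
void clamp_negatives(int *__restrict__ out,
                     const int *__restrict__ in, int n)
{
    int i;
    for (i = 0; i < n; i++)
        out[i] = in[i] < 0 ? 0 : in[i];
}
```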
6
GCC Versions
● Current Linaro GCC is based on FSF GCC 4.6
● Once FSF GCC 4.7 is released (in about six
months) Linaro GCC will switch to GCC 4.7
● Some GCC 4.7 vectorizer-related features:
● __builtin_assume_aligned – alignment hints
● vectorization of conditions with mixed types
● vectorization of bool
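A minimal sketch of how the alignment hint might be used, assuming the caller really does pass 16-byte-aligned buffers. The function name double_all and the 16-byte figure are illustrative assumptions. __builtin_assume_aligned returns its first argument, and the compiler may then assume the returned pointer has the stated alignment.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical example: dst[i] = 2 * src[i].
   Assumption: both pointers are 16-byte aligned. */
void double_all(float *__restrict__ dst,
                const float *__restrict__ src, int n)
{
    /* Check the assumption in debug builds: the hint is a promise,
       and breaking it is undefined behaviour. */
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0);

    float *d = __builtin_assume_aligned(dst, 16);
    const float *s = __builtin_assume_aligned(src, 16);
    int i;
    for (i = 0; i < n; i++)
        d[i] = 2.0f * s[i];
}
```

With GCC, suitably aligned buffers can be declared with, for example, `static float buf[64] __attribute__((aligned(16)));`.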
8
Vectorizing for NEON
libswscale/rgb2rgb_template.c:
static inline void rgb24tobgr16_c(
    const uint8_t *src, uint8_t *dst,
    int src_size) {
  const uint8_t *s = src;
  const uint8_t *end;
  uint16_t *d = (uint16_t *)dst;
  end = s + src_size;
  while (s < end) {
    const int b = *s++;
    const int g = *s++;
    const int r = *s++;
    *d++ = (b>>3) | ((g&0xFC)<<3)
         | ((r&0xF8)<<8);
  }
}
.L12:
mov r1, r4
add r0, r0, #1
vld3.8 {d16, d18, d20}, [r1]!
cmp r0, r6
mov r2, ip
add r4, r4, #48
add ip, ip, #32
vld3.8 {d17, d19, d21}, [r1]
vand q12, q9, q14
vshr.u8 q11, q8, #3
vand q8, q10, q13
vshll.u8 q9, d25, #3
vmovl.u8 q10, d22
vshll.u8 q15, d24, #3
vmovl.u8 q11, d23
vorr q12, q15, q10
vorr q11, q9, q11
vshll.u8 q10, d16, #8
vshll.u8 q9, d17, #8
vorr q8, q12, q10
vorr q11, q11, q9
vst1.16 {q8}, [r2]!
vst1.16 {q11}, [r2]
bcc .L12
scalar: 75000 runs take 216.583ms
vector: 75000 runs take 48.8586ms
speedup: 4.433x
Notes on the generated code:
● strided access (vld3.8)
● "&" and shift right performed on u8 (vand, vshr.u8)
● widening shift (vshll.u8)
● no over-promotion to s32
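To make the bit layout of the conversion explicit, the expression from the loop body can be isolated into a helper that packs a single pixel. The helper name pack_pixel is hypothetical, but the expression is the one used in rgb24tobgr16_c. The first byte contributes its top 5 bits to bits 0-4, the second its top 6 bits to bits 5-10, and the third its top 5 bits to bits 11-15.

```c
#include <stdint.h>

/* Packs three 8-bit channels into one 16-bit value using the same
   expression as the loop body of rgb24tobgr16_c. */
uint16_t pack_pixel(uint8_t b, uint8_t g, uint8_t r)
{
    return (uint16_t)((b >> 3) | ((g & 0xFC) << 3) | ((r & 0xF8) << 8));
}
```

Every intermediate value fits in 16 bits, which is why the vectorized code can stay in u8 and u16 lanes instead of promoting to 32 bits.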
9
Writing vectorizer-friendly code
● Avoid aliasing problems
– Use __restrict__ qualified pointers
void foo (int *__restrict__ pInput, int *__restrict__ pOutput)
● Don’t unroll loops
– Loop vectorization is more powerful than SLP
Instead of:
for (i=0; i<n; i+=4) {
  sum += a[0];
  sum += a[1];
  sum += a[2];
  sum += a[3];
  a += 4;}
write:
for (i=0; i<n; i++)
  sum += a[i];
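A hedged sketch of a complete function using __restrict__ in the way the prototype above suggests. The name add_one and the loop body are illustrative assumptions, and the parameter names follow the slide. The qualifier is a promise from the programmer that the two arrays do not overlap, so the compiler does not need a runtime aliasing check before vectorizing.

```c
/* Hypothetical example: pOutput[i] = pInput[i] + 1.
   __restrict__ promises that pInput and pOutput do not overlap. */
void add_one(const int *__restrict__ pInput,
             int *__restrict__ pOutput, int n)
{
    int i;
    for (i = 0; i < n; i++)
        pOutput[i] = pInput[i] + 1;
}
```

Calling such a function with overlapping arrays is undefined behaviour, so the qualifier should only be added where the guarantee really holds.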
10
Writing vectorizer-friendly code (cont.)
● Use countable loops, with no side-effects
– No function-calls in the loop (distribute into a separate
loop)
Instead of:
for (i=0; i<n; i++) {
  if (a[i] == 0) foo();
  b[i] = c[i]; }
write:
for (i=0; i<n; i++)
  if (a[i] == 0) foo();
for (i=0; i<n; i++)
  b[i] = c[i];
– No ‘break’/‘continue’
Instead of:
for (i=0; i<n; i++) {
  if (a[i] == 8) break;
  b[i] = c[i]; }
write:
m = n;  /* copy everything if no element equals 8 */
for (i=0; i<n; i++)
  if (a[i] == 8) {m = i; break;}
for (i=0; i<m; i++)
  b[i] = c[i];
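The split-loop form of the ‘break’ example can be sketched as a complete function. The name copy_until_eight is hypothetical, and starting m at n (so that everything is copied when no element equals 8) is an assumption made here to keep the sketch well defined. The first loop only searches; the second is a plain countable copy with a known trip count.

```c
/* Copies c[0..m-1] into b, where m is the index of the first element
   of a that equals 8, or n if there is none. Returns m. */
int copy_until_eight(int *__restrict__ b, const int *__restrict__ c,
                     const int *a, int n)
{
    int i, m = n;

    /* Search loop: keeps the early exit out of the copy loop. */
    for (i = 0; i < n; i++)
        if (a[i] == 8) { m = i; break; }

    /* Copy loop: countable, no control flow. */
    for (i = 0; i < m; i++)
        b[i] = c[i];

    return m;
}
```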
11
Writing vectorizer-friendly code (cont.)
● Keep the memory access-pattern simple
– Don't use indirect accesses, e.g.:
for (i=0; i<n; i++)
a[b[i]] = x;
– Don't use unknown stride, e.g.:
for (i=0; i<n; i++)
a[i+stride] = x;
● Use "int" iterators rather than "unsigned int" iterators
– In C, signed overflow is undefined behaviour, so the compiler may
assume an "int" iterator never overflows; an "unsigned int" iterator
wraps around, which makes the trip count harder to determine.
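A small sketch of why the iterator type matters, using a hypothetical function fill_inclusive. With an "int" iterator and the condition i <= last, the compiler may assume i never overflows, so the loop runs exactly last + 1 times when last >= 0. With an "unsigned int" iterator and last equal to UINT_MAX, the same condition would always be true, so the compiler could not rule out an infinite loop.

```c
/* Hypothetical example: sets a[0..last] to value.
   The signed iterator lets the compiler compute the trip count
   as last + 1 (for last >= 0) without worrying about wrap-around. */
void fill_inclusive(int *a, int last, int value)
{
    int i;
    for (i = 0; i <= last; i++)
        a[i] = value;
}
```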
12
Some of our recent contributions
● Support of vldN/vstN
● NEON specific patterns: e.g. widening shift
● SLP (straight-line code vectorization)
improvements
● RTL improvements:
● reducing the number of moves and amount of
spilling (both for auto- and hand-vectorised code)
● improving modulo scheduling of NEON code
13
People
● Linaro Toolchain WG
● Ira Rosen (IRC: irar)
ira.rosen@linaro.org
– auto-vectorizer
● Richard Sandiford (IRC: rsandifo)
richard.sandiford@linaro.org
– NEON back-end/RTL optimizations
14
Helping us
Send us examples of code that is important for you to
have vectorized.
15
Output Example
ex.c:
1 #define N 128
2 int a[N], b[N];
3 void foo (void)
4 {
5 int i;
6
7 for (i = 0; i < N; i++)
8 a[i] = i;
9
10 for (i = 0; i < N; i+=5)
11 b[i] = i;
12 }
● What got vectorized:
gcc -c -O3 -ftree-vectorizer-verbose=1 ex.c
ex.c:7: note: LOOP VECTORIZED.
ex.c:3: note: vectorized 1 loops in function.
● What got vectorized and what didn't:
gcc -c -O3 -ftree-vectorizer-verbose=2 ex.c
ex.c:10: note: not vectorized: complicated access pattern.
ex.c:10: note: not vectorized: complicated access pattern.
ex.c:7: note: LOOP VECTORIZED.
ex.c:3: note: vectorized 1 loops in function.
...
● All the details:
gcc -c -O3 -ftree-vectorizer-verbose=9 ex.c
or
gcc -c -O3 -fdump-tree-vect-details ex.c
