Michael Hope, Toolchain
bzr branch lp:~michaelh1/+junk/intrinsics-demo
NEON Intrinsics
What's NEON?
● Ch 19, 'Introducing NEON'
http://infocenter.arm.com/help/topic/com.arm.doc.den0013a/
SIMD is...
Same instruction, many values
Anything involving signals (audio, images, video) is a good fit for SIMD
Normalisation
● Easier to read and write
● Easier (better?) register allocation
● Compiler knows how to schedule
● ABI neutral
Advantages
Works across compilers
> gcc -mcpu=cortex-a9 -mfpu=neon -O3 -c test.c
> armcc --cpu Cortex-A9 --c99 -O3 -c test.c
> clang -mcpu=cortex-a9 -mfpu=neon -O3 -c test.c
Tune for the architecture
-mtune=cortex-a9
-mtune=cortex-a8
-mtune=cortex-a5
SMS (swing modulo scheduling), unrolling, profiling?
Writing
Environment
#include <arm_neon.h>
gcc -march=armv7-a -mfpu=neon
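A minimal sketch of the environment set-up, assuming GCC or Clang: the compiler predefines `__ARM_NEON` (some older GCC releases define only `__ARM_NEON__`) when NEON is enabled, so a file can include arm_neon.h only when it is usable and fall back to plain C otherwise. `HAVE_NEON` and `have_neon()` are hypothetical names for illustration. Depending on the toolchain's default float ABI, GCC may also need `-mfloat-abi=softfp` or `-mfloat-abi=hard` alongside `-mfpu=neon`.

```c
/* Include arm_neon.h only when the compiler says NEON is enabled. */
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0   /* not ARM, or built without -mfpu=neon */
#endif

/* Hypothetical helper: returns 1 if this file was compiled with NEON
   intrinsics available, 0 if it will use the plain-C fallback. */
int have_neon(void)
{
    return HAVE_NEON;
}
```

The check is made at compile time, so the same source builds on a host without NEON and simply takes the scalar path there.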
Data types
<type>x<lanes>_t (e.g. uint8x8_t; a vector fills a 64-bit or 128-bit register)
<type>x<lanes>x<# registers>_t
(e.g. int16x4x2_t)
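The naming rule can be captured in a few lines. The helper below, `neon_type_name()`, is hypothetical and not part of arm_neon.h; it only encodes the convention above plus two constraints: one vector fills a 64-bit D register or a 128-bit Q register, and multi-register types come in x2, x3 and x4 forms (matching vld2/vld3/vld4). That is why a name such as uint8x4_t (only 32 bits) has no definition in arm_neon.h.

```c
#include <stdio.h>

/* Hypothetical helper: writes the arm_neon.h type name for `regs`
   registers, each holding `lanes` elements of `bits`-bit `base` type
   (base is e.g. "int", "uint", "float", "poly").
   Returns the name's length (like snprintf), or -1 if arm_neon.h has
   no such type or the buffer is too small. */
int neon_type_name(char *out, size_t out_len, const char *base,
                   int bits, int lanes, int regs)
{
    int width = bits * lanes;   /* total bits held in one register */
    int n;

    /* A NEON vector is a 64-bit D register or a 128-bit Q register. */
    if (width != 64 && width != 128)
        return -1;
    /* Plain vectors, or x2/x3/x4 groups used by vldN/vstN. */
    if (regs < 1 || regs > 4)
        return -1;

    if (regs == 1)
        n = snprintf(out, out_len, "%s%dx%d_t", base, bits, lanes);
    else
        n = snprintf(out, out_len, "%s%dx%dx%d_t", base, bits, lanes, regs);
    return (n < 0 || (size_t)n >= out_len) ? -1 : n;
}
```

For example, `neon_type_name(buf, sizeof buf, "int", 16, 4, 2)` produces "int16x4x2_t", while asking for 8-bit elements in 4 lanes is rejected.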
Some Instructions
Add
uint16x4_t vadd_u16 (uint16x4_t left, uint16x4_t right)
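A sketch of vadd_u16 applied over arrays, with a plain-C loop for the leftover elements when the length is not a multiple of four (and for the whole array on hosts without NEON). `add_u16()` is a hypothetical name. vadd_u16 wraps modulo 2^16 rather than saturating, so 65535 + 1 gives 0 in that lane; the cast in the scalar loop reproduces the same behaviour.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* dst[i] = a[i] + b[i] modulo 2^16, for i in [0, n). */
void add_u16(uint16_t *dst, const uint16_t *a, const uint16_t *b, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Four 16-bit lanes fit in one 64-bit D register. */
    for (; i + 4 <= n; i += 4) {
        uint16x4_t va = vld1_u16(a + i);
        uint16x4_t vb = vld1_u16(b + i);
        vst1_u16(dst + i, vadd_u16(va, vb));
    }
#endif
    /* Scalar tail, or the whole array when NEON is not available. */
    for (; i < n; i++)
        dst[i] = (uint16_t)(a[i] + b[i]);
}
```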
Multiply
uint64x2_t vmlal_u32 (uint64x2_t, uint32x2_t, uint32x2_t)
int32x4_t vqdmlal_s16 (int32x4_t, int16x4_t, int16x4_t)
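To pin down what each lane computes, here are plain-C reference models; `ref_vmlal_u32` and `ref_vqdmlal_s16` are hypothetical names, not library functions. vmlal_u32 widens each 32-bit product to 64 bits before adding it to the accumulator. vqdmlal_s16 doubles each 16 x 16 product, saturates it to 32 bits, then adds it to the accumulator with saturation. The models ignore the sticky saturation (QC) flag that the hardware also sets.

```c
#include <stdint.h>

/* Clamp a 64-bit value to the int32_t range. */
static int32_t sat_s32(int64_t x)
{
    if (x > INT32_MAX) return INT32_MAX;
    if (x < INT32_MIN) return INT32_MIN;
    return (int32_t)x;
}

/* Model of vmlal_u32: out[i] = acc[i] + (uint64_t)a[i] * b[i].
   The widened product cannot overflow; the sum wraps modulo 2^64. */
void ref_vmlal_u32(uint64_t out[2], const uint64_t acc[2],
                   const uint32_t a[2], const uint32_t b[2])
{
    for (int i = 0; i < 2; i++)
        out[i] = acc[i] + (uint64_t)a[i] * b[i];
}

/* Model of vqdmlal_s16: out[i] = sat(acc[i] + sat(2 * a[i] * b[i])).
   Only -32768 * -32768 makes the doubled product saturate. */
void ref_vqdmlal_s16(int32_t out[4], const int32_t acc[4],
                     const int16_t a[4], const int16_t b[4])
{
    for (int i = 0; i < 4; i++) {
        int32_t prod = sat_s32(2 * (int64_t)a[i] * b[i]);
        out[i] = sat_s32((int64_t)acc[i] + prod);
    }
}
```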
Strided load
uint8x8x2_t vld2_u8 (const uint8_t *)
Form of expected instruction(s):
vld2.8 {d0, d1}, [r0]
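A sketch using vld2_u8 to split 16 interleaved bytes into even and odd positions (for example the left and right channels of 8-bit stereo samples), with a plain-C equivalent when NEON is unavailable. `split_even_odd()` is a hypothetical name, and the registers in the generated vld2.8 will vary from the form shown above.

```c
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* De-interleave: even[k] = src[2k], odd[k] = src[2k + 1], k in [0, 8). */
void split_even_odd(const uint8_t src[16], uint8_t even[8], uint8_t odd[8])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* One strided load fills two 8-lane vectors; the compiler would
       typically emit a single vld2.8 {dN, dN+1}, [rM]. */
    uint8x8x2_t v = vld2_u8(src);
    vst1_u8(even, v.val[0]);
    vst1_u8(odd, v.val[1]);
#else
    /* Plain-C version of the same strided load. */
    for (int i = 0; i < 8; i++) {
        even[i] = src[2 * i];
        odd[i] = src[2 * i + 1];
    }
#endif
}
```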
Documentation
GCC
http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html
ARM
http://infocenter.arm.com/help/topic/com.arm.doc.den0013a
Blog posts
Search for “Coding with NEON” on http://blogs.arm.com
Writing
Colour space conversion
Y = 0.2126 R + 0.7152 G + 0.0722 B
HD television (ITU-R BT.709)
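A fixed-point sketch of this conversion with intrinsics. It is not Nils Pipenbrinck's code, and the integer weights are an assumption: the BT.709 coefficients scaled by 256 and rounded (54, 183, 19, with 18.48 rounded up to 19 so the three sum to 256 and white maps to 255). vld3_u8 de-interleaves eight RGB pixels, vmull_u8 and vmlal_u8 accumulate the weighted sum in 16-bit lanes (at most 255 x 256 = 65280, so it cannot overflow), and vshrn_n_u16 shifts right by 8 and narrows back to bytes. `rgb_to_luma()` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Assumed 8.8 fixed-point weights: 0.2126, 0.7152, 0.0722 times 256. */
enum { WR = 54, WG = 183, WB = 19 };

/* rgb: n packed pixels R,G,B,R,G,B,...  y: n output luma bytes. */
void rgb_to_luma(uint8_t *y, const uint8_t *rgb, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const uint8x8_t wr = vdup_n_u8(WR);
    const uint8x8_t wg = vdup_n_u8(WG);
    const uint8x8_t wb = vdup_n_u8(WB);
    for (; i + 8 <= n; i += 8) {
        uint8x8x3_t px = vld3_u8(rgb + 3 * i);    /* split R, G, B */
        uint16x8_t acc = vmull_u8(px.val[0], wr); /* R*WR, widened */
        acc = vmlal_u8(acc, px.val[1], wg);       /* + G*WG */
        acc = vmlal_u8(acc, px.val[2], wb);       /* + B*WB */
        vst1_u8(y + i, vshrn_n_u16(acc, 8));      /* >> 8, narrow */
    }
#endif
    /* Scalar tail, or every pixel when NEON is not available. */
    for (; i < n; i++) {
        const uint8_t *p = rgb + 3 * i;
        y[i] = (uint8_t)((WR * p[0] + WG * p[1] + WB * p[2]) >> 8);
    }
}
```

Truncating with `>> 8` keeps the NEON loop and the scalar tail bit-identical; vrshrn_n_u16 would round instead, but the scalar path would then need a matching `+ 128` before the shift.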
Versions
Nils Pipenbrinck
http://hilbert-space.de/?p=22
Q4.11: NEON Intrinsics
Performance
Plain C       48.481 s
Assembly       8.727 s (5.55 x faster)
Intrinsics     8.728 s (5.55 x faster)
Bigger Routines
“libpixelflinger: Add ARM NEON optimized scanline_t32cb16”
http://wiki.linaro.org/RichardSandiford/Sandbox/IntrinsicsPerformance
Hand-written   2.831 s
Intrinsics     2.637 s (7.4 % faster)
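The quoted ratios can be checked directly from the timings. For the pixelflinger routine, "7.4 % faster" compares throughput (hand-written time divided by intrinsics time); measured as time saved, the same result is about 6.9 %.

$$
\frac{48.481}{8.727} \approx 5.555, \qquad \frac{48.481}{8.728} \approx 5.555
$$

$$
\frac{2.831}{2.637} \approx 1.074, \qquad \frac{2.831 - 2.637}{2.831} \approx 0.069
$$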