3 - Intro to SVE.pdf for intro ARM SVE part
JunZhao68
ARM SVE
1.
Arm SVE Fundamentals
2.
© 2019 Arm Limited
Arm's Scalable Vector Extension (SVE)
An ISA feature that silicon partners can implement at a vector length of their choice, from 128 to 2048 bits.
How SVE works
• The hardware sets the vector length.
• In software, vectors have no length.
• The exact same binary code runs on hardware with different vector lengths.
[Figure: the same vector addition on operands A, B and C, executed on a 512-bit vector unit and on a 256-bit vector unit.]
SVE improves auto-vectorization
• Gather-load and scatter-store
• Per-lane predication
• Predicate-driven loop control and management
• Vector partitioning and software-managed speculation
• Extended floating-point horizontal reductions
3.
Vector Length Agnostic (VLA) programming model
• Write once
• Compile once
• Vectorize more loops
4.
SVE vs Traditional ISA
How do we process data made up of ten 4-byte chunks?
• AArch64 (scalar): ten iterations over a 4-byte register.
• NEON (128-bit vector engine): two iterations over a 16-byte register, plus two iterations of a drain loop over a 4-byte register.
• SVE (128-bit VLA vector engine): three iterations over a 16-byte VLA register with an adjustable predicate.
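The iteration counts on this slide follow from simple integer arithmetic. The sketch below is a plain C model, not SVE code; the type `trips_t` and the three function names are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Loop-trip counts for processing n elements (hypothetical helper type). */
typedef struct {
    size_t full;   /* iterations of the main loop */
    size_t drain;  /* iterations of a scalar drain (tail) loop */
} trips_t;

/* Scalar code: one element per iteration, no tail. */
trips_t scalar_trips(size_t n) {
    trips_t t = { n, 0 };
    return t;
}

/* Fixed-width SIMD (NEON-style): whole vectors first, then a scalar
 * drain loop for the elements that do not fill a vector. */
trips_t fixed_simd_trips(size_t n, size_t vec_bytes, size_t elem_bytes) {
    assert(elem_bytes > 0 && vec_bytes >= elem_bytes);
    size_t lanes = vec_bytes / elem_bytes;
    trips_t t = { n / lanes, n % lanes };
    return t;
}

/* Predicated SIMD (SVE-style): the final, partial vector is handled by
 * switching lanes off in the predicate, so no drain loop is needed. */
trips_t predicated_trips(size_t n, size_t vec_bytes, size_t elem_bytes) {
    assert(elem_bytes > 0 && vec_bytes >= elem_bytes);
    size_t lanes = vec_bytes / elem_bytes;
    trips_t t = { (n + lanes - 1) / lanes, 0 };  /* ceil(n / lanes) */
    return t;
}
```

For ten 4-byte elements and 16-byte vectors this gives 10 scalar iterations, 2 + 2 for the fixed-width case, and 3 for the predicated case, matching the slide.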
5.
How big can an SVE vector be?
Any multiple of 128 bits up to 2048 bits, and it can be dynamically reduced.
(A) VL = LEN x 128
(B) VL <= 2048
VL is implementation dependent and can be reduced by the OS or hypervisor.
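Rules (A) and (B) can be written as a small check. This is a sketch of the constraints exactly as the slide states them; the function and constant names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum { SVE_GRANULE_BITS = 128, SVE_MAX_BITS = 2048 };  /* from the slide */

/* True if vl_bits satisfies (A) VL = LEN x 128 with LEN >= 1
 * and (B) VL <= 2048. */
bool sve_vl_is_legal(unsigned vl_bits) {
    return vl_bits >= SVE_GRANULE_BITS &&
           vl_bits <= SVE_MAX_BITS &&
           vl_bits % SVE_GRANULE_BITS == 0;
}

/* Counts how many distinct vector lengths the two rules allow. */
unsigned sve_count_legal_vls(void) {
    unsigned count = 0;
    for (unsigned vl = 0; vl <= SVE_MAX_BITS; vl += 8) {
        if (sve_vl_is_legal(vl)) {
            ++count;
        }
    }
    assert(count == SVE_MAX_BITS / SVE_GRANULE_BITS);
    return count;
}
```

Under these two rules there are 2048 / 128 = 16 possible lengths, which is why VLA code cannot assume any one of them.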
6.
How can you program when the vector length is unknown?
SVE provides features that enable VLA programming from the assembly level up.
• Per-lane predication: operations work on individual lanes under the control of a predicate register.
• Predicate-driven loop control and management: eliminate scalar loop heads and tails by processing partial vectors.
• Vector partitioning and software-managed speculation: first-faulting load instructions let a vector access run into invalid pages safely. Only a fault on the first active element traps, and the FFR records which lanes were actually loaded.
[Figure: predicated add of 1 2 3 4 and 5 5 5 5 under pred 1 0 1 0, giving 6 2 8 4.]
[Figure: the loop for (i = 0; i < n; ++i) driven by INDEX and WHILELT, where the lanes at n-2 and n-1 are active and the lanes at n and n+1 are inactive.]
[Figure: a partitioned operation under a partial predicate.]
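The first two features can be modelled in portable C. The sketch below emulates a predicate as an array of `bool` and takes the lane count as a run-time parameter, which stands in for the hardware-chosen vector length. It is an emulation for illustration only, not SVE code, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_LANES 64  /* 2048-bit vector of 32-bit elements */

/* Emulates WHILELT: lane k is active while i + k < n. */
void emu_whilelt(bool *pred, size_t lanes, size_t i, size_t n) {
    for (size_t k = 0; k < lanes; ++k) {
        pred[k] = (i + k < n);
    }
}

/* Merging predicated add: active lanes receive a + b,
 * inactive lanes keep the value of a. */
void emu_add_merge(int *dst, const int *a, const int *b,
                   const bool *pred, size_t lanes) {
    for (size_t k = 0; k < lanes; ++k) {
        dst[k] = pred[k] ? a[k] + b[k] : a[k];
    }
}

/* c[i] = a[i] + b[i] for i in [0, n) with no scalar head or tail loop:
 * the final partial vector is covered by the predicate alone. */
void vla_add(int *c, const int *a, const int *b, size_t n, size_t lanes) {
    assert(lanes > 0 && lanes <= MAX_LANES);
    bool pred[MAX_LANES];
    for (size_t i = 0; i < n; i += lanes) {
        emu_whilelt(pred, lanes, i, n);
        for (size_t k = 0; k < lanes; ++k) {
            if (pred[k]) {  /* inactive lanes touch no memory */
                c[i + k] = a[i + k] + b[i + k];
            }
        }
    }
}
```

Calling `vla_add` with `lanes` set to 4 or to 16 yields the same result, which is the vector-length-agnostic property in miniature. With the ACLE intrinsics the same loop shape would use `svwhilelt_b32` together with predicated `svld1` and `svst1`; that version is left out here because it needs an SVE toolchain.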
7.
SVE Registers
• Scalable vector registers
  • Z0-Z31, extending NEON's 128-bit V0-V31.
  • Packed DP, SP & HP floating-point elements.
  • Packed 64, 32, 16 & 8-bit integer elements.
• Scalable predicate registers
  • P0-P7: governing predicates for load/store/arithmetic.
  • P8-P15: additional predicates for loop management.
  • FFR: first fault register for software speculation.
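How many packed elements a Z register holds depends on both the vector length and the element size. The sketch below works that out, along with the width of a predicate register. The one-predicate-bit-per-vector-byte rule comes from the SVE architecture rather than from this slide, and the function names are hypothetical.

```c
#include <assert.h>

/* Lanes in a Z register of vl_bits when packed with elements of
 * elem_bits (8, 16, 32 or 64, as listed on the slide). */
unsigned sve_lanes(unsigned vl_bits, unsigned elem_bits) {
    assert(vl_bits >= 128 && vl_bits <= 2048 && vl_bits % 128 == 0);
    assert(elem_bits == 8 || elem_bits == 16 ||
           elem_bits == 32 || elem_bits == 64);
    return vl_bits / elem_bits;
}

/* Width of a P register: one predicate bit per byte of the vector
 * (architectural fact, not stated on the slide). */
unsigned sve_pred_bits(unsigned vl_bits) {
    assert(vl_bits >= 128 && vl_bits <= 2048 && vl_bits % 128 == 0);
    return vl_bits / 8;
}
```

For example, a 512-bit implementation holds 8 double-precision or 16 single-precision lanes per Z register and has 64-bit predicate registers.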
8.
SVE vector & predicate register organization
9.
VLA Programming Approaches
Don't panic!
• Compilers:
  • Auto-vectorization: GCC, Arm Compiler for HPC, Cray, Fujitsu
  • Compiler directives, e.g. OpenMP
    • #pragma omp parallel for simd
    • #pragma vector always
• Libraries:
  • Arm Performance Library (ArmPL)
  • Cray LibSci
  • Fujitsu SSL II
• Intrinsics (ACLE):
  • Arm C Language Extensions for SVE
  • Arm Scalable Vector Extensions and Application to Machine Learning
• Assembly:
  • Full ISA Specification: The Scalable Vector Extension for Armv8-A
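As an illustration of the compiler-directive route, here is a minimal sketch of a loop annotated with the OpenMP directive named above. The function `saxpy` is a hypothetical example, not part of the deck. The `_OPENMP` guard keeps the file warning-free when OpenMP is switched off, in which case the loop is left to the compiler's own auto-vectorizer.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] += alpha * x[i] for i in [0, n).
 * restrict tells the compiler that x and y do not overlap, which
 * removes one common obstacle to vectorization. */
void saxpy(float *restrict y, const float *restrict x,
           float alpha, size_t n) {
    assert(n == 0 || (x != NULL && y != NULL));
#ifdef _OPENMP
#pragma omp parallel for simd
#endif
    for (size_t i = 0; i < n; ++i) {
        y[i] += alpha * x[i];
    }
}
```

With GCC, one would typically build this with `-fopenmp` (or `-fopenmp-simd` for the SIMD part only) and select an SVE target with something like `-march=armv8-a+sve`. Whether the loop is actually vectorized depends on the compiler version and optimization level.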
10.
SVE supports vectorization in complex code
Right from the start, SVE was engineered to handle code that usually won't vectorize.
• Extended floating-point horizontal reductions: in-order and tree-based reductions trade off performance against repeatability.
• Gather-load and scatter-store: a gather load fills a single register from several non-contiguous memory locations.
[Figure: 1 + 2 + 3 + 4 summed in order, and as a tree in which 1 + 2 = 3 and 3 + 4 = 7 are computed first.]
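The repeatability trade-off exists because floating-point addition is not associative: an in-order sum and a tree sum of the same data can round differently. The sketch below shows both orders in scalar C. It is an illustration of the arithmetic only, not SVE code, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* In-order reduction: ((x0 + x1) + x2) + ...
 * Gives the same result as the scalar loop, but each add depends on
 * the previous one. */
float reduce_in_order(const float *x, size_t n) {
    assert(n == 0 || x != NULL);
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        acc += x[i];
    }
    return acc;
}

/* Tree (pairwise) reduction: sum each half, then add the halves.
 * The halves are independent, so they can be computed in parallel,
 * but the rounding can differ from the in-order sum. */
float reduce_tree(const float *x, size_t n) {
    assert(n == 0 || x != NULL);
    if (n == 0) {
        return 0.0f;
    }
    if (n == 1) {
        return x[0];
    }
    size_t half = n / 2;
    return reduce_tree(x, half) + reduce_tree(x + half, n - half);
}
```

For small integers such as 1, 2, 3, 4 both orders give 10. For an input like 1e8f, 1.0f, -1e8f, 1.0f they disagree in IEEE single precision, because 1.0f is absorbed whenever it is added to a value of magnitude 1e8.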
11.
Portability
Is it really possible to run a vectorized application anywhere?
Write once: can my code compile for machines with different VL?
• Code that is auto-vectorized by the compiler
• Hand-written assembly
• Hand-written C intrinsics
Compile once: can I take my executable and run it on machines with different VL?
• Self-contained programs with no external dependencies
• But what about programs that depend on external libraries? ...
12.
Auto-vectorize external calls: libm example
float sinf(float);
NEON
• Neon has a 128-bit and 64-bit register split.
• The library has to provide at least two symbols, because it doesn't know where the auto-vectorized code comes from:
  • _ZGVnN2v_sinf
  • _ZGVnN4v_sinf
SVE
• Does libm need to provide a symbol for each VL?
  • _ZGVsM4v_sinf
  • _ZGVsM6v_sinf
  • _ZGVsM8v_sinf
  • _ZGVsM10v_sinf
  • …
• One symbol! _ZGVsMxv_sinf
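The symbol names above follow a regular pattern: `_ZGV`, an ISA letter, a mask letter, a lane count, one letter per parameter, then `_` and the scalar name. The sketch below assembles such names. The field meanings are read from the AArch64 vector function ABI naming scheme, not from the slide: `n` for NEON, `s` for SVE, `N` for unmasked, `M` for masked, and `x` for a scalable lane count. The helper `vector_symbol` is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes "_ZGV<isa><mask><len><params>_<name>" into buf.
 *   isa    : 'n' (NEON) or 's' (SVE)
 *   mask   : 'N' (unmasked) or 'M' (masked)
 *   lanes  : lane count, or 0 to emit 'x' (scalable, length agnostic)
 *   params : one letter per parameter, e.g. "v" for one vector argument
 * Returns the snprintf result (length the full name needs). */
int vector_symbol(char *buf, size_t size, char isa, char mask,
                  unsigned lanes, const char *params, const char *name) {
    assert(buf != NULL && size > 0);
    assert(params != NULL && name != NULL && strlen(name) > 0);
    if (lanes == 0) {
        return snprintf(buf, size, "_ZGV%c%cx%s_%s",
                        isa, mask, params, name);
    }
    return snprintf(buf, size, "_ZGV%c%c%u%s_%s",
                    isa, mask, lanes, params, name);
}
```

Passing `('n', 'N', 4, "v", "sinf")` reproduces `_ZGVnN4v_sinf`, and passing `('s', 'M', 0, "v", "sinf")` reproduces the single SVE symbol `_ZGVsMxv_sinf`. The scalable form is what lets one library entry point serve every vector length.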
13.
Open source support
• Arm is actively posting SVE open source patches upstream
  • Beginning with the first public announcement of SVE at Hot Chips 2016
• Available upstream
  • GNU Binutils 2.28: released Feb 2017, includes SVE assembler & disassembler
  • GCC 8: full assembly, disassembly and basic auto-vectorization
  • LLVM 7: full assembly, disassembly
  • QEMU 3: user-space SVE emulation
  • GDB 8.2: HPC use cases fully included
• Under upstream review
  • LLVM: since Nov 2016, as presented at LLVM conference
  • Linux kernel: since Mar 2017, LWN article on SVE support
14.
© 2021 Arm
SVE: More Powerful Vectorization on V1
SVE vectorizes more codes and makes better use of the vector units.
[Figure: bar chart of the cycle ratio SVE / NEON (axis from 0 to 1, lower is better) for haccmk, memcpy, long_strlen, short_strlen, pixel_avg and milc_opt. Simulation projections.]
15.
Quick Recap
• SVE enables Vector Length Agnostic (VLA) programming
• VLA enables portability, scalability, and optimization
• Predicates control which operations affect which vector lanes
  • Predicates are not bitmasks
  • You can think of them as dynamically resizing the vector registers
• The actual vector length is set by the CPU architect
  • Any multiple of 128 bits up to 2048 bits
  • May be dynamically reduced by the OS or hypervisor
• SVE was designed for HPC and can vectorize complex structures
• Many open source and commercial tools currently support SVE
16.
Hands On: HACC
17.
05_Apps/01_HACC
See README.md for details.
• Computationally intensive part of an N-body cosmology code.
• Application performance is dominated by a long chain of floating-point instructions.
• Performance scales well with vector length.
• FOM: wall-clock time spent in the application loop, reported in seconds.
[Figure: results for NEON128 on A64FX and SVE512 on A64FX.]
18.
Thank You