PLDI 2017 Tutorial Session
Vectorization with LMS:
SIMD Intrinsics
Alen Stojanov
Department of Computer Science,
ETH Zurich, Switzerland
What is SIMD?
Single Instruction, Multiple Data
[Figure: SISD and SIMD panels, each showing the elements 1, 2, 3, 4]
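Scalar
#define T double
void add(T* x, T* y, T* z, int N) {
    for (int i = 0; i < N; ++i) {
        T x1, y1, z1;
        x1 = x[i];
        y1 = y[i];
        z1 = x1 + y1;
        z[i] = z1;
    }
}
AVX x4
#include <immintrin.h>
#define T double
/* Processes 4 doubles per iteration; assumes N is a multiple of 4. */
void add(T* x, T* y, T* z, int N) {
    for (int i = 0; i < N; i += 4) {
        __m256d x1, y1, z1;
        x1 = _mm256_loadu_pd(x + i);    /* unaligned load of 4 doubles */
        y1 = _mm256_loadu_pd(y + i);
        z1 = _mm256_add_pd(x1, y1);     /* 4 additions in one instruction */
        _mm256_storeu_pd(z + i, z1);    /* unaligned store of 4 doubles */
    }
}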
Scalar, generated assembly
LBB0_3:
    movsd   (%rdi,%rax,8), %xmm0
    addsd   (%rsi,%rax,8), %xmm0
    movsd   %xmm0, (%rdx,%rax,8)
    incq    %rax
    cmpl    %eax, %r9d
    jne     LBB0_3
AVX x4, generated assembly
LBB0_3:
    vmovupd (%rdi,%r10,8), %ymm0
    vaddpd  (%rsi,%r10,8), %ymm0, %ymm0
    vmovupd %ymm0, (%rax)
    addq    $4, %r10
    addq    $32, %rax
    addq    $1, %rcx
    jne     LBB0_3
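The AVX loop above advances four elements at a time, so it touches memory past the end of the arrays unless N is a multiple of 4. A minimal sketch of one common remedy follows: a vector body plus a scalar tail. The function name add_any_n and the __AVX__ guard (which falls back to plain scalar code when the compiler is not targeting AVX) are assumptions made for this illustration, not part of the tutorial.

```c
#include <assert.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* z[i] = x[i] + y[i] for any n >= 0, not only multiples of 4. */
void add_any_n(const double *x, const double *y, double *z, int n) {
    assert(n >= 0);
    int i = 0;
#if defined(__AVX__)
    /* Vector body: 4 doubles per iteration while a full vector remains. */
    for (; i <= n - 4; i += 4) {
        __m256d x1 = _mm256_loadu_pd(x + i);
        __m256d y1 = _mm256_loadu_pd(y + i);
        _mm256_storeu_pd(z + i, _mm256_add_pd(x1, y1));
    }
#endif
    /* Scalar tail: the leftover elements (or all of them without AVX). */
    for (; i < n; ++i) {
        z[i] = x[i] + y[i];
    }
}
```

Compiled with AVX enabled (for example with -mavx on GCC or Clang), a call with n = 7 runs one vector iteration and then three scalar iterations.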
• MMX
• SSE / SSE2 / SSE3 / SSSE3 / SSE4.1 / SSE4.2
• AVX / AVX2 / AVX-512
• FMA / KNC / SVML
Operands for each:
AVX (256-bit): 8x float, 4x double, 32x 8-bit, 16x 16-bit, 8x 32-bit, 4x 64-bit
SSE (128-bit): 4x float, 2x double, 16x 8-bit, 8x 16-bit, 4x 32-bit, 2x 64-bit
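The operand counts follow from dividing the register width by the element width. A small sketch of that arithmetic, using a hypothetical helper named lanes that is not part of any intrinsics header:

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements of elem_bytes bytes that fit in a register of reg_bits bits. */
int lanes(int reg_bits, size_t elem_bytes) {
    assert(reg_bits > 0 && elem_bytes > 0);
    return reg_bits / (int)(8 * elem_bytes);
}
```

For example, lanes(256, sizeof(double)) gives the 4 doubles of an AVX register, and lanes(128, sizeof(float)) gives the 4 floats of an SSE register.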
That’s not all
Shuffles:
• _mm256_permutevar_pd
• _mm256_shufflehi_epi16
• …
Strings:
• _mm_cmpestrm
• _mm_cmpistrm
• …
Bitwise operators:
• _mm256_bslli_epi128
• _mm512_rol_epi32
• …
Statistics:
• _mm_avg_epu8
• _mm256_cdfnorm_pd
• …
Logical:
• _mm256_or_pd
• _mm256_andnot_pd
• …
Crypto:
• _mm_aesdec_si128
• _mm_sha1msg1_epu32
• …
Loads:
• _mm_i32gather_epi32
• _mm256_broadcast_ps
• …
Stores:
• _mm512_storenrngo_pd
• _mm_store_pd1
• …
Casts:
• _mm256_castps_pd
• _mm256_cvtps_epi32
• …
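To make one of the listed intrinsics concrete: _mm_avg_epu8 computes the rounded average (a + b + 1) >> 1 of 16 pairs of unsigned bytes in a single instruction. The sketch below wraps it in a hypothetical function avg_u8x16 (the name and the scalar fallback for non-SSE2 targets are assumptions made for this example).

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Rounded average of 16 unsigned bytes: out[i] = (a[i] + b[i] + 1) >> 1. */
void avg_u8x16(const uint8_t *a, const uint8_t *b, uint8_t *out) {
    assert(a && b && out);
#if defined(__SSE2__)
    /* One unaligned load per input, one averaging instruction, one store. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_avg_epu8(va, vb));
#else
    /* Scalar reference with the same rounding behaviour. */
    for (int i = 0; i < 16; ++i) {
        out[i] = (uint8_t)((a[i] + b[i] + 1) >> 1);
    }
#endif
}
```

Both paths produce the same bytes, so the scalar loop doubles as a reference when checking the vector version.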
There are a lot of SIMD instructions
AVX-512 has 3519 intrinsics
How do you port all intrinsics into LMS?
Idea #1: Get a Master's student to do it
Ivaylo Toskov, ETH Zurich
Idea #2: Generate them automatically
data-3.3.16.xml
Challenge #1
Scala chokes on big classes: the JVM limits a single method to about 64 kB of bytecode
• Split the implementation into multiple classes
• Make one trait inherit all split classes
Challenge #2
LMS has read / write effects
• Produce the effects automatically using the category data in the Intel Intrinsics Guide
<intrinsic tech='AVX' rettype='__m256d' name='_mm256_loadu_pd'>
  <type>Floating Point</type>
  <CPUID>AVX</CPUID>
  <category>Load</category>
  <parameter varname='mem_addr' type='double const *' />
  <description>
    Load 256-bits (composed of 4 packed
    double-precision (64-bit) floating-point elements)
    from memory into "dst". "mem_addr" does not need
    to be aligned on any particular boundary.
  </description>
  <operation>
    dst[255:0] := MEM[mem_addr+255:mem_addr]
    dst[MAX:256] := 0
  </operation>
  <instruction name='vmovupd' form='ymm, m256' />
  <header>immintrin.h</header>
</intrinsic>
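The idea of deriving effects from the category element can be sketched as a simple lookup. The mapping below ("Load" reads memory, "Store" writes it, everything else is treated as pure) is an assumption made for illustration; the actual generator is written in Scala and its rules may differ. The names effect_t and effect_for_category are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical effect kinds; LMS itself expresses effects in Scala. */
typedef enum { EFFECT_PURE, EFFECT_READ, EFFECT_WRITE } effect_t;

/* Assumed mapping from an Intrinsics Guide category string to an effect. */
effect_t effect_for_category(const char *category) {
    assert(category != NULL);
    if (strcmp(category, "Load") == 0)  return EFFECT_READ;   /* reads memory */
    if (strcmp(category, "Store") == 0) return EFFECT_WRITE;  /* writes memory */
    return EFFECT_PURE;                                       /* no memory effect assumed */
}
```

Under this assumed mapping, the _mm256_loadu_pd entry above, whose category is Load, would be tagged as a read.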
Challenge #3
Type mappings: unsigned?
• Use Scala Unsigned for unsigned operations
Challenge #4
Pointers?
• Disallow them and use memory offsets instead
Challenge #5
Implement arrays only?
• Abstract containers for the needs of the DSL
Challenge #6, #7, ...
Try to think of everything?
• Checked.
https://github.com/ivtoskov/lms-intrinsics
How do we make use of the intrinsics?
https://github.com/astojanov/lms-tutorial-pldi