How I Sped up Complex Matrix-Vector Multiplication:
Finding Intel MKL’s “Secret Sauce”
Brandon Liu
Yale Efficient Computing Lab, 8/14/20
Presentation outline
1. High-level research overview + results
2. Interesting tools + resources that might be helpful to others
3. Reflections + next steps
Research goal:
Write a kernel for integer complex matrix-vector multiplication that
runs faster than those provided by Intel’s Math Kernel library.
1.54x faster than MKL for (16x64) matrix * (64x1) vector
  Kernel          Avg. time
  MKL (float)     0.073 µs
  Mine (int16_t)  0.048 µs

1.50x faster than MKL for (64x16) matrix * (16x1) vector
  Kernel          Avg. time
  MKL (float)     0.057 µs
  Mine (int16_t)  0.038 µs

Fast!
How do we beat Intel MKL?
Ultimately, the key was to reverse engineer and study MKL’s fastest
proprietary implementation, then adapt it for a smaller integer data
type.
Using a smaller data type accomplishes two things:
1) increases SIMD parallelism during computation, and
2) decreases memory accesses
vmulps zmm3, zmm0, zmm1 (vector multiply, packed single precision)
A 512-bit zmm register fits 16 32-bit floats. The instruction multiplies zmm0 and zmm1 element-wise and writes the 16 products to zmm3 (e.g. 1.6 x 7.2 = 11.52, 5.2 x 1.0 = 5.2, 6.0 x 8.9 = 53.4, ...).
vpmullw zmm3, zmm0, zmm1 (vector multiply, packed word, store low 16 bits)
The same 512-bit zmm registers fit 32 int16 words, so one instruction now performs 32 element-wise multiplies instead of 16.
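As a minimal illustration of the lane-count difference, here is a hedged C++ intrinsics sketch; it assumes an AVX-512BW-capable CPU and shows only the two multiply instructions above, not the full kernel.

```cpp
#include <immintrin.h>

// 512-bit register as 16 float lanes: one vmulps = 16 multiplies
__m512  mul_f32(__m512 a, __m512 b)   { return _mm512_mul_ps(a, b); }

// 512-bit register as 32 int16 lanes: one vpmullw = 32 multiplies (low 16 bits kept)
__m512i mul_i16(__m512i a, __m512i b) { return _mm512_mullo_epi16(a, b); }
```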
Pros and cons of using int16_t over float
• int16 (fixed point, 16 bits)
• Halves memory traffic and storage space
• 2x more computations per instruction (increased SIMD parallelism)
• Use less energy? (fewer transistors, shorter wires, less capacitance)
• Limited range of representable #s (enough for baseband processing)
• float (floating point, 32 bits)
• Better existing hardware support (FMA ports)
• Better existing library support (MKL and other math libraries)
• Greater range of representable #s
Multiple functions to perform complex matrix-vector
multiply with Intel MKL — which is fastest?
• Armadillo (multiply operator)
• A C++ library that wraps MKL functions in easy-to-use syntax
• Calls cgemv() under the hood
• cgemv()
• complex general matrix-vector multiply
• cgemm()
• complex general matrix-matrix multiply
• Works because a matrix with 1 column is the same as a vector
• jit_cgemm()
• Just-in-Time compiled complex general matrix-matrix multiply
• JIT gemm kernels introduced in 2018
• By far the fastest of the 4
MKL Just-in-Time generated kernels are by far the fastest
Column major implementation is far faster than row major
• MKL lets you generate either row or column major kernels
• Column major is faster because:
• No horizontal reductions (summations of a vector register)
• Vector elements each loaded only once as opposed to M times
So, Intel MKL’s JIT cgemm kernel is the fastest— how
does it work and how can I beat it?
5 useful tools/resources
1) Zydis: Runtime disassembler
● objdump -d binaryname > output.asm
○ Only works for statically compiled code
● Options for examining disassembly at runtime: GDB, Zydis
○ GDB helps step through instructions (stepi) and view contents of registers
○ Zydis lets you output particular sections of assembly programmatically
○ Code snippets for using Zydis are in my repository
1) Zydis: Examples of assembly output
jit_cgemm kernel for (2x2) * (2x1) jit_cgemm kernel for (64x16) * (16x1)
MKL JIT generated kernels are optimized to the problem — 11 vs 475 lines!
2) Intel’s Manuals and WikiChip
● https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html
● The manufacturer's instruction set and optimization references were key to
understanding exactly what each instruction did and any side effects.
● WikiChip helped with a general understanding of my microarchitecture's ports
and theoretical throughput.
3) Xbyak: JIT assembler used by Intel MKL
• Allows run-time compilation of x86 (IA32), x64 (AMD64, x86-64) instructions
• Ultimately what I used to write my int16 cgemv kernel generator
• Generates the machine instructions straight from C++ bindings – no reordering
• Open source (https://github.com/herumi/xbyak)
Snippet of my kernel code generator using Xbyak
Generating and running a kernel for (M x K) * (K x 1) cgemv
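The actual generator snippet is shown as a screenshot in the slides; as a stand-in, here is a minimal, hypothetical Xbyak sketch of the same pattern (register choices, argument order, and the single-chunk body are illustrative assumptions, not the real kernel):

```cpp
#include <xbyak/xbyak.h>
#include <cstdint>

// Toy kernel: one 512-bit chunk of an int16 multiply-accumulate, emitted at runtime.
// The real generator loops over M and K and keeps many accumulators.
struct ToyKernel : Xbyak::CodeGenerator {
    ToyKernel() {
        // System V x86-64 calling convention: rdi = matrix ptr, rsi = vector ptr, rdx = result ptr
        vpxord(zmm0, zmm0, zmm0);        // zero the int32 accumulator
        vmovdqu16(zmm1, ptr[rdi]);       // load 32 int16 matrix values
        vmovdqu16(zmm2, ptr[rsi]);       // load 32 int16 vector values
        vpdpwssds(zmm0, zmm1, zmm2);     // VNNI: multiply int16 pairs, accumulate into int32 lanes
        vmovdqu32(ptr[rdx], zmm0);       // store 16 int32 partial sums
        ret();
    }
};

// Usage sketch (requires AVX-512BW + AVX512_VNNI at runtime):
//   ToyKernel k;
//   auto fn = k.getCode<void (*)(const int16_t*, const int16_t*, int32_t*)>();
//   fn(mat_chunk, vec_chunk, out);
```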
4) Intel VTune Profiler
● Detailed data collection about software performance and bottlenecks
● hotspots and uarch-exploration options were most useful
● The data pointed me in the right direction for finding my bottleneck, but Intel
VTune's suggestions were too generic to actually help fix anything.
5) Agner Fog’s test programs for latency/throughput
● https://www.agner.org/optimize/
● Open source test scripts that empirically measure instruction latency and
throughput on your machine, among other things.
● Useful because Intel did not provide theoretical numbers for the newest
instructions I used on my architecture.
● Helped me past my last roadblock by showing that a particular instruction's
(vpdpwssds) latency/throughput was not the issue.
Resolving the key data dependency issue
Old, bad version (left) vs. new, better version (right)
zmm29 and zmm28 are accumulators whose contents are updated on each iteration of
the loop; this is a data dependency. Because vpdpwssds takes a relatively long
time, each iteration in this version must wait for the previous iteration to
complete before it can run (bad!).
In the better version, we unroll the loop in steps of 4 and introduce 3 more
pairs of registers (the red boxes) as partial accumulators. At the end we call
vpaddd to sum them up so the final zmm29 and zmm28 match the original version.
The extra instructions are worth it: the data dependencies are now spread
further apart, so iterations no longer have to wait on each other.
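The kernel itself is emitted with Xbyak, but the idea translates directly to intrinsics. A hedged C++ sketch of splitting the accumulation across four independent registers (names and the unroll factor are illustrative; remainder handling is omitted):

```cpp
#include <immintrin.h>
#include <cstdint>

// Sum of int16 dot products over many 512-bit chunks, with the dependency chain
// split across four independent accumulators (mirrors the unroll-by-4 fix).
__m512i dot_accumulate(const int16_t* a, const int16_t* b, size_t n512) {
    __m512i acc0 = _mm512_setzero_si512(), acc1 = _mm512_setzero_si512();
    __m512i acc2 = _mm512_setzero_si512(), acc3 = _mm512_setzero_si512();
    for (size_t i = 0; i + 4 <= n512; i += 4) {
        // Four vpdpwssds per iteration, each feeding a different accumulator,
        // so consecutive instructions do not wait on each other's results.
        acc0 = _mm512_dpwssds_epi32(acc0, _mm512_loadu_si512(a + 32 * (i + 0)),
                                          _mm512_loadu_si512(b + 32 * (i + 0)));
        acc1 = _mm512_dpwssds_epi32(acc1, _mm512_loadu_si512(a + 32 * (i + 1)),
                                          _mm512_loadu_si512(b + 32 * (i + 1)));
        acc2 = _mm512_dpwssds_epi32(acc2, _mm512_loadu_si512(a + 32 * (i + 2)),
                                          _mm512_loadu_si512(b + 32 * (i + 2)));
        acc3 = _mm512_dpwssds_epi32(acc3, _mm512_loadu_si512(a + 32 * (i + 3)),
                                          _mm512_loadu_si512(b + 32 * (i + 3)));
    }
    // vpaddd at the end merges the partial accumulators, as in the slide.
    return _mm512_add_epi32(_mm512_add_epi32(acc0, acc1),
                            _mm512_add_epi32(acc2, acc3));
}
```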
Reflections +
Next steps
Summary of results and takeaways
● My int16_t cgemv kernel generator:
○ Supports matrices with a column-major, interleaved complex data layout
○ Dimensions: M rows by K columns, M <= 208, M a multiple of 16, any K
● Key takeaways:
○ How you lay out data has a significant impact on how efficiently you can perform
computations on it
○ Memory access and instruction ordering/data dependencies have a huge impact
on performance in compute kernels
○ Compilers do not necessarily use the best/latest machine instructions or
optimize SIMD code perfectly
○ In my case, I had to essentially hand-compile my source into assembly
Next steps
● Next steps/ideas:
○ Calculate error and precision compared to float
○ Support any size matrix
○ Extend to real matrix-vector multiplication?
○ Formally prove correctness? (I just compare to MKL’s correct output)
● Bigger idea:
○ Automate certain optimizations of assembly code? (Beat the compiler?)
■ Some existing research on randomized assembly instruction ordering to
generate faster code (http://stoke.stanford.edu/)
Rough timeline of events May 2020 – Aug 2020
● Studied caches, locality, optimizing matrix-matrix multiply, Intel Intrinsics
● Implemented cgemv with Intrinsics (many times)
● Implemented cgemv with Agner Fog’s Vector Class Library
● Studied Halide and the idea of algorithm vs. schedule
● Studied compiler optimizations, barriers, inline assembly
● Studied existing research on complex number data layouts and tested them
● Looked into MKL Compact BLAS routines
● Learned to use Intel VTune and benchmarked different MKL cgemv methods
● Contacted CMU researchers about alternative data layouts
● Learned to use Zydis and GDB for runtime disassembling
● Pored over MKL's jitted assembly + Intel instruction references → breakthrough!
● Learned to use Xbyak JIT code generator and write x86_64 assembly
● Wrote JIT kernel generator for int16 cgemv
● Discovered VNNI instructions and updated algorithm to incorporate fused multiply add
● Used Agner Fog’s scripts to identify/fix a data dependency issue for small matrix sizes
Personal reflections on learning
● Most directly useful knowledge toward actually beating MKL came at the end
○ How to best optimize time/energy in the right areas?
■ Big picture → small picture
■ How to know what I don’t know but need to know?
○ The experience was enlightening both because I learned about many topics deeply
and because I learned about the process of research in general
● Thank you to Jian and Lin for all their help and support!
Complex multiplication review
(a + bi)(c + di) = (ac − bd) + (bc + ad)i
where (ac − bd) is the real component and (bc + ad) is the imaginary component.
● Complex multiplication is like binomial multiplication (first, outer, inner, last)
● Makes it a little tricky to implement with SIMD
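For concreteness, a quick worked instance of the identity above:
(1 + 2i)(3 + 4i) = (1·3 − 2·4) + (2·3 + 1·4)i = −5 + 10i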
Intel MKL JIT cgemm kernel walkthrough
[Diagram: an M = 8 by K = 4 matrix of interleaved complex entries (a = real part, b = imaginary part) is multiplied by a K = 4 interleaved complex vector, with SIMD width V = 2. Each complex entry of the result is the pair (ac−bd)1…k (real) and (bc+ad)1…k (imaginary), accumulated over the K columns.]
Intel MKL JIT cgemm kernel walkthrough (continued)
[Diagram: for each column k, the real part c_k and imaginary part d_k of the corresponding vector element are broadcast into separate registers and multiplied against the interleaved column (a, b, a, b, …), accumulating (ac, bc, …)1…k in one register and (ad, bd, …)1…k in another ("accumulate, for each column"). The (ad, bd) accumulator is then permuted (swap pairs) to (bd, ad, …), a fused negate multiply add against the pattern (1, −1, 1, −1, …) flips the signs of the bd terms, and a final fused multiply add combines the two accumulators into (ac−bd)1…k and (bc+ad)1…k. SIMD width V = 2. Note: subscript 1…k used to signify summation of values with subscript in range 1 to k.]
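My reading of that diagram, expressed as a hedged float intrinsics sketch; this is not MKL's actual code, and the permute immediate and sign ordering in particular are assumptions:

```cpp
#include <immintrin.h>

// One column step of the interleaved complex multiply-accumulate pattern.
// col points at 8 interleaved complex floats (a0 b0 a1 b1 ...).
void cmul_column_step(const float* col, float c, float d,
                      __m512& acc_c, __m512& acc_d) {
    __m512 ab = _mm512_loadu_ps(col);
    acc_c = _mm512_fmadd_ps(ab, _mm512_set1_ps(c), acc_c);  // accumulates (ac, bc, ...)
    acc_d = _mm512_fmadd_ps(ab, _mm512_set1_ps(d), acc_d);  // accumulates (ad, bd, ...)
}

// After the K loop: swap pairs of acc_d and fold it into acc_c with signs,
// so even lanes hold (ac - bd) and odd lanes hold (bc + ad).
__m512 cmul_finish(__m512 acc_c, __m512 acc_d) {
    __m512 swapped = _mm512_permute_ps(acc_d, 0xB1);               // (bd, ad, bd, ad, ...)
    const __m512 sign = _mm512_set4_ps(1.0f, -1.0f, 1.0f, -1.0f);  // sign ordering is an assumption
    return _mm512_fmadd_ps(swapped, sign, acc_c);                  // (ac - bd, bc + ad, ...)
}
```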
My JIT cgemv kernel walkthrough
[Diagram: same M = 8 by K = 4 interleaved layout and SIMD width V = 2, but for each column k the vector registers are pre-arranged as swapped/negated pairs, e.g. (c_k, −d_k, c_k, −d_k, …) and (d_k, c_k, d_k, c_k, …) ("swap pairs"). A fused multiply add per column then accumulates (ac−bd)1…k and (ad+bc)1…k directly, so no separate permute/negate step is needed at the end. Note: subscript 1…k used to signify summation of values with subscript in range 1 to k.]
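The int16 version leans on VNNI: vpdpwssds multiplies each adjacent pair of int16 lanes and accumulates their sum into one int32 lane, so with the vector register pre-arranged as (c, −d) pairs it yields ac − bd directly, and as (d, c) pairs it yields ad + bc. A hedged intrinsics sketch of one column step (my reading of the diagram, not the generated assembly; names are illustrative):

```cpp
#include <immintrin.h>
#include <cstdint>

// One column step of the int16 cgemv pattern.  col holds 16 interleaved complex
// int16 values (a0 b0 a1 b1 ...); the caller pre-builds:
//   vec_cd = (c, -d, c, -d, ...)   and   vec_dc = (d, c, d, c, ...)
// vpdpwssds sums each adjacent int16 product pair into one int32 lane:
//   acc_re lane i += a_i*c + b_i*(-d) = (ac - bd)_i
//   acc_im lane i += a_i*d + b_i*( c) = (ad + bc)_i
void cgemv_column_step(const int16_t* col, __m512i vec_cd, __m512i vec_dc,
                       __m512i& acc_re, __m512i& acc_im) {
    __m512i ab = _mm512_loadu_si512(col);
    acc_re = _mm512_dpwssds_epi32(acc_re, ab, vec_cd);
    acc_im = _mm512_dpwssds_epi32(acc_im, ab, vec_dc);
}
```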
Intel MKL JIT cgemm kernel walkthrough (full matrix)
[Diagram: the M = 8 by K = 4 matrix is processed in row blocks of SIMD width V = 2; the sequence above is repeated M/V times to yield all results (ac−bd)1…k, (bc+ad)1…k. Notice that the blue vector registers holding c and d are reused across row blocks.]
Design considerations
Design considerations
1. What data type to use? (float vs. int16)
2. What algorithm/schedule to use? (repeated dot product vs. multiply-add)
3. What data order/layout to use? (row/column major, interleaved/split)
4. How to write the code? (Intel Intrinsics vs. writing x86_64 assembly)
Overarching goals: speed up computation, reduce memory overhead.
1) What data type to use? (float vs. int16)
• int16 (fixed point, 16 bits)
• Limited range of representable #s (enough for baseband processing)
• Halves memory bandwidth and storage space
• Increases SIMD parallelism (2x more computations per instruction)
• Increases SIMD parallelism (2x more computations per instruction)
• Use less energy? (fewer transistors, shorter wires, less capacitance)
• float (floating point, 32 bits)
• Greater range of representable #s
• Better existing hardware support (FMA ports)
• Better existing library support (MKL and other math libraries)
2) What algorithm/schedule to use?
Both schedules evaluate the same product, e.g.

    [a11 a12]   [x1]   [y1]
    [a21 a22] x [x2] = [y2]
    [a31 a32]          [y3]

Repeated dot products pseudocode (row major):
Matvec_rowmaj(mat_a, vec_x, res_y):
  For each row in mat_a, index i:
    dotProd = dot(row, vec_x)
    res_y[i] = dotProd
• Horizontal summation in the inner loop (slow)
• Vector elements loaded M times (M = # rows)

Repeated multiply-add pseudocode (column major, accumulate into res_y):
Matvec_colmaj(mat_a, vec_x, res_y):
  For each col in mat_a, index j:
    For each value in col, index i:
      res_y[i] += value * vec_x[j]
• No horizontal reductions
• Vector elements each loaded only once

Both versions produce the same result (algorithm), but use different orders of data access (schedule); a C++ sketch of both follows below.
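A scalar C++ sketch of the two schedules (real-valued data, hypothetical helper names, no SIMD), just to make the access-order difference concrete:

```cpp
#include <cstddef>

// Row-major schedule: one dot product per output element.
// x is re-read for every row (M passes over x), and each row ends
// in a horizontal reduction.
void matvec_rowmaj(const float* A, const float* x, float* y, size_t M, size_t K) {
    for (size_t i = 0; i < M; ++i) {
        float dot = 0.0f;
        for (size_t j = 0; j < K; ++j)
            dot += A[i * K + j] * x[j];
        y[i] = dot;
    }
}

// Column-major schedule: broadcast one x[j] and accumulate into all of y.
// Each x[j] is read once; there is no horizontal reduction.
void matvec_colmaj(const float* A, const float* x, float* y, size_t M, size_t K) {
    for (size_t i = 0; i < M; ++i) y[i] = 0.0f;
    for (size_t j = 0; j < K; ++j)
        for (size_t i = 0; i < M; ++i)
            y[i] += A[j * M + i] * x[j];  // A stored column major here
}
```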
3) What data order/layout to use?
• Row major or Column major
• Column major for locality of access w/ repeated multiply-add method
• Interleaved complex or Split complex
• Interleaved is the standard/typical layout for complex numbers
Interleaved complex layout: one buffer (complex* mat) storing r1, i1, r2, i2, …, rk, ik
Split complex layout: two buffers (mat_real and mat_imag), one storing r1, r2, …, rk and the other i1, i2, …, ik
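A minimal C++ sketch of the two layouts (names are illustrative):

```cpp
#include <complex>
#include <vector>

// Interleaved layout: real and imaginary parts adjacent in one buffer
// (the standard std::complex / BLAS convention: r0 i0 r1 i1 ...).
std::vector<std::complex<float>> mat_interleaved;

// Split layout: separate planes for real and imaginary parts.
std::vector<float> mat_real;   // r0 r1 r2 ...
std::vector<float> mat_imag;   // i0 i1 i2 ...
```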
4) How to write the code?
• Using Intel Intrinsics
• Vector instructions wrapped in C++ style functions
• “Higher level” programming, less fine-tuned control
• Register usage and instruction ordering determined by compiler
• Writing x86_64 assembly by hand
• Lowest level of programming, most fine-tuned control of instructions
• Register usage and instruction ordering manually determined
• Prone to error (Compilers are pretty smart/safe, while programmers
can introduce bugs)
Both methods are non-portable → they only run on CPUs that support the instructions used
Editor's Notes
#39: X should be read less often from memory in the column major layout too?