How I Sped up Complex Matrix-Vector Multiplication:
Finding Intel MKL’s “Secret Sauce”
Brandon Liu
Yale Efficient Computing Lab, 8/14/20
Presentation outline
1. High-level research overview + results
2. Interesting tools + resources that might be helpful to others
3. Reflections + next steps
Research goal:
Write a kernel for integer complex matrix-vector multiplication that
runs faster than those provided by Intel’s Math Kernel library.
1.54x faster than MKL for (16x64) matrix * (64x1) vector
  Kernel          Avg. time
  MKL (float)     0.073 µs
  Mine (int16_t)  0.048 µs

1.50x faster than MKL for (64x16) matrix * (16x1) vector
  Kernel          Avg. time
  MKL (float)     0.057 µs
  Mine (int16_t)  0.038 µs

Fast!
How do we beat Intel MKL?
Ultimately, the key was to reverse engineer and study MKL’s fastest
proprietary implementation, then adapt it for a smaller integer data
type.
Using a smaller data type accomplishes two things:
1) increases SIMD parallelism during computation, and
2) decreases memory accesses
vmulps zmm3, zmm0, zmm1 (vector multiply, packed single precision)
A 512-bit zmm register fits 16 32-bit floats. The instruction multiplies zmm0 and zmm1 element-wise and writes the 16 products to zmm3 (e.g. 1.6 x 7.2 = 11.52, 5.2 x 1.0 = 5.2, 6.0 x 8.9 = 53.4, ...).
vpmullw zmm3, zmm0, zmm1 (vector multiply, packed word, store low 16 bits)
The same 512-bit zmm registers fit 32 int16 words, so one instruction now performs 32 element-wise multiplies instead of 16.
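As a minimal illustration of the lane-count difference, here is a hedged C++ intrinsics sketch; it assumes an AVX-512BW-capable CPU and shows only the two multiply instructions above, not the full kernel.

```cpp
#include <immintrin.h>

// 512-bit register as 16 float lanes: one vmulps = 16 multiplies
__m512  mul_f32(__m512 a, __m512 b)   { return _mm512_mul_ps(a, b); }

// 512-bit register as 32 int16 lanes: one vpmullw = 32 multiplies (low 16 bits kept)
__m512i mul_i16(__m512i a, __m512i b) { return _mm512_mullo_epi16(a, b); }
```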
Pros and cons of using int16_t over float
• int16 (fixed point, 16 bits)
• Halves memory traffic and storage space
• 2x more computations per instruction (increased SIMD parallelism)
• Use less energy? (fewer transistors, shorter wires, less capacitance)
• Limited range of representable #s (enough for baseband processing)
• float (floating point, 32 bits)
• Better existing hardware support (FMA ports)
• Better existing library support (MKL and other math libraries)
• Greater range of representable #s
Multiple functions to perform complex matrix-vector
multiply with Intel MKL — which is fastest?
• Armadillo (multiply operator)
• A C++ library that wraps MKL functions in easy-to-use syntax
• Calls cgemv() under the hood
• cgemv()
• complex general matrix-vector multiply
• cgemm()
• complex general matrix-matrix multiply
• Works because a matrix with 1 column is the same as a vector
• jit_cgemm()
• Just-in-Time compiled complex general matrix-matrix multiply
• JIT gemm kernels introduced in 2018
• By far the fastest of the 4
MKL Just-in-Time generated kernels are by far the fastest
Column major implementation is far faster than row major
• MKL lets you generate either row or column major kernels
• Column major is faster because:
• No horizontal reductions (summations of a vector register)
• Vector elements each loaded only once as opposed to M times
So, Intel MKL’s JIT cgemm kernel is the fastest— how
does it work and how can I beat it?
5 useful tools/resources
1) Zydis: Runtime disassembler
● objdump -d binaryname > output.asm
○ Only works for statically compiled code
● Options for examining disassembly at runtime: GDB, Zydis
○ GDB helps step through instructions (stepi) and view contents of registers
○ Zydis lets you output particular sections of assembly programmatically
○ Code snippets for using Zydis are in my repository
1) Zydis: Examples of assembly output
jit_cgemm kernel for (2x2) * (2x1) jit_cgemm kernel for (64x16) * (16x1)
MKL JIT generated kernels are optimized to the problem — 11 vs 475 lines!
2) Intel’s Manuals and WikiChip
● https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html
● The manufacturer's instruction set and optimization references were key to
understanding exactly what each instruction did and any side effects.
● WikiChip helped with a general understanding of my microarchitecture's ports
and theoretical throughput.
3) Xbyak: JIT assembler used by Intel MKL
• Allows run-time compilation of x86 (IA32), x64 (AMD64, x86-64) instructions
• Ultimately what I used to write my int16 cgemv kernel generator
• Generates the machine instructions straight from C++ bindings – no reordering
• Open source (https://github.com/herumi/xbyak)
Snippet of my kernel code generator using Xbyak
Generating and running a kernel for (M x K) * (K x 1) cgemv
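The actual generator snippet is shown as a screenshot in the slides; as a stand-in, here is a minimal, hypothetical Xbyak sketch of the same pattern (register choices, argument order, and the single-chunk body are illustrative assumptions, not the real kernel):

```cpp
#include <xbyak/xbyak.h>
#include <cstdint>

// Toy kernel: one 512-bit chunk of an int16 multiply-accumulate, emitted at runtime.
// The real generator loops over M and K and keeps many accumulators.
struct ToyKernel : Xbyak::CodeGenerator {
    ToyKernel() {
        // System V x86-64 calling convention: rdi = matrix ptr, rsi = vector ptr, rdx = result ptr
        vpxord(zmm0, zmm0, zmm0);        // zero the int32 accumulator
        vmovdqu16(zmm1, ptr[rdi]);       // load 32 int16 matrix values
        vmovdqu16(zmm2, ptr[rsi]);       // load 32 int16 vector values
        vpdpwssds(zmm0, zmm1, zmm2);     // VNNI: multiply int16 pairs, accumulate into int32 lanes
        vmovdqu32(ptr[rdx], zmm0);       // store 16 int32 partial sums
        ret();
    }
};

// Usage sketch (requires AVX-512BW + AVX512_VNNI at runtime):
//   ToyKernel k;
//   auto fn = k.getCode<void (*)(const int16_t*, const int16_t*, int32_t*)>();
//   fn(mat_chunk, vec_chunk, out);
```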
4) Intel VTune Profiler
● Detailed data collection about software performance and bottlenecks
● hotspots and uarch-exploration options were most useful
● The data pointed me in the right direction for finding my bottleneck, but Intel
VTune's suggestions were too generic to actually help fix anything.
5) Agner Fog’s test programs for latency/throughput
● https://www.agner.org/optimize/
● Open source test scripts that empirically measure instruction latency and
throughput on your machine, among other things.
● Useful because Intel did not provide theoretical numbers for the newest
instructions I used on my architecture.
● Helped me past my last roadblock by showing that a particular instruction's
(vpdpwssds) latency/throughput was not the issue.
Resolving the key data dependency issue
Old, bad version (left) vs. new, better version (right)
zmm29 and zmm28 are accumulators whose contents are updated on each iteration of
the loop; this is a data dependency. Because vpdpwssds takes a relatively long
time, each iteration in this version must wait for the previous iteration to
complete before it can run (bad!).
In the better version, we unroll the loop in steps of 4 and introduce 3 more
pairs of registers (the red boxes) as partial accumulators. At the end we call
vpaddd to sum them up so the final zmm29 and zmm28 match the original version.
The extra instructions are worth it: the data dependencies are now spread
further apart, so iterations no longer have to wait on each other.
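The kernel itself is emitted with Xbyak, but the idea translates directly to intrinsics. A hedged C++ sketch of splitting the accumulation across four independent registers (names and the unroll factor are illustrative; remainder handling is omitted):

```cpp
#include <immintrin.h>
#include <cstdint>

// Sum of int16 dot products over many 512-bit chunks, with the dependency chain
// split across four independent accumulators (mirrors the unroll-by-4 fix).
__m512i dot_accumulate(const int16_t* a, const int16_t* b, size_t n512) {
    __m512i acc0 = _mm512_setzero_si512(), acc1 = _mm512_setzero_si512();
    __m512i acc2 = _mm512_setzero_si512(), acc3 = _mm512_setzero_si512();
    for (size_t i = 0; i + 4 <= n512; i += 4) {
        // Four vpdpwssds per iteration, each feeding a different accumulator,
        // so consecutive instructions do not wait on each other's results.
        acc0 = _mm512_dpwssds_epi32(acc0, _mm512_loadu_si512(a + 32 * (i + 0)),
                                          _mm512_loadu_si512(b + 32 * (i + 0)));
        acc1 = _mm512_dpwssds_epi32(acc1, _mm512_loadu_si512(a + 32 * (i + 1)),
                                          _mm512_loadu_si512(b + 32 * (i + 1)));
        acc2 = _mm512_dpwssds_epi32(acc2, _mm512_loadu_si512(a + 32 * (i + 2)),
                                          _mm512_loadu_si512(b + 32 * (i + 2)));
        acc3 = _mm512_dpwssds_epi32(acc3, _mm512_loadu_si512(a + 32 * (i + 3)),
                                          _mm512_loadu_si512(b + 32 * (i + 3)));
    }
    // vpaddd at the end merges the partial accumulators, as in the slide.
    return _mm512_add_epi32(_mm512_add_epi32(acc0, acc1),
                            _mm512_add_epi32(acc2, acc3));
}
```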
Reflections +
Next steps
Summary of results and takeaways
● My int16_t cgemv kernel generator:
○ Supports matrices with a column-major, interleaved complex data layout
○ Dimensions: M rows by K columns, M <= 208, M a multiple of 16, any K
● Key takeaways:
○ How you lay out data has a significant impact on how efficiently you can perform
computations on it
○ Memory access and instruction ordering/data dependencies have a huge impact
on performance in compute kernels
○ Compilers do not necessarily use the best/latest machine instructions or
optimize SIMD code perfectly
○ In my case, I had to essentially hand-compile my source into assembly
Next steps
● Next steps/ideas:
○ Calculate error and precision compared to float
○ Support any size matrix
○ Extend to real matrix-vector multiplication?
○ Formally prove correctness? (I just compare to MKL’s correct output)
● Bigger idea:
○ Automate certain optimizations of assembly code? (Beat the compiler?)
■ Some existing research on randomized assembly instruction ordering to
generate faster code (http://stoke.stanford.edu/)
Rough timeline of events May 2020 – Aug 2020
● Studied caches, locality, optimizing matrix-matrix multiply, Intel Intrinsics
● Implemented cgemv with Intrinsics (many times)
● Implemented cgemv with Agner Fog’s Vector Class Library
● Studied Halide and the idea of algorithm vs. schedule
● Studied compiler optimizations, barriers, inline assembly
● Studied existing research on complex number data layouts and tested them
● Looked into MKL Compact BLAS routines
● Learned to use Intel VTune and benchmarked different MKL cgemv methods
● Contacted CMU researchers about alternative data layouts
● Learned to use Zydis and GDB for runtime disassembling
● Pored over MKL's jitted assembly + Intel instruction references → breakthrough!
● Learned to use Xbyak JIT code generator and write x86_64 assembly
● Wrote JIT kernel generator for int16 cgemv
● Discovered VNNI instructions and updated algorithm to incorporate fused multiply add
● Used Agner Fog’s scripts to identify/fix a data dependency issue for small matrix sizes
Personal reflections on learning
● Most directly useful knowledge toward actually beating MKL came at the end
○ How to best optimize time/energy in the right areas?
■ Big picture → small picture
■ How to know what I don’t know but need to know?
○ The experience was enlightening both because I learned about many topics deeply
and because I learned about the process of research in general
● Thank you to Jian and Lin for all their help and support!
Complex multiplication review
(a + bi)(c + di) = (ac − bd) + (bc + ad)i
where (ac − bd) is the real component and (bc + ad) is the imaginary component.
● Complex multiplication is like binomial multiplication (first, outer, inner, last)
● Makes it a little tricky to implement with SIMD
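For concreteness, a quick worked instance of the identity above:
(1 + 2i)(3 + 4i) = (1·3 − 2·4) + (2·3 + 1·4)i = −5 + 10i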
Intel MKL JIT cgemm kernel walkthrough
[Diagram: an M = 8 by K = 4 matrix of interleaved complex entries (a = real part, b = imaginary part) is multiplied by a K = 4 interleaved complex vector, with SIMD width V = 2. Each complex entry of the result is the pair (ac−bd)1…k (real) and (bc+ad)1…k (imaginary), accumulated over the K columns.]
Intel MKL JIT cgemm kernel walkthrough (continued)
[Diagram: for each column k, the real part c_k and imaginary part d_k of the corresponding vector element are broadcast into separate registers and multiplied against the interleaved column (a, b, a, b, …), accumulating (ac, bc, …)1…k in one register and (ad, bd, …)1…k in another ("accumulate, for each column"). The (ad, bd) accumulator is then permuted (swap pairs) to (bd, ad, …), a fused negate multiply add against the pattern (1, −1, 1, −1, …) flips the signs of the bd terms, and a final fused multiply add combines the two accumulators into (ac−bd)1…k and (bc+ad)1…k. SIMD width V = 2. Note: subscript 1…k used to signify summation of values with subscript in range 1 to k.]
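My reading of that diagram, expressed as a hedged float intrinsics sketch; this is not MKL's actual code, and the permute immediate and sign ordering in particular are assumptions:

```cpp
#include <immintrin.h>

// One column step of the interleaved complex multiply-accumulate pattern.
// col points at 8 interleaved complex floats (a0 b0 a1 b1 ...).
void cmul_column_step(const float* col, float c, float d,
                      __m512& acc_c, __m512& acc_d) {
    __m512 ab = _mm512_loadu_ps(col);
    acc_c = _mm512_fmadd_ps(ab, _mm512_set1_ps(c), acc_c);  // accumulates (ac, bc, ...)
    acc_d = _mm512_fmadd_ps(ab, _mm512_set1_ps(d), acc_d);  // accumulates (ad, bd, ...)
}

// After the K loop: swap pairs of acc_d and fold it into acc_c with signs,
// so even lanes hold (ac - bd) and odd lanes hold (bc + ad).
__m512 cmul_finish(__m512 acc_c, __m512 acc_d) {
    __m512 swapped = _mm512_permute_ps(acc_d, 0xB1);               // (bd, ad, bd, ad, ...)
    const __m512 sign = _mm512_set4_ps(1.0f, -1.0f, 1.0f, -1.0f);  // sign ordering is an assumption
    return _mm512_fmadd_ps(swapped, sign, acc_c);                  // (ac - bd, bc + ad, ...)
}
```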
My JIT cgemv kernel walkthrough
[Diagram: same M = 8 by K = 4 interleaved layout and SIMD width V = 2, but for each column k the vector registers are pre-arranged as swapped/negated pairs, e.g. (c_k, −d_k, c_k, −d_k, …) and (d_k, c_k, d_k, c_k, …) ("swap pairs"). A fused multiply add per column then accumulates (ac−bd)1…k and (ad+bc)1…k directly, so no separate permute/negate step is needed at the end. Note: subscript 1…k used to signify summation of values with subscript in range 1 to k.]
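The int16 version leans on VNNI: vpdpwssds multiplies each adjacent pair of int16 lanes and accumulates their sum into one int32 lane, so with the vector register pre-arranged as (c, −d) pairs it yields ac − bd directly, and as (d, c) pairs it yields ad + bc. A hedged intrinsics sketch of one column step (my reading of the diagram, not the generated assembly; names are illustrative):

```cpp
#include <immintrin.h>
#include <cstdint>

// One column step of the int16 cgemv pattern.  col holds 16 interleaved complex
// int16 values (a0 b0 a1 b1 ...); the caller pre-builds:
//   vec_cd = (c, -d, c, -d, ...)   and   vec_dc = (d, c, d, c, ...)
// vpdpwssds sums each adjacent int16 product pair into one int32 lane:
//   acc_re lane i += a_i*c + b_i*(-d) = (ac - bd)_i
//   acc_im lane i += a_i*d + b_i*( c) = (ad + bc)_i
void cgemv_column_step(const int16_t* col, __m512i vec_cd, __m512i vec_dc,
                       __m512i& acc_re, __m512i& acc_im) {
    __m512i ab = _mm512_loadu_si512(col);
    acc_re = _mm512_dpwssds_epi32(acc_re, ab, vec_cd);
    acc_im = _mm512_dpwssds_epi32(acc_im, ab, vec_dc);
}
```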
Intel MKL JIT cgemm kernel walkthrough (full matrix)
[Diagram: the M = 8 by K = 4 matrix is processed in row blocks of SIMD width V = 2; the sequence above is repeated M/V times to yield all results (ac−bd)1…k, (bc+ad)1…k. Notice that the blue vector registers holding c and d are reused across row blocks.]
Design considerations
Design considerations
1. What data type to use? (float vs. int16)
2. What algorithm/schedule to use? (repeated dot product vs. multiply-add)
3. What data order/layout to use? (row/column major, interleaved/split)
4. How to write the code? (Intel Intrinsics vs. writing x86_64 assembly)
Overarching goals: speed up computation, reduce memory overhead.
1) What data type to use? (float vs. int16)
• int16 (fixed point, 16 bits)
• Limited range of representable #s (enough for baseband processing)
• Halves memory bandwidth and storage space
• Increases SIMD parallelism (2x more computations per instruction)
• Increases SIMD parallelism (2x more computations per instruction)
• Use less energy? (fewer transistors, shorter wires, less capacitance)
• float (floating point, 32 bits)
• Greater range of representable #s
• Better existing hardware support (FMA ports)
• Better existing library support (MKL and other math libraries)
2) What algorithm/schedule to use?
Both schedules evaluate the same product, e.g.

    [a11 a12]   [x1]   [y1]
    [a21 a22] x [x2] = [y2]
    [a31 a32]          [y3]

Repeated dot products pseudocode (row major):
Matvec_rowmaj(mat_a, vec_x, res_y):
  For each row in mat_a, index i:
    dotProd = dot(row, vec_x)
    res_y[i] = dotProd
• Horizontal summation in the inner loop (slow)
• Vector elements loaded M times (M = # rows)

Repeated multiply-add pseudocode (column major, accumulate into res_y):
Matvec_colmaj(mat_a, vec_x, res_y):
  For each col in mat_a, index j:
    For each value in col, index i:
      res_y[i] += value * vec_x[j]
• No horizontal reductions
• Vector elements each loaded only once

Both versions produce the same result (algorithm), but use different orders of data access (schedule); a C++ sketch of both follows below.
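A scalar C++ sketch of the two schedules (real-valued data, hypothetical helper names, no SIMD), just to make the access-order difference concrete:

```cpp
#include <cstddef>

// Row-major schedule: one dot product per output element.
// x is re-read for every row (M passes over x), and each row ends
// in a horizontal reduction.
void matvec_rowmaj(const float* A, const float* x, float* y, size_t M, size_t K) {
    for (size_t i = 0; i < M; ++i) {
        float dot = 0.0f;
        for (size_t j = 0; j < K; ++j)
            dot += A[i * K + j] * x[j];
        y[i] = dot;
    }
}

// Column-major schedule: broadcast one x[j] and accumulate into all of y.
// Each x[j] is read once; there is no horizontal reduction.
void matvec_colmaj(const float* A, const float* x, float* y, size_t M, size_t K) {
    for (size_t i = 0; i < M; ++i) y[i] = 0.0f;
    for (size_t j = 0; j < K; ++j)
        for (size_t i = 0; i < M; ++i)
            y[i] += A[j * M + i] * x[j];  // A stored column major here
}
```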
3) What data order/layout to use?
• Row major or Column major
• Column major for locality of access w/ repeated multiply-add method
• Interleaved complex or Split complex
• Interleaved is the standard/typical layout for complex numbers
Interleaved complex layout: one buffer (complex* mat) storing r1, i1, r2, i2, …, rk, ik
Split complex layout: two buffers (mat_real and mat_imag), one storing r1, r2, …, rk and the other i1, i2, …, ik
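A minimal C++ sketch of the two layouts (names are illustrative):

```cpp
#include <complex>
#include <vector>

// Interleaved layout: real and imaginary parts adjacent in one buffer
// (the standard std::complex / BLAS convention: r0 i0 r1 i1 ...).
std::vector<std::complex<float>> mat_interleaved;

// Split layout: separate planes for real and imaginary parts.
std::vector<float> mat_real;   // r0 r1 r2 ...
std::vector<float> mat_imag;   // i0 i1 i2 ...
```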
4) How to write the code?
• Using Intel Intrinsics
• Vector instructions wrapped in C++ style functions
• “Higher level” programming, less fine-tuned control
• Register usage and instruction ordering determined by compiler
• Writing x86_64 assembly by hand
• Lowest level of programming, most fine-tuned control of instructions
• Register usage and instruction ordering manually determined
• Prone to error (Compilers are pretty smart/safe, while programmers
can introduce bugs)
Both methods are non-portable → they only run on CPUs that support the instructions used
Editor's Notes
#39: X should be read less often from memory in the column major layout too?