A FRAMEWORK FOR 
PRACTICAL FAST MATRIX 
MULTIPLICATION 
1 
[Figure: parallel performance of Strassen on <N,N,N>; effective GFLOPS/core vs. dimension N (0 to 15000); curves for MKL, DFS, BFS, and HYBRID at 6 and 24 cores]
arXiv: 1409.2908 
Austin Benson (arbenson@stanford.edu), ICME, Stanford 
Grey Ballard, Sandia National Laboratories 
BLIS Retreat, September 26, 2014
Fast matrix multiplication: 
bridging theory and practice 
2 
• There are a number of Strassen-like algorithms for matrix 
multiplication that have only been “discovered” recently. 
[Smirnov13], [Benson&Ballard14] 
• We show that they can achieve higher performance than MKL (sequentially, and sometimes in parallel). 
• We use code generation to do extensive prototyping. There 
are several practical issues, and there is plenty of room for 
improvement (lots of expertise at UT to help here!) 
[Figure: matrix-multiplication exponent on a number line from 2 to 3, marking 2.81 [Strassen79] and 2.37 [Williams12]]
Strassen’s algorithm 
3
4 
Key ingredients of Strassen’s algorithm 
• 1. Block partitioning of matrices (<2, 2, 2>) 
• 2. Seven linear combinations of sub-blocks of A 
• 3. Seven linear combinations of sub-blocks of B 
• 4. Seven matrix multiplies to form Mr (recursive) 
• 5. Linear combinations of Mr to form Cij (a one-level sketch in code follows below)
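To make the five steps concrete, here is a minimal one-level sketch in C++ using Eigen (Eigen and the function name are assumptions of this illustration; the paper's generated code recurses and calls an optimized dgemm instead). The S_r and T_r combinations are the classical Strassen formulas reproduced in the speaker notes at the end of this deck.

#include <Eigen/Dense>
using Eigen::MatrixXd;

// One-level Strassen sketch for n x n matrices with even n (illustrative only;
// the real implementation recurses and calls an optimized dgemm at the base case).
MatrixXd strassen_step(const MatrixXd& A, const MatrixXd& B) {
  const long h = A.rows() / 2;
  // 1. <2,2,2> block partitioning
  MatrixXd A11 = A.block(0, 0, h, h), A12 = A.block(0, h, h, h),
           A21 = A.block(h, 0, h, h), A22 = A.block(h, h, h, h);
  MatrixXd B11 = B.block(0, 0, h, h), B12 = B.block(0, h, h, h),
           B21 = B.block(h, 0, h, h), B22 = B.block(h, h, h, h);
  // 2./3. Seven linear combinations S_r of A-blocks and T_r of B-blocks;
  // 4. seven multiplies M_r = S_r * T_r (recursive in the real algorithm)
  MatrixXd M1 = (A11 + A22) * (B11 + B22);
  MatrixXd M2 = (A21 + A22) * B11;
  MatrixXd M3 = A11 * (B12 - B22);
  MatrixXd M4 = A22 * (B21 - B11);
  MatrixXd M5 = (A11 + A12) * B22;
  MatrixXd M6 = (A21 - A11) * (B11 + B12);
  MatrixXd M7 = (A12 - A22) * (B21 + B22);
  // 5. Linear combinations of M_r form the blocks of C
  MatrixXd C(A.rows(), B.cols());
  C.block(0, 0, h, h) = M1 + M4 - M5 + M7;
  C.block(0, h, h, h) = M3 + M5;
  C.block(h, 0, h, h) = M2 + M4;
  C.block(h, h, h, h) = M1 - M2 + M3 + M6;
  return C;
}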
Key ingredients of fast matmul algorithms 
• 1. Block partitioning of matrices (<M, K, N>) 
• 2. R linear combinations of sub-blocks of A 
• 3. R linear combinations of sub-blocks of B 
• 4. R matrix multiplies to form Mr (recursive) 
R < MKN ⇒ faster than classical (quantified below) 
• 5. Linear combinations of Mr to form Cij 
5
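Why R < MKN pays off asymptotically (a standard fact about recursive bilinear algorithms, stated here for reference, with Strassen as the example):

\[
  \omega_0 \;=\; 3\log_{MKN} R \;=\; \log_{MKN} R^3 ,
  \qquad R < MKN \;\Longrightarrow\; \omega_0 < 3 .
\]
\[
  \langle 2,2,2\rangle,\ R = 7:\quad \omega_0 = 3\log_{8} 7 \approx 2.81 .
\]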
“Outer product” fast algorithm 
• <4, 2, 4> partitioning 
• R = 26 multiplies (< 4 * 2 * 4 = 32) 
⇒ 23% speedup per recursive step (if everything else were free; arithmetic below) 
• Linear combinations of Aij to form Sr: 68 terms 
• Linear combinations of Bij to form Tr: 52 terms 
• Linear combinations of Mr to form Cij: 69 terms 
6
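Where the 23% comes from (counting only the multiplies, since the slide assumes the additions are free):

\[
  \frac{4 \cdot 2 \cdot 4}{R} \;=\; \frac{32}{26} \;\approx\; 1.23 ,
  \qquad \omega_0 = 3\log_{32} 26 \approx 2.82 .
\]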
Discovering fast algorithms is a 
numerical challenge 
7 
• Low-rank tensor decompositions lead to fast algorithms 
• Tensors are small, but we need exact decompositions ⇒ NP-hard 
• Use alternating least squares with regularization and 
rounding tricks [Smirnov13], [Benson&Ballard14] 
• We have around 10 fast algorithms for <M, K, N> 
decompositions. Also have permutations, e.g., <K, M, N>.
8 
[Figure: example fast algorithms from [Strassen69] and [Smirnov13]]
Code generation lets us prototype 
algorithms quickly 
9 
• We have a compact representation of many fast algorithms (sketched below): 
1. dimensions of block partitioning (<M, K, N>) 
2. linear combinations of sub-blocks (Sr, Tr) 
3. linear combinations of Mr to form Cij 
• We use code generation to rapidly prototype fast algorithms 
• Our approach: test all algorithms on a bunch of different 
problem sizes and look for patterns
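As a rough illustration of what such a compact representation can look like (the generator's exact storage format is not shown on these slides, so the layout and names below are assumptions): an <M, K, N> algorithm of rank R can be stored as the partition dimensions plus three coefficient matrices U (MK x R), V (KN x R), and W (MN x R); column r of U defines S_r, column r of V defines T_r, and W recombines the M_r into the blocks of C.

#include <vector>
#include <Eigen/Dense>
using Eigen::MatrixXd;

// Sketch: build S_r from the sub-blocks of A and column r of the coefficient
// matrix U.  Forming T_r from V, and C's blocks from W, is analogous.
MatrixXd form_operand(const std::vector<MatrixXd>& blocks,  // the M*K sub-blocks of A
                      const MatrixXd& U, int r) {
  MatrixXd S = MatrixXd::Zero(blocks[0].rows(), blocks[0].cols());
  for (int i = 0; i < static_cast<int>(blocks.size()); ++i)
    if (U(i, r) != 0.0)             // most coefficients are 0 or +/- 1
      S += U(i, r) * blocks[i];     // S_r = sum_i U(i, r) * A_i
  return S;
}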
Practical issues 
10 
• Best way to do matrix additions? (in paper) 
• Can we eliminate redundant linear combinations? (in paper) 
• Different problem shapes other than square (this talk) 
• When to stop recursion? (this talk) 
• How to parallelize? (this talk) 
Recursion cutoff: look at gemm curve 
[Figure: sequential dgemm performance (GFLOPS vs. dimension N, 0 to 3000) for N x 800 x 800, N x 800 x N, and N x N x N, with peak marked; and parallel dgemm performance on 24 cores (GFLOPS/core vs. dimension N, 0 to 8000)]
Basic idea: take another recursive step if the sub-problems will still operate at high performance (a decision sketch follows below). 
11 
<M, K, N> = <4, 2, 3>
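A minimal sketch of that decision for the <4, 2, 3> partition (the cutoff value is an assumption for illustration; in practice it is read off dgemm curves like the ones above):

// Take another <4,2,3> recursive step only if the resulting sub-problem
// dimensions are still large enough for dgemm to run near peak.
bool take_recursive_step(long m, long k, long n,
                         long cutoff /* machine-dependent, e.g. ~1000 */) {
  return (m / 4 >= cutoff) && (k / 2 >= cutoff) && (n / 3 >= cutoff);
}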
Sequential performance 
12 
[Figure: sequential performance on N x N x N; effective GFLOPS vs. dimension N (0 to 8000); curves for MKL, STRASSEN, <3,2,2>, <3,2,4>, <4,2,3>, <3,4,2>, <3,3,3>, <4,2,4>, <2,3,4>; true peak marked]
Effective GFLOPS for an M x K x N multiply = 1e-9 * 2 * M * K * N / (time in seconds)
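In code, with the convention that M, K, and N are always the dimensions of the original problem (the helper name is an assumption); since a fast algorithm performs fewer arithmetic operations than the classical one, its effective GFLOPS can legitimately exceed the hardware peak:

// Classical flop count divided by measured time, regardless of how many flops
// the fast algorithm actually performed.
double effective_gflops(long m, long k, long n, double seconds) {
  return 1e-9 * 2.0 * double(m) * double(k) * double(n) / seconds;
}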
Sequential performance 
[Figure: sequential performance on N x N x N; effective GFLOPS vs. dimension N (0 to 8000); curves for MKL, STRASSEN, <4,4,2>, <4,3,3>, <3,4,3>, <3,3,6>, <3,6,3>, <6,3,3>]
• All algorithms beat MKL on large problems 
• Strassen’s algorithm is hard to beat 
13
Sequential performance 
[Figure: sequential performance on N x 1600 x N; effective GFLOPS vs. dimension N (2000 to 12000); curves for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN]
• Almost all algorithms beat MKL 
• <4, 2, 4> and <3, 2, 3> tend to perform the best 
14
Sequential performance 
15 
[Figure: sequential performance on N x 2400 x 2400; effective GFLOPS vs. dimension N (10000 to 18000); curves for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN]
• Almost all algorithms beat MKL 
• <4, 3, 3> and <4, 2, 3> tend to perform the best 
Parallelization 
[Figure: recursion tree; at each node, C is assembled from the seven subproblem products M1, M2, ..., M7, each of which can recurse again]
16
DFS Parallelization 
[Figure: DFS recursion tree; all threads work on one subproblem at a time, using parallel MKL at the base case]
17 
+ Easy to implement 
+ Load balanced 
+ Same memory footprint as sequential 
- Need large base cases for high performance
BFS Parallelization 
[Figure: BFS recursion tree; each of the seven subproblems runs as a separate task on one thread, and omp taskwait synchronizes before the output additions (an OpenMP task sketch follows below)]
18 
+ High performance for smaller base cases 
- Sometimes harder to load balance: 24 threads, 49 subproblems 
- More memory
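A minimal sketch of this task structure in C++ with OpenMP (assumptions: sequential_fast_matmul stands in for a routine like the Strassen sketch earlier, and the S_r, T_r operands have already been formed):

#include <vector>
#include <Eigen/Dense>
#include <omp.h>
using Eigen::MatrixXd;

MatrixXd sequential_fast_matmul(const MatrixXd& S, const MatrixXd& T);  // e.g., the earlier sketch

// Compute the subproblems M_r = S_r * T_r, one OpenMP task (one thread) each,
// and wait for all of them before the output additions that form C.
std::vector<MatrixXd> bfs_subproblems(const std::vector<MatrixXd>& S,
                                      const std::vector<MatrixXd>& T) {
  std::vector<MatrixXd> M(S.size());
  #pragma omp parallel
  #pragma omp single
  {
    for (std::size_t r = 0; r < S.size(); ++r) {
      #pragma omp task firstprivate(r) shared(M, S, T)
      M[r] = sequential_fast_matmul(S[r], T[r]);
    }
    #pragma omp taskwait   // all M_r must be ready before combining them into C
  }
  return M;
}

DFS is the opposite extreme: no tasks, and the base-case multiplies call threaded MKL; HYBRID mixes the two so that 24 cores are not left idle by 7 (or 49) single-thread tasks.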
HYBRID parallelization 
[Figure: HYBRID recursion tree; some subproblems run as single-thread tasks while the remainder uses all threads, with omp taskwait separating the phases]
19 
+ Better load balancing 
- Needs explicit synchronization, or else we can over-subscribe threads
20 
[Figure: parallel performance of <4,2,4> on <N,2800,N>; effective GFLOPS/core vs. dimension N; curves for MKL, DFS, BFS, and HYBRID at 6 and 24 cores]
Bandwidth problems 
• We rely on matrix multiplications being much more expensive than matrix additions 
• Parallel dgemm on 24 cores: easily get 50-75% of peak 
• STREAM benchmark: < 6x speedup in read/write 
performance on 24 cores 
21
Parallel performance 
22 
[Figure: performance on N x N x N at 6 cores (left) and 24 cores (right); effective GFLOPS/core vs. dimension N (9000 to 13000); curves for MKL, STRASSEN, <3,2,2>, <3,2,4>, <4,2,3>, <3,4,2>, <3,3,3>, <4,2,4>, <2,3,4>]
• 6 cores: similar performance to sequential 
• 24 cores: can sometimes beat MKL, but barely
Parallel performance 
[Figure: performance on N x 2800 x N at 6 cores (left) and 24 cores (right); effective GFLOPS/core vs. dimension N; curves for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN; the 24-core panel marks a region of bad MKL performance]
• 6 cores: similar performance to sequential 
• 24 cores: MKL best for large problems 
23
Parallel performance 
24 
[Figure: performance on N x 3000 x 3000 at 6 cores (left) and 24 cores (right); effective GFLOPS/core vs. dimension N; curves for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN]
• 6 cores: similar performance to sequential 
• 24 cores: MKL usually the best 
High-level conclusions 
25 
• For square matrix multiplication, Strassen’s algorithm is 
hard to beat 
• For rectangular matrix multiplication, use a fast algorithm 
that “matches the shape” 
• Bandwidth limits the performance of shared memory parallel fast matrix multiplication 
⇒ should be less of an issue in distributed memory 
Future work: 
• Numerical stability 
• Using fast matmul as a kernel for other algorithms in 
numerical linear algebra
A FRAMEWORK FOR 
PRACTICAL FAST MATRIX 
MULTIPLICATION 
26 
[Figure: parallel performance of Strassen on <N,N,N>; effective GFLOPS/core vs. dimension N (0 to 15000); curves for MKL, DFS, BFS, and HYBRID at 6 and 24 cores]
arXiv: 1409.2908 
Austin Benson (arbenson@stanford.edu), ICME, Stanford 
Grey Ballard, Sandia National Laboratories 
BLIS Retreat, September 26, 2014
Matrix additions (linear combinations) 
[Figure: forming S1, ..., S7 from the blocks A11, A12, A21, A22 "pairwise"; each Sr is assembled with repeated DAXPY calls (e.g., 2x DAXPY per Sr)]
27
Matrix additions (linear combinations) 
[Figure: forming S1, ..., S7 from A11, A12, A21, A22 "write once"; each Sr is written once by a custom "DAXPY"-like loop over all of its operands]
28
Matrix additions (linear combinations) 
[Figure: forming S1, ..., S7 from A11, A12, A21, A22 by "streaming"; entry-wise updates read the A blocks and update all Sr together (a code sketch contrasting these variants follows below)]
29
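A small sketch contrasting the first two variants on raw arrays (illustrative only; the generated code in the paper operates on sub-blocks of column-major matrices with their own leading dimensions, which is omitted here):

#include <vector>

// "Pairwise": a t-term linear combination via t daxpy-like passes;
// S is written, and then re-read, once per term.
void pairwise(double* S, const std::vector<const double*>& A,
              const std::vector<double>& coeff, std::size_t len) {
  for (std::size_t i = 0; i < len; ++i) S[i] = coeff[0] * A[0][i];
  for (std::size_t t = 1; t < A.size(); ++t)
    for (std::size_t i = 0; i < len; ++i) S[i] += coeff[t] * A[t][i];
}

// "Write once": a custom loop reads every operand but writes S exactly once.
void write_once(double* S, const std::vector<const double*>& A,
                const std::vector<double>& coeff, std::size_t len) {
  for (std::size_t i = 0; i < len; ++i) {
    double s = 0.0;
    for (std::size_t t = 0; t < A.size(); ++t) s += coeff[t] * A[t][i];
    S[i] = s;
  }
}

The "streaming" variant goes one step further: it reads each A block once and updates every Sr that the block contributes to.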
Common subexpression elimination (CSE) 
• Example in the <4, 2, 4> algorithm (R = 26 multiplies): 
T11 = B24 - (B12 + B22) 
T25 = B23 + B12 + B22 
Four additions, six reads, two writes 
30
Common subexpression elimination (CSE) 
• Example in the <4, 2, 4> algorithm (R = 26 multiplies): 
Y = B12 + B22 
T11 = B24 - Y 
T25 = B23 + Y 
Three additions, six reads, three writes 
⇒ Net increase in communication! 
31
CSE does not really help 
Effective GFLOPS for an M x K x N multiply = 1e-9 * 2 * M * K * N / (time in seconds) 
32
Editor's Notes
  • #4:
    \begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{bmatrix} =
    \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \cdot
    \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix}
    \begin{align*}
    S_1 &= A_{11} + A_{22} & T_1 &= B_{11} + B_{22} \\
    S_2 &= A_{21} + A_{22} & T_2 &= B_{11} \\
    S_3 &= A_{11}          & T_3 &= B_{12} - B_{22} \\
    S_4 &= A_{22}          & T_4 &= B_{21} - B_{11} \\
    S_5 &= A_{11} + A_{12} & T_5 &= B_{22} \\
    S_6 &= A_{21} - A_{11} & T_6 &= B_{11} + B_{12} \\
    S_7 &= A_{12} - A_{22} & T_7 &= B_{21} + B_{22}
    \end{align*}
  • #9:
    \begin{table}
    \begin{tabular}{l c c c c}
    Algorithm base case & Multiplies (fast) & Multiplies (classical) & Speedup per recursive step & Exponent \\
    $\langle 2,2,3 \rangle$ & 11 & 12 & 9\%  & 2.89 \\
    $\langle 2,2,5 \rangle$ & 18 & 20 & 11\% & 2.89 \\
    $\langle 2,2,2 \rangle$ & 7  & 8  & 14\% & 2.81 \\
    $\langle 2,2,4 \rangle$ & 14 & 16 & 14\% & 2.85 \\
    $\langle 3,3,3 \rangle$ & 23 & 26 & 17\% & 2.85 \\
    $\langle 2,3,3 \rangle$ & 15 & 18 & 20\% & 2.81 \\
    $\langle 2,3,4 \rangle$ & 20 & 24 & 20\% & 2.83 \\
    $\langle 2,4,4 \rangle$ & 26 & 32 & 23\% & 2.82 \\
    $\langle 3,3,4 \rangle$ & 29 & 36 & 24\% & 2.82 \\
    $\langle 3,4,4 \rangle$ & 38 & 48 & 26\% & 2.82 \\
    $\langle 3,3,6 \rangle$ & 40 & 54 & 35\% & 2.77 \\
    \end{tabular}
    \end{table}
  • #17:
    \begin{eqnarray*} S_7 &=& A_{12} - A_{22} \\ T_7 &=& B_{21} + B_{22} \\ M_7 &=& S_7 \cdot T_7 \end{eqnarray*}
  • #18:
    \begin{eqnarray*} S_7 &=& A_{12} - A_{22} \\ T_7 &=& B_{21} + B_{22} \\ M_7 &=& S_7 \cdot T_7 \end{eqnarray*}
  • #19:
    \begin{eqnarray*} S_7 &=& A_{12} - A_{22} \\ T_7 &=& B_{21} + B_{22} \\ M_7 &=& S_7 \cdot T_7 \end{eqnarray*}
  • #31:
    \begin{align*} T_{11} &= B_{24} - \left(B_{12} + B_{22}\right) \\ T_{25} &= B_{23} + B_{12} + B_{22} \end{align*}
  • #32:
    \begin{align*} Y &= B_{12} + B_{22} \\ T_{11} &= B_{24} - Y \\ T_{25} &= B_{23} + Y \end{align*}