Approximate Matrix Multiplication and Space Partitioning Trees:
An Exploration
N.P. Slagle, Lance J. Fortnow
October 23, 2012
Abstract
Herein we explore a dual tree algorithm for matrix multiplication of A ∈ R^{M×D} and B ∈ R^{D×N}, very narrowly effective if the normalized rows of A and columns of B, treated as vectors in R^D, fall into clusters of size Ω(D^τ) with radii less than arcsin(ε/√2) on the surface of the unit D-ball. The algorithm leverages a pruning rule necessary to guarantee precision proportionate to the vector magnitude products in the resultant matrix. Unfortunately, if the rows and columns are uniformly distributed on the surface of the unit D-ball, then the expected number of points per required cluster approaches zero exponentially fast in D; thus, the approach requires a great deal of work to pass muster.
1 Introduction and Related Work
Matrix multiplication, ubiquitous in computing, naively requires O(MDN) floating point operations to multiply together matrices A ∈ R^{M×D} and B ∈ R^{D×N}. We present an investigation of our novel approach to matrix multiplication after a brief discussion of related work and an explanation of space-partitioning trees.
1.1 State-of-the-Art for Square Matrices
For N = D = M, Strassen [12] gave an O(N^{log₂ 7}) algorithm that partitions the matrices into blocks, generalizing the notion that to multiply binary integers a and b, one need only compute [(a + b)² − (a − b)²]/4, an operation requiring three additions, two squarings, and a right shift. Several improvements appear in the literature [1], [6], [8], [10], the most recent of which give O(N^{2.3736...}) [11] and O(N^{2.3727...}) [13], both augmentations of the Coppersmith-Winograd algorithm [2]. The latest algorithms feature constants sufficiently large to preclude application on modern hardware [9].
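As a toy illustration of the squaring identity above (the scalar analogue only, not Strassen's block algorithm), a brief Python check:

def multiply_via_squares(a: int, b: int) -> int:
    # a*b = [(a + b)^2 - (a - b)^2] / 4; the division by 4 is a right shift by two bits
    return ((a + b) ** 2 - (a - b) ** 2) >> 2

# sanity check over a small range of integers
assert all(multiply_via_squares(a, b) == a * b
           for a in range(-20, 21) for b in range(-20, 21))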
1.2 Motivating the Space
The product of A in R^{M×D} and B in R^{D×N} features all possible inner products between the row vectors of A and the column vectors of B, each an element of R^D. We investigate whether organizing these two sets of vectors into space-partitioning trees can reduce the complexity of the naïve matrix multiplication by exploiting the distribution of the data.
1.2.1 Space-Partitioning Trees
We can organize a finite collection of points S in Euclidean space R^D into a space-partitioning tree T such that the root node P0 contains all points in S, and for any other node P in T, all points in P are in π(P), the parent node of P. Figure 1 depicts a space-partitioning tree in R^2.
Figure 1: A space-partitioning tree in R^2; the ellipses denote covariances on the node points; the large dots denote node centroids.
A space-partitioning tree definition requires a recursive partitioning rule, such as that appearing
in algorithm 1. Organizing S into such a tree generally requires O(D|S| log(D|S|)) time complexity.
Algorithm 1 [L, R] = partition(P, m)
1: If |P| ≤ m, then RETURN [NULL, NULL].
2: Pick the dimension k that maximizes the range of xk for x ∈ P.
3: Sort the points in P according to dimension k.
4: Split P into L and R using the median (or mean) of xk.
5: RETURN [L, R].
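A minimal NumPy sketch of this partitioning rule follows; the node P is represented as an n × D array of points, and the function mirrors the pseudocode above (an illustrative reading, not a reference implementation).

import numpy as np

def partition(P, m=1):
    # Leaf check: nodes with at most m points are not split further.
    if len(P) <= m:
        return None, None
    # Dimension k with the largest coordinate range over the points in P.
    k = np.argmax(P.max(axis=0) - P.min(axis=0))
    # Sort the points by dimension k and split at the median.
    P = P[np.argsort(P[:, k])]
    half = len(P) // 2
    return P[:half], P[half:]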
1.2.2 Dual Tree Algorithm
Given a reference tree R and a query tree Q of data points, we can perform pairwise operations
such as kernel summations and inner products across across nodes rather than points, performing
a depth-first search on both trees. The algorithm leverages a pruning criterion to guarantee level
approximation in the outputs. Algorithm 2 exhibits this approach.
Algorithm 2 dualTreeCompareNodes(R, Q, operation op, pruning rule R, ε, C)
1: If R and Q are leaf nodes, then perform the point-wise operation, filling in appropriate entries
of C. RETURN
2: If rule R(R, Q) is true, approximate op between points in the nodes using their centroids, filling
in appropriate entries of C; then RETURN.
3: Call
• dualTreeCompareNodes(R.left, Q.left)
• dualTreeCompareNodes(R.left, Q.right)
• dualTreeCompareNodes(R.right, Q.left)
• dualTreeCompareNodes(R.right, Q.right)
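A Python sketch of this traversal appears below. The node attributes (left, right, points, indices, centroid) are assumptions about how the trees might be stored, and, anticipating algorithm 3, the sketch multiplies its results into C, which is presumed to already hold the magnitude products.

def dualTreeCompareNodes(R, Q, op, rule, eps, C):
    # Base case: two leaves, so perform the exact point-wise operation.
    if R.left is None and Q.left is None:
        for i, r in zip(R.indices, R.points):
            for j, q in zip(Q.indices, Q.points):
                C[i, j] *= op(r, q)
        return
    # Prune: approximate op over all pairs in R x Q by the centroid-centroid value.
    if rule(R, Q, eps):
        approx = op(R.centroid, Q.centroid)
        for i in R.indices:
            for j in Q.indices:
                C[i, j] *= approx
        return
    # Otherwise recurse on child pairs (a leaf recurses against the other node's children).
    R_children = (R,) if R.left is None else (R.left, R.right)
    Q_children = (Q,) if Q.left is None else (Q.left, Q.right)
    for Rc in R_children:
        for Qc in Q_children:
            dualTreeCompareNodes(Rc, Qc, op, rule, eps, C)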
1.2.3 Space-Partitioning Trees in the Literature
Applied statistical methods such as dual tree approximate kernel summations [3], [4], [5] and other pairwise statistical problems [7] partition the query and test samples into respective space-partitioning trees for efficient look-ups. Using cover trees, Ram et al. [7] demonstrate linear time complexity for naïve O(N^2) pairwise algorithms.
2 Dual Tree Investigation
2.1 Product Matrix Entries
Given the two matrices A ∈ R^{M×D} and B ∈ R^{D×N}, we can think of the entries of C = AB as cij = |ai||bj| cos θij, where ai is the ith row of A, bj is the jth column of B, and θij is the angle between ai and bj. We can compute the magnitudes of these vectors in time O(D(M + N)) and all products of the magnitudes in time O(MN), for a total time complexity of O(MN + D(M + N)). Thus, computing the cosines of the angles for M, N ∈ O(D) is the O(MDN) bottleneck. We give narrow conditions under which we can reduce this complexity.
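For concreteness, the decomposition can be written directly in NumPy (a sketch; nonzero rows and columns are assumed), which makes the O(D(M + N)) magnitude step, the O(MN) product step, and the O(MDN) cosine bottleneck explicit:

import numpy as np

def product_via_angles(A, B):
    # |a_i| and |b_j|: O(D(M + N))
    row_norms = np.linalg.norm(A, axis=1)
    col_norms = np.linalg.norm(B, axis=0)
    # Normalized rows of A and columns of B (assumes no zero rows/columns).
    U = A / row_norms[:, None]
    V = B / col_norms[None, :]
    # cos(theta_ij) for all pairs: the O(MDN) bottleneck.
    cosines = U @ V
    # c_ij = |a_i| |b_j| cos(theta_ij): O(MN)
    return np.outer(row_norms, col_norms) * cosines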
2.2 Algorithm
In our investigation, we normalize the row vectors of A and the column vectors of B, then organize each set into a ball tree, a space-partitioning tree such that each node is a D-ball. To compute the cosines of the angles between all pairs, we apply the dual tree algorithm. The pruning rule must guarantee that the error of our estimate ĉij relative to the full magnitude product |ai||bj| be no more than ε, or, more formally,

|ĉij − cij| ≤ ε|ai||bj|. (1)

Thus, we require

|cos θ̂ij − cos θij| ≤ ε, (2)

where θ̂ij denotes our estimate of the angle. The pruning rule guaranteeing the above error bound appears in algorithm 3.
Algorithm 3 dualTreeMatrixMultiplication(A, B, ε)
1: Allocate M × N matrix C.
2: Compute the magnitudes of ai and bj for i = 1, . . . , M, j = 1, . . . , N.
3: Fill in C so that cij = |ai||bj|.
4: Compute ui = ai/|ai|, vj = bj/|bj|.
5: Allocate trees U and V with root(U) = {ui} and root(V) = {vj}.
6: Call partition(root(U), size), partition(root(V), size), with size the minimum number of points (defaulted to one) per tree node.
7: Let op(s, t) = ⟨s, t⟩.
8: For node balls R ∈ U, Q ∈ V, define
• α := angle between the centers of R, Q,
• β := angle subtending half of the node ball R, and
• γ := angle subtending half of the node ball Q,
all angles in [0, π].
9: Define the pruning rule R as an evaluation of β + γ ≤ ε/(|sin α| + |cos α|).
10: Call dualTreeCompareNodes(root(U), root(V), op, R, ε, C).
11: RETURN C.
We could define a more conservative pruning rule of β + γ ≤ ε/√2 ≤ ε/(|sin α| + |cos α|), since |sin α| + |cos α| ≤ √2. For future analyses, we apply the more conservative bound.
Figure 2 exhibits the relationships between angles α, β, and γ.
Figure 2: Angles in algorithm 3.
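In code, the rule of step 9 might look as follows (a sketch; the unit centroids r̄, q̄ and angular radii β, γ are assumed to be stored with each node ball):

import numpy as np

def should_prune(r_bar, q_bar, beta, gamma, eps):
    # alpha: angle between the node centers, clipped for numerical safety.
    alpha = np.arccos(np.clip(np.dot(r_bar, q_bar), -1.0, 1.0))
    # Pruning rule of algorithm 3: beta + gamma <= eps / (|sin alpha| + |cos alpha|).
    return beta + gamma <= eps / (abs(np.sin(alpha)) + abs(np.cos(alpha)))

def should_prune_conservative(beta, gamma, eps):
    # The more conservative rule, independent of alpha: beta + gamma <= eps / sqrt(2).
    return beta + gamma <= eps / np.sqrt(2)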
2.2.1 Proof of the Pruning Rule
Simply put, the pruning rule in algorithm 3 bounds the largest possible error on the cosine function
in terms of the center-to-center angle (our approximation) and the angles subtending the balls R
and Q, formally stated in theorem 1.
Theorem 1. Given ball nodes R and Q and angles as defined in algorithm 3, if β + γ ≤ ε/(|sin α| + |cos α|), then |r · q − cos α| ≤ ε for all r ∈ R, q ∈ Q.
To prove theorem 1, we need the following lemma.
Lemma 2. Given both the ball nodes R and Q and angles listed in theorem 1, let error(r, q) = |r · q − cos α|. The maximum of error occurs when r and q are in the span of the two centers of R and Q. Furthermore, the maxima of error are |cos(α ± β ± γ) − cos α|.
Proof. Let r̄ and q̄ be the centers of R and Q, respectively. Since cos θ is monotone for θ ∈ [0, π], the extrema of the error function occur when r and q fall on the surfaces of R and Q, respectively. Furthermore, we only care about the extrema of r · q, since the maxima and minima of this function bound the error about cos α. Thus, we optimize r · q subject to r̄ · q̄ = cos α, r̄ · r = cos β, q̄ · q = cos γ, and r · r = q · q = r̄ · r̄ = q̄ · q̄ = 1.
Leveraging Lagrange multipliers, we obtain the solutions

r = r̄ [cos β ∓ cot α sin β] ± q̄ (sin β / sin α)

and

q = ± r̄ (sin γ / sin α) + q̄ [cos γ ∓ cot α sin γ],

with

r · q = cos α cos β cos γ ± cos α sin β sin γ ± sin α cos β sin γ ± sin α sin β cos γ = cos(α ± β ± γ),

the signs determined by the orientations of r and q. Notice, the possible values of r and q maximizing the error are simply the edges of the cones subtending balls R and Q in the hyperplane spanned by r̄ and q̄. Now, we prove theorem 1.
Proof. By hypothesis, (β + γ)[|sin α| + |cos α|] ≤ ε. Since β + γ ≥ |± β ± γ|, |sin h| ≤ |h|, and |1 − cos h| ≤ |h| for β, γ ∈ [0, π] and h ∈ [−π, π], we have

ε ≥ |± β ± γ| |sin α| + |± β ± γ| |cos α| ≥ |sin α| |sin(± β ± γ)| + |cos α| |1 − cos(± β ± γ)|,

and so, by the triangle inequality,

ε ≥ |cos α cos(± β ± γ) − sin α sin(± β ± γ) − cos α| = |cos(α ± β ± γ) − cos α|.

Combined with lemma 2, this bounds |r · q − cos α| by ε for every r ∈ R and q ∈ Q.
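A brute-force numerical check of theorem 1 over random angle triples (purely illustrative, not part of the argument):

import numpy as np

rng = np.random.default_rng(0)
eps = 0.1
for _ in range(10000):
    alpha = rng.uniform(0.0, np.pi)
    budget = eps / (abs(np.sin(alpha)) + abs(np.cos(alpha)))
    # Draw beta, gamma satisfying the pruning hypothesis beta + gamma <= budget.
    beta = rng.uniform(0.0, budget)
    gamma = rng.uniform(0.0, budget - beta)
    # Worst-case error from lemma 2: |cos(alpha +/- beta +/- gamma) - cos(alpha)|.
    worst = max(abs(np.cos(alpha + s1 * beta + s2 * gamma) - np.cos(alpha))
                for s1 in (-1.0, 1.0) for s2 in (-1.0, 1.0))
    assert worst <= eps + 1e-12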
2.2.2 Analysis of Algorithm 3
Given matrices A in R^{M×D} and B in R^{D×N}, computing the magnitudes, normalizing the rows of A and columns of B, and computing magnitude products for C requires O(D(M + N) + MN). Organizing the normalized points into space-partitioning trees requires O(MD log MD + ND log ND). Finally, an analysis of the dual tree algorithm requires conditions on the data points. We suppose that given the approximation constant ε, the number of points falling in node balls of appropriate size, say radius roughly arcsin(ε/√2), is bounded below by fD(ε). If the points are clustered into such balls, each prune saves the computation of at least D[fD(ε)]^2 floating point operations. So we can fill in [fD(ε)]^2 entries of C with a constant number of inner products at cost O(D), for a total complexity of O(MDN/[fD(ε)]^2). Thus, we have the following theorem.
Theorem 3. The total time complexity of algorithm 3 is O(MD log MD + ND log ND + MN(1 + D/[fD(ε)]^2)).
2.3 Gaping Caveat
An obvious caveat in the analysis is the behavior of fD(ε) as D increases without bound. For a rough sketch of the expected behavior of fD, recall that the volume and surface area of a D-ball of radius r are
VD(r) = [2^D π^{(D−1)/2} Γ((D+1)/2) / (D Γ(D))] r^D (3)

and

SAD(r) = VD'(r) = [2^D π^{(D−1)/2} Γ((D+1)/2) / Γ(D)] r^{D−1}. (4)
Since uniformly distributed data represents something of a worst-case scenario with respect to
clustering algorithms, we explore the expected cluster sizes by dividing the surface of the unit
D-ball by the node balls of appropriate size.
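As a quick sanity check, equation (3) agrees with the more familiar closed form π^{D/2} r^D / Γ(D/2 + 1); the sketch below (assuming scipy is available) verifies this numerically for small D:

import numpy as np
from scipy.special import gamma

def ball_volume(D, r):
    # Equation (3): V_D(r) = 2^D * pi^((D-1)/2) * Gamma((D+1)/2) / (D * Gamma(D)) * r^D
    return 2.0**D * np.pi**((D - 1) / 2) * gamma((D + 1) / 2) / (D * gamma(D)) * r**D

for D in range(1, 16):
    assert np.isclose(ball_volume(D, 1.0), np.pi**(D / 2) / gamma(D / 2 + 1))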
Theorem 4. Assuming that the normalized rows of A and columns of B are uniformly distributed about the unit D-ball, let W be the number of points in each ball of radius arcsin(ε/√2). Then

E[W] ≈ MN VD−1(arcsin(ε/√2)) / SAD(1) = [MN Γ(D/2) / (2√π Γ((D+1)/2))] arcsin^{D−1}(ε/√2).
Thus, since Γ(x + 1/2)/Γ(x) ∈ O(x), exponentially few points fall into each ball of radius arcsin(ε/√2). We therefore require strong clustering conditions, stated formally below, if the dual tree approach described in algorithm 3 is to defeat naïve matrix multiplication.
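The decay is easy to tabulate; the sketch below evaluates the per-ball fraction E[W]/(MN) from theorem 4 for a fixed ε (again assuming scipy):

import numpy as np
from scipy.special import gamma

def expected_fraction(D, eps):
    # E[W] / (MN) from theorem 4:
    # Gamma(D/2) * arcsin(eps/sqrt(2))^(D-1) / (2 * sqrt(pi) * Gamma((D+1)/2))
    rho = np.arcsin(eps / np.sqrt(2))
    return gamma(D / 2) * rho**(D - 1) / (2.0 * np.sqrt(np.pi) * gamma((D + 1) / 2))

for D in (2, 8, 32, 128):
    # Decays roughly like arcsin(eps/sqrt(2))^D, i.e. exponentially fast in D.
    print(D, expected_fraction(D, eps=0.5))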
Theorem 5. If M = D = N and the normalized rows of A and columns of B form clusters of size Ω(D^τ) for τ > 0, where cluster radii are approximately arcsin(ε/√2) for √2 > ε > 0, then algorithm 3 runs in time O(D^2 log D + D^{3−2τ}).
3 Concluding Remarks and Future Work
Given the problem of multiplying together matrices A ∈ R^{M×D} and B ∈ R^{D×N}, we present a
dual tree algorithm effective if row vectors of the left matrix and column vectors of the right
matrix fall into clusters of size proportionate to some positive power τ of the dimension D of said
vectors. Unfortunately, worst-case uniformly distributed vectors give exponentially small cluster
sizes. Possible improvements include partitioning columns of A and rows of B so that the size of
clusters increases slightly while incurring a greater cost in tree construction and the number of
magnitudes to calculate, or appealing to the asymptotic orthogonality of vectors as D becomes
arbitrarily large. Clearly, the approach needs a great deal of work to be of practical interest.
4 References
1. D. Bini, M. Capovani, F. Romani, and G. Lotti. O(n^{2.7799}) complexity for n × n approximate matrix multiplication. Inf. Process. Lett., 8(5):234-235, 1979.
2. D. Coppersmith and S. Winograd. Matrix multiplication via arithmetic progressions. J. Symbolic Computation, 9(3):251-280, 1990.
3. A.G. Gray and A.W. Moore. N-Body Problems in Statistical Learning. In T.K. Leen, T.G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13 (December 2000). MIT Press, 2001.
4. A.G. Gray and A.W. Moore. Rapid Evaluation of Multiple Density Models. In Artificial Intelligence and Statistics 2003, 2003.
5. M.P. Holmes, A.G. Gray, and C.L. Isbell Jr. Fast Kernel Conditional Density Estimation: A Dual Tree Monte Carlo Approach. Computational Statistics and Data Analysis, 1707-1718, 2010.
6. V.Y. Pan. Strassen's algorithm is not optimal. In Proc. FOCS, volume 19, pages 166-176, 1978.
7. P. Ram, D. Lee, W. March, and A.G. Gray. Linear-time Algorithms for Pairwise Statistical Problems. NIPS, 2010.
8. F. Romani. Some properties of disjoint sums of tensors related to matrix multiplication. SIAM J. Comput., pages 263-267, 1982.
9. S. Robinson. Toward an Optimal Algorithm for Matrix Multiplication. SIAM News, 38(9), 2005.
10. A. Schönhage. Partial and total matrix multiplication. SIAM J. Comput., 10(3):434-455, 1981.
11. A. Stothers. Ph.D. Thesis, University of Edinburgh, 2010.
12. V. Strassen. Gaussian Elimination is not Optimal. Numer. Math., 13:354-356, 1969.
13. V.V. Williams. Multiplying matrices faster than Coppersmith-Winograd. STOC, 2012.