Approximate Matrix Multiplication and Space Partitioning Trees:
An Exploration
N.P. Slagle, Lance J. Fortnow
October 23, 2012
Abstract
Herein we explore a dual tree algorithm for matrix multiplication of A ∈ R^{M×D} and B ∈ R^{D×N}, very narrowly effective if the normalized rows of A and columns of B, treated as vectors in R^D, fall into clusters of size Ω(D^τ) with radii less than arcsin(ε/√2) on the surface of the unit D-ball. The algorithm leverages a pruning rule necessary to guarantee precision proportionate to the vector magnitude products in the resultant matrix. Unfortunately, if the rows and columns are uniformly distributed on the surface of the unit D-ball, then the expected number of points per required cluster approaches zero exponentially fast in D; thus, the approach requires a great deal of work to pass muster.
1 Introduction and Related Work
Matrix multiplication, ubiquitous in computing, naively requires O(MDN) floating point operations to multiply together matrices A ∈ R^{M×D} and B ∈ R^{D×N}. We present an investigation of our novel approach to matrix multiplication after a brief discussion of related work and an explanation of space-partitioning trees.
1.1 State-of-the-Art for Square Matrices
For N = D = M, Strassen [12] gave an O(N^{log₂ 7}) algorithm that partitions the matrices into blocks, generalizing the notion that to multiply binary integers a and b, one need only compute [(a + b)² − (a − b)²]/4, an operation requiring three additions, two squarings, and a right shift. Several improvements appear in the literature [1], [6], [8], [10], the most recent of which give O(N^{2.3736...}) [11] and O(N^{2.3727...}) [13], both augmentations of the Coppersmith-Winograd algorithm [2]. The latest algorithms feature constants sufficiently large to preclude application on modern hardware [9].
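As a toy illustration of the squaring identity above (the scalar analogue only, not Strassen's block algorithm), a brief Python check:

def multiply_via_squares(a: int, b: int) -> int:
    # a*b = [(a + b)^2 - (a - b)^2] / 4; the division by 4 is a right shift by two bits
    return ((a + b) ** 2 - (a - b) ** 2) >> 2

# sanity check over a small range of integers
assert all(multiply_via_squares(a, b) == a * b
           for a in range(-20, 21) for b in range(-20, 21))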
1.2 Motivating the Space
The product of A in R^{M×D} and B in R^{D×N} features all possible inner products between the row vectors of A and the column vectors of B, each an element of R^D. We investigate whether organizing these two sets of vectors into space-partitioning trees can reduce the complexity of the naïve matrix multiplication by exploiting the distribution of the data.
1.2.1 Space-Partitioning Trees
We can organize a finite collection of points S in Euclidean space R^D into a space-partitioning tree T such that the root node P0 contains all points in S, and for any other node P in T, all points in P are in π(P), the parent node of P. Figure 1 depicts a space-partitioning tree in R^2.
Figure 1: A space-partitioning tree in R^2; the ellipses denote covariances on the node points; the large dots denote node centroids.
A space-partitioning tree definition requires a recursive partitioning rule, such as that appearing
in algorithm 1. Organizing S into such a tree generally requires O(D|S| log(D|S|)) time complexity.
Algorithm 1 [L, R] = partition(P, m)
1: If |P| ≤ m, then RETURN [NULL, NULL].
2: Pick the dimension k that maximizes the range of xk for x ∈ P.
3: Sort the points in P according to dimension k.
4: Split P into L and R using the median (or mean) of xk.
5: RETURN [L, R].
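A minimal NumPy sketch of this partitioning rule follows; the node P is represented as an n × D array of points, and the function mirrors the pseudocode above (an illustrative reading, not a reference implementation).

import numpy as np

def partition(P, m=1):
    # Leaf check: nodes with at most m points are not split further.
    if len(P) <= m:
        return None, None
    # Dimension k with the largest coordinate range over the points in P.
    k = np.argmax(P.max(axis=0) - P.min(axis=0))
    # Sort the points by dimension k and split at the median.
    P = P[np.argsort(P[:, k])]
    half = len(P) // 2
    return P[:half], P[half:]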
1.2.2 Dual Tree Algorithm
Given a reference tree R and a query tree Q of data points, we can perform pairwise operations
such as kernel summations and inner products across across nodes rather than points, performing
a depth-first search on both trees. The algorithm leverages a pruning criterion to guarantee level
approximation in the outputs. Algorithm 2 exhibits this approach.
Algorithm 2 dualTreeCompareNodes(R, Q, operation op, pruning rule R, ε, C)
1: If R and Q are leaf nodes, then perform the point-wise operation, filling in appropriate entries
of C. RETURN
2: If rule R(R, Q) is true, approximate op between points in the nodes using their centroids, filling
in appropriate entries of C; then RETURN.
3: Call
• dualTreeCompareNodes(R.left, Q.left)
• dualTreeCompareNodes(R.left, Q.right)
• dualTreeCompareNodes(R.right, Q.left)
• dualTreeCompareNodes(R.right, Q.right)
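A Python sketch of this traversal appears below. The node attributes (left, right, points, indices, centroid) are assumptions about how the trees might be stored, and, anticipating algorithm 3, the sketch multiplies its results into C, which is presumed to already hold the magnitude products.

def dualTreeCompareNodes(R, Q, op, rule, eps, C):
    # Base case: two leaves, so perform the exact point-wise operation.
    if R.left is None and Q.left is None:
        for i, r in zip(R.indices, R.points):
            for j, q in zip(Q.indices, Q.points):
                C[i, j] *= op(r, q)
        return
    # Prune: approximate op over all pairs in R x Q by the centroid-centroid value.
    if rule(R, Q, eps):
        approx = op(R.centroid, Q.centroid)
        for i in R.indices:
            for j in Q.indices:
                C[i, j] *= approx
        return
    # Otherwise recurse on child pairs (a leaf recurses against the other node's children).
    R_children = (R,) if R.left is None else (R.left, R.right)
    Q_children = (Q,) if Q.left is None else (Q.left, Q.right)
    for Rc in R_children:
        for Qc in Q_children:
            dualTreeCompareNodes(Rc, Qc, op, rule, eps, C)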
1.2.3 Space-Partitioning Trees in the Literature
Applied statistical methods such as dual tree approximate kernel summations [3], [4], [5] and other pairwise statistical problems [7] partition the query and test samples into respective space-partitioning trees for efficient look-ups. Using cover trees, Ram et al. [7] demonstrate linear time complexity for naïve O(N^2) pairwise algorithms.
2 Dual Tree Investigation
2.1 Product Matrix Entries
Given the two matrices A ∈ R^{M×D} and B ∈ R^{D×N}, we can think of the entries of C = AB as cij = |ai||bj| cos θij, where ai is the ith row of A, bj is the jth column of B, and θij is the angle between ai and bj. We can compute the magnitudes of these vectors in time O(D(M + N)) and all products of the magnitudes in time O(MN), for a total time complexity of O(MN + D(M + N)). Thus, computing the cosines of the angles for M, N ∈ O(D) is the O(MDN) bottleneck. We give narrow conditions under which we can reduce this complexity.
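For concreteness, the decomposition can be written directly in NumPy (a sketch; nonzero rows and columns are assumed), which makes the O(D(M + N)) magnitude step, the O(MN) product step, and the O(MDN) cosine bottleneck explicit:

import numpy as np

def product_via_angles(A, B):
    # |a_i| and |b_j|: O(D(M + N))
    row_norms = np.linalg.norm(A, axis=1)
    col_norms = np.linalg.norm(B, axis=0)
    # Normalized rows of A and columns of B (assumes no zero rows/columns).
    U = A / row_norms[:, None]
    V = B / col_norms[None, :]
    # cos(theta_ij) for all pairs: the O(MDN) bottleneck.
    cosines = U @ V
    # c_ij = |a_i| |b_j| cos(theta_ij): O(MN)
    return np.outer(row_norms, col_norms) * cosines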
2.2 Algorithm
In our investigation, we normalize the row vectors of A and the column vectors of B, then organize each set into a ball tree, a space-partitioning tree such that each node is a D-ball. To compute the cosines of the angles between all pairs, we apply the dual tree algorithm. The pruning rule must guarantee that the error of our estimate ĉij relative to the full magnitude product |ai||bj| be no more than ε, or, more formally,

|ĉij − cij| ≤ ε|ai||bj|. (1)

Thus, we require

|cos θ̂ij − cos θij| ≤ ε, (2)

where θ̂ij denotes our estimate of the angle. The pruning rule guaranteeing the above error bound appears in algorithm 3.
Algorithm 3 dualTreeMatrixMultiplication(A, B, ε)
1: Allocate M × N matrix C.
2: Compute the magnitudes of ai and bj for i = 1, . . . , M, j = 1, . . . , N.
3: Fill in C so that cij = |ai||bj|.
4: Compute ui = ai/|ai|, vj = bj/|bj|.
5: Allocate trees U and V with root(U) = {ui} and root(V) = {vj}.
6: Call partition(root(U), size), partition(root(V), size), with size the minimum number of points (defaulted to one) per tree node.
7: Let op(s, t) = ⟨s, t⟩.
8: For node balls R ∈ U, Q ∈ V, define
• α := angle between the centers of R, Q,
• β := angle subtending half of the node ball R, and
• γ := angle subtending half of the node ball Q,
all angles in [0, π].
9: Define the pruning rule R as an evaluation of β + γ ≤ ε/(|sin α| + |cos α|).
10: Call dualTreeCompareNodes(root(U), root(V), op, R, ε, C).
11: RETURN C.
We could define a more conservative pruning rule of β + γ ≤ ε/√2 ≤ ε/(|sin α| + |cos α|), since |sin α| + |cos α| ≤ √2. For future analyses, we apply the more conservative bound.
Figure 2 exhibits the relationships between angles α, β, and γ.
Figure 2: Angles in algorithm 3.
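In code, the rule of step 9 might look as follows (a sketch; the unit centroids r̄, q̄ and angular radii β, γ are assumed to be stored with each node ball):

import numpy as np

def should_prune(r_bar, q_bar, beta, gamma, eps):
    # alpha: angle between the node centers, clipped for numerical safety.
    alpha = np.arccos(np.clip(np.dot(r_bar, q_bar), -1.0, 1.0))
    # Pruning rule of algorithm 3: beta + gamma <= eps / (|sin alpha| + |cos alpha|).
    return beta + gamma <= eps / (abs(np.sin(alpha)) + abs(np.cos(alpha)))

def should_prune_conservative(beta, gamma, eps):
    # The more conservative rule, independent of alpha: beta + gamma <= eps / sqrt(2).
    return beta + gamma <= eps / np.sqrt(2)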
2.2.1 Proof of the Pruning Rule
Simply put, the pruning rule in algorithm 3 bounds the largest possible error on the cosine function
in terms of the center-to-center angle (our approximation) and the angles subtending the balls R
and Q, formally stated in theorem 1.
Theorem 1. Given ball nodes R and Q and angles as defined in algorithm 3, if β + γ ≤ ε/(|sin α| + |cos α|), then |r · q − cos α| ≤ ε for all r ∈ R, q ∈ Q.
To prove theorem 1, we need the following lemma.
Lemma 2. Given both the ball nodes R and Q and angles listed in theorem 1, let error(r, q) = |r · q − cos α|. The maximum of error occurs when r and q are in the span of the two centers of R and Q. Furthermore, the maxima of error are |cos(α ± β ± γ) − cos α|.
Proof. Let r̄ and q̄ be the centers of R and Q, respectively. Since cos θ is monotone for θ ∈ [0, π], the extrema of the error function occur when r and q fall on the surfaces of R and Q, respectively. Furthermore, we only care about the extrema of r · q, since the maxima and minima of this function bound the error about cos α. Thus, we optimize r · q subject to r̄ · q̄ = cos α, r̄ · r = cos β, q̄ · q = cos γ, and r · r = q · q = r̄ · r̄ = q̄ · q̄ = 1.
Leveraging Lagrange multipliers, we obtain the solutions

r = r̄ [cos β ∓ cot α sin β] ± q̄ (sin β / sin α)

and

q = ± r̄ (sin γ / sin α) + q̄ [cos γ ∓ cot α sin γ],

with

r · q = cos α cos β cos γ ± cos α sin β sin γ ± sin α cos β sin γ ± sin α sin β cos γ = cos(α ± β ± γ),

the signs determined by the orientations of r and q. Notice, the possible values of r and q maximizing the error are simply the edges of the cones subtending balls R and Q in the hyperplane spanned by r̄ and q̄. Now, we prove theorem 1.
Proof. By hypothesis, (β + γ)[|sin α| + |cos α|] ≤ ε. Since β + γ ≥ |± β ± γ|, |sin h| ≤ |h|, and |1 − cos h| ≤ |h| for β, γ ∈ [0, π] and h ∈ [−π, π], we have

ε ≥ |± β ± γ| |sin α| + |± β ± γ| |cos α| ≥ |sin α| |sin(± β ± γ)| + |cos α| |1 − cos(± β ± γ)|,

and so, by the triangle inequality,

ε ≥ |cos α cos(± β ± γ) − sin α sin(± β ± γ) − cos α| = |cos(α ± β ± γ) − cos α|.

Combined with lemma 2, this bounds |r · q − cos α| by ε for every r ∈ R and q ∈ Q.
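A brute-force numerical check of theorem 1 over random angle triples (purely illustrative, not part of the argument):

import numpy as np

rng = np.random.default_rng(0)
eps = 0.1
for _ in range(10000):
    alpha = rng.uniform(0.0, np.pi)
    budget = eps / (abs(np.sin(alpha)) + abs(np.cos(alpha)))
    # Draw beta, gamma satisfying the pruning hypothesis beta + gamma <= budget.
    beta = rng.uniform(0.0, budget)
    gamma = rng.uniform(0.0, budget - beta)
    # Worst-case error from lemma 2: |cos(alpha +/- beta +/- gamma) - cos(alpha)|.
    worst = max(abs(np.cos(alpha + s1 * beta + s2 * gamma) - np.cos(alpha))
                for s1 in (-1.0, 1.0) for s2 in (-1.0, 1.0))
    assert worst <= eps + 1e-12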
2.2.2 Analysis of Algorithm 3
Given matrices A in R^{M×D} and B in R^{D×N}, computing the magnitudes, normalizing the rows of A and columns of B, and computing magnitude products for C requires O(D(M + N) + MN). Organizing the normalized points into space-partitioning trees requires O(MD log MD + ND log ND). Finally, an analysis of the dual tree algorithm requires conditions on the data points. We suppose that given the approximation constant ε, the number of points falling in node balls of appropriate size, say radius roughly arcsin(ε/√2), is bounded below by fD(ε). If the points are clustered into such balls, each prune saves the computation of at least D[fD(ε)]^2 floating point operations. So we can fill in [fD(ε)]^2 entries of C with a constant number of inner products at cost O(D), for a total complexity of O(MDN/[fD(ε)]^2). Thus, we have the following theorem.
Theorem 3. The total time complexity of algorithm 3 is O(MD log MD + ND log ND + MN(1 + D/[fD(ε)]^2)).
2.3 Gaping Caveat
An obvious caveat in the analysis is the behavior of fD(ε) as D increases without bound. For a rough sketch of the expected behavior of fD, recall that the volume and surface area of a D-ball of radius r are
VD(r) = [2^D π^{(D−1)/2} Γ((D+1)/2) / (D Γ(D))] r^D (3)

and

SAD(r) = VD'(r) = [2^D π^{(D−1)/2} Γ((D+1)/2) / Γ(D)] r^{D−1}. (4)
Since uniformly distributed data represents something of a worst-case scenario with respect to
clustering algorithms, we explore the expected cluster sizes by dividing the surface of the unit
D-ball by the node balls of appropriate size.
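As a quick sanity check, equation (3) agrees with the more familiar closed form π^{D/2} r^D / Γ(D/2 + 1); the sketch below (assuming scipy is available) verifies this numerically for small D:

import numpy as np
from scipy.special import gamma

def ball_volume(D, r):
    # Equation (3): V_D(r) = 2^D * pi^((D-1)/2) * Gamma((D+1)/2) / (D * Gamma(D)) * r^D
    return 2.0**D * np.pi**((D - 1) / 2) * gamma((D + 1) / 2) / (D * gamma(D)) * r**D

for D in range(1, 16):
    assert np.isclose(ball_volume(D, 1.0), np.pi**(D / 2) / gamma(D / 2 + 1))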
Theorem 4. Assuming that the normalized rows of A and columns of B are uniformly distributed about the unit D-ball, let W be the number of points in each ball of radius arcsin(ε/√2). Then

E[W] ≈ MN VD−1(arcsin(ε/√2)) / SAD(1) = [MN Γ(D/2) / (2√π Γ((D+1)/2))] arcsin^{D−1}(ε/√2).
Thus, since Γ(x + 1/2)/Γ(x) ∈ O(x), exponentially few points fall into each ball of radius arcsin(ε/√2). We therefore require strong clustering conditions, stated formally below, if the dual tree approach described in algorithm 3 is to defeat naïve matrix multiplication.
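The decay is easy to tabulate; the sketch below evaluates the per-ball fraction E[W]/(MN) from theorem 4 for a fixed ε (again assuming scipy):

import numpy as np
from scipy.special import gamma

def expected_fraction(D, eps):
    # E[W] / (MN) from theorem 4:
    # Gamma(D/2) * arcsin(eps/sqrt(2))^(D-1) / (2 * sqrt(pi) * Gamma((D+1)/2))
    rho = np.arcsin(eps / np.sqrt(2))
    return gamma(D / 2) * rho**(D - 1) / (2.0 * np.sqrt(np.pi) * gamma((D + 1) / 2))

for D in (2, 8, 32, 128):
    # Decays roughly like arcsin(eps/sqrt(2))^D, i.e. exponentially fast in D.
    print(D, expected_fraction(D, eps=0.5))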
Theorem 5. If M = D = N and the normalized rows of A and columns of B form clusters of size Ω(D^τ) for τ > 0, where cluster radii are approximately arcsin(ε/√2) for √2 > ε > 0, then algorithm 3 runs in time O(D^2 log D + D^{3−2τ}).
3 Concluding Remarks and Future Work
Given the problem of multiplying together matrices A ∈ R^{M×D} and B ∈ R^{D×N}, we present a
dual tree algorithm effective if row vectors of the left matrix and column vectors of the right
matrix fall into clusters of size proportionate to some positive power τ of the dimension D of said
vectors. Unfortunately, worst-case uniformly distributed vectors give exponentially small cluster
sizes. Possible improvements include partitioning columns of A and rows of B so that the size of
clusters increases slightly while incurring a greater cost in tree construction and the number of
magnitudes to calculate, or appealing to the asymptotic orthogonality of vectors as D becomes
arbitrarily large. Clearly, the approach needs a great deal of work to be of practical interest.
4 References
1. D. Bini, M. Capovani, F. Romani, and G. Lotti. O(n^{2.7799}) complexity for n × n approximate matrix multiplication. Inf. Process. Lett., 8(5):234-235, 1979.
2. D. Coppersmith and S. Winograd. Matrix multiplication via arithmetic progressions. J. Symbolic Computation, 9(3):251-280, 1990.
3. A.G. Gray and A.W. Moore. N-Body Problems in Statistical Learning. In T.K. Leen, T.G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13 (December 2000). MIT Press, 2001.
4. A.G. Gray and A.W. Moore. Rapid Evaluation of Multiple Density Models. In Artificial Intelligence and Statistics 2003, 2003.
5. M.P. Holmes, A.G. Gray, and C.L. Isbell Jr. Fast Kernel Conditional Density Estimation: A Dual Tree Monte Carlo Approach. Computational Statistics and Data Analysis, 1707-1718, 2010.
6. V.Y. Pan. Strassen's algorithm is not optimal. In Proc. FOCS, volume 19, pages 166-176, 1978.
7. P. Ram, D. Lee, W. March, and A.G. Gray. Linear-time Algorithms for Pairwise Statistical Problems. NIPS, 2010.
8. F. Romani. Some properties of disjoint sums of tensors related to matrix multiplication. SIAM J. Comput., pages 263-267, 1982.
9. S. Robinson. Toward an Optimal Algorithm for Matrix Multiplication. SIAM News, 38(9), 2005.
10. A. Schönhage. Partial and total matrix multiplication. SIAM J. Comput., 10(3):434-455, 1981.
11. A. Stothers. Ph.D. Thesis, University of Edinburgh, 2010.
12. V. Strassen. Gaussian Elimination is not Optimal. Numer. Math., 13:354-356, 1969.
13. V.V. Williams. Multiplying matrices faster than Coppersmith-Winograd. STOC, 2012.