Algorithmic Acceleration of Parallel ALS
for Collaborative Filtering:
“Speeding up Distributed Big Data
Recommendation in Spark”
Hans De Sterck1,2, Manda Winlaw2, Mike Hynes2,
Anthony Caterini2
1 Monash University, School of Mathematical Sciences
2 University of Waterloo, Canada, Applied Mathematics
ICPADS 2015, Melbourne, December 2015
a talk on algorithms for parallel big data
analytics ...
1.  distributed computing frameworks for Big Data
analytics – Spark (vs HPC, MPI, Hadoop, ...)
2.  recommendation – the Netflix prize problem
3.  our contribution: an algorithm to speed up ALS
for recommendation
4.  our contribution: efficient parallel speedup of ALS
recommendation in Spark
1. distributed computing frameworks for Big Data
analytics – Spark (vs HPC, MPI, Hadoop, ...)
■ my research background:
– scalable scientific computing algorithms (HPC)
– e.g., parallel algebraic multigrid (AMG) for solving linear systems Ax=b
– e.g., on Blue Gene (100,000s of cores), MPI
distributed computing frameworks for Big Data
analytics – Spark (vs HPC, MPI, Hadoop, ...)
■ more recently: there is a new game of large-scale
distributed computing in town!
– Google PageRank (1998) (already 17 years...)
•  commodity hardware (fault-tolerant ...)
•  compute where the data is (data-locality)
•  scalability is essential! (just like in HPC)
•  beginning of “Big Data”, “Cloud”, “Data Analytics”, ...
– new Big Data analytics applications are now appearing everywhere!
[figure: web crawl]
distributed computing frameworks for Big Data
analytics – Spark (vs HPC, MPI, Hadoop, ...)
■ “Data Analytics” has grown its own “eco-system”,
“culture”, “software stack” (very different from HPC!)
•  MapReduce
•  Hadoop
•  Spark, ...
•  data locality
•  “implicit” communication (restricted (vs MPI), “shuffle”)
•  not fast (vs HPC), but scalable
•  fault-tolerant (replicate data, restart tasks)
(from “Spark: In-Memory Cluster Computing for Iterative and Interactive Applications”)
distributed computing frameworks for Big Data
analytics – Spark (vs HPC, MPI, Hadoop, ...)
■ MapReduce/Hadoop:
– major disadvantage for iterative algorithms: writes everything to disk between iterations! extremely slow (and: not programmer-friendly)
⇒ only very simple algorithms are feasible in MapReduce
■ the Spark “revolution”:
– store state between iterations in memory
– more general operations than Hadoop/MapReduce
distributed computing frameworks for Big Data
analytics – Spark (vs HPC, MPI, Hadoop, ...)
■ the Spark “revolution”:
– store state between iterations in memory
– more general operations than Hadoop/MapReduce
⇒ much faster than Hadoop! (but still much slower than MPI)
•  data locality
•  scalable
•  fault-tolerant
•  “implicit” communication (restricted (vs MPI), “shuffle”)
sea change (vs Hadoop): more advanced iterative algorithms for
Data Analytics/Machine Learning are feasible in Spark
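To make “store state between iterations in memory” concrete, here is a minimal pyspark sketch (an illustration added here, not from the talk; assumes a local Spark installation): the dataset is cached once, and every iteration reuses the in-memory partitions instead of rereading storage between passes, as MapReduce would.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "iterative-sketch")
data = sc.parallelize(range(1000000)).cache()   # state kept in memory

x = 0.0
for it in range(10):                 # skeleton of an iterative algorithm
    x = data.map(lambda v: (v % 7) * 1e-6).reduce(lambda a, b: a + b)
print(x)
```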
2. recommendation – the Netflix prize problem
■ sparse ratings matrix R
■ k latent features: user factors U, movie factors M
■ similar to SVD, but only match known ratings
■ minimize f = ||R – UᵀM||²′ (the ′ marks that the sum runs only over the known ratings), and UᵀM gives the predicted ratings (collaborative filtering)
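Written out (a restatement for clarity; the ′ on the norm restricts the sum to the observed entries, and the regularization term used in practice is omitted):

$$
f(U, M) = \sum_{(i,j)\,:\,R_{ij}\ \mathrm{known}} \bigl( R_{ij} - u_i^{T} m_j \bigr)^2 ,
$$

where u_i is column i of U and m_j is column j of M.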
[figure: sparse ratings matrix R (n users × m movies, known entries such as 1, 2, 5) ≈ Uᵀ (n × k) times M (k × m); row i of Uᵀ and column j of M combine to predict rating (i, j)]
recommendation – the Netflix prize problem
minimize f = ||R – UᵀM||²′ : alternating least squares (ALS)
■ minimize ||R – U(0)ᵀM(0)||²′ : freeze U(0), compute M(0) (LS)
■ minimize ||R – U(1)ᵀM(0)||²′ : freeze M(0), compute U(1) (LS)
■ ... : local least squares problems (parallelizable)
[figure: the same R ≈ UᵀM diagram as on the previous slide]
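A minimal single-machine numpy sketch of these alternating sweeps (added for illustration, not from the talk; assumes a small synthetic R with a Boolean mask of known ratings, and adds a small ridge term lam, which the slide does not mention, to keep each local solve well posed):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 50, 40, 5
R = rng.random((n, m))
mask = rng.random((n, m)) < 0.2            # ~20% of ratings observed
lam = 0.1                                  # small ridge term (assumption)

U = rng.random((k, n))                     # user factors, k x n
M = rng.random((k, m))                     # movie factors, k x m

for sweep in range(20):
    for i in range(n):                     # freeze M, solve per user i
        A = M[:, mask[i]]                  # factors of movies i rated
        U[:, i] = np.linalg.solve(A @ A.T + lam * np.eye(k),
                                  A @ R[i, mask[i]])
    for j in range(m):                     # freeze U, solve per movie j
        A = U[:, mask[:, j]]               # factors of users who rated j
        M[:, j] = np.linalg.solve(A @ A.T + lam * np.eye(k),
                                  A @ R[mask[:, j], j])

print("residual on known ratings:",
      np.linalg.norm(((U.T @ M) - R)[mask]))
```

Each user update (and each movie update) is an independent k × k solve, which is exactly what makes the sweeps parallelizable.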
  
recommendation – the Netflix prize problem
minimize f = ||R – UᵀM||²′ : alternating least squares (ALS)
■ ALS can converge very slowly (block nonlinear Gauss-Seidel)
(g = grad f = 0)
3. our contribution: an algorithm to speed up ALS
for recommendation
min f(U,M) = ||R – UᵀM||²′, or g(U,M) = grad f(U,M) = 0
■ nonlinear conjugate gradient (NCG) optimization
algorithm for min f(x):
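The algorithm listing on this slide is an image in the original deck; for reference, here is a minimal sketch of a standard Polak–Ribière NCG loop (generic f and grad callables, with a simple backtracking line search standing in for the listing's line search):

```python
import numpy as np

def ncg(f, grad, x, iters=100):
    """Minimal Polak-Ribiere (PR+) nonlinear CG sketch."""
    g = grad(x)
    p = -g                                 # start with steepest descent
    for _ in range(iters):
        a = 1.0                            # backtracking line search
        while f(x + a * p) > f(x) + 1e-4 * a * (g @ p):
            a *= 0.5                       # assumes p is a descent direction
        x = x + a * p
        g_new = grad(x)
        beta = max(0.0, g_new @ (g_new - g) / (g @ g))  # PR+ restart
        p = -g_new + beta * p
        g = g_new
    return x

# smoke test on a convex quadratic
A = np.diag([1.0, 10.0, 100.0])
print(ncg(lambda x: 0.5 * x @ A @ x, lambda x: A @ x, np.ones(3)))
```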
our contribution: an algorithm to speed up ALS
for recommendation
min f(x) = ||R – UᵀM||²′, or g(x) = grad f(x) = 0
■ our idea: use ALS as a nonlinear preconditioner for NCG
define a preconditioned gradient direction:
(De Sterck and Winlaw, 2015)
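The defining equation on this slide is an image in the original deck; as a sketch of the construction (the general shape only, with the exact choice of β_k given in the cited paper), write P(x_k) for the result of one ALS sweep started from x_k = (U_k, M_k):

$$
\bar{g}_k = x_k - P(x_k), \qquad p_k = -\bar{g}_k + \beta_k\, p_{k-1}, \qquad x_{k+1} = x_k + \alpha_k p_k ,
$$

so one ALS step supplies the preconditioned direction that replaces the raw gradient g_k in the NCG recurrence.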
  
our contribution: an algorithm to speed up ALS
for recommendation
min f(x) = ||R – UᵀM||²′, or g(x) = grad f(x) = 0
■ our idea: use ALS as a nonlinear preconditioner for NCG
(NCG accelerates ALS)
our contribution: an algorithm to speed up ALS
for recommendation
min f(x) = ||R – UᵀM||²′, or g(x) = grad f(x) = 0
■ our idea: use ALS as a nonlinear preconditioner for NCG
our contribution: an algorithm to speed up ALS
for recommendation
min f(x) = ||R – UᵀM||²′, or g(x) = grad f(x) = 0
■ our idea: use ALS as a nonlinear preconditioner for NCG
ALS-NCG is much faster than the widely used ALS!
4. our contribution: efficient parallel speedup of
ALS recommendation in Spark
■ Spark “Resilient Distributed Datasets” (RDDs)
– partitioned collection of (key, value) pairs
– can be cached in memory
– built using data flow operators on other RDDs (map, join, group-by-key, reduce-by-key, ...)
– fault-tolerance: rebuild from lineage
– “implicit” communication (shuffling) (≠ MPI)
[figure: an RDD as a partitioned table of key → (value1, value2, ...) rows, spread over partitions 0–3]
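A minimal pyspark sketch of these RDD building blocks (toy data added for illustration, not the paper's code; a fresh local SparkContext for a self-contained example):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-sketch")

# partitioned collection of (key, value) pairs, cached in memory:
# (user, (movie, rating)) triples spread over 4 partitions
ratings = sc.parallelize(
    [(0, (10, 4.0)), (0, (11, 3.0)), (1, (10, 5.0))], 4
).cache()

# new RDDs are built from old ones with data-flow operators
by_user = ratings.groupByKey()                     # triggers a shuffle
counts = ratings.map(lambda kv: (kv[0], 1)) \
                .reduceByKey(lambda a, b: a + b)

print(counts.collect())   # e.g. [(0, 2), (1, 1)]
# fault tolerance: a lost partition is rebuilt from this lineage graph
```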
  
our contribution: efficient parallel speedup of ALS
recommendation in Spark
■ efficient Spark programming: similar challenges to efficient GPU programming with CUDA!
– of course, they have different design objectives (GPU: close to the metal, as fast as possible; Spark: scalable, fault-tolerant, data locality...)
– but ... similarities in how one gets good performance:
•  Spark, CUDA: it is easy to write code that produces the correct result (but may be very far from achievable speed)
•  Spark, CUDA: it is very hard to write efficient code!
–  implementation choices that are crucial for performance are most often not explicit in the language (see the sketch after this list)
–  programmer needs very extensive “under the hood” knowledge to write efficient code
–  this is a research topic (also for Spark), moving target
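One concrete instance of such an implicit choice (a standard pyspark illustration, reusing sc and the toy ratings RDD from the RDD sketch above): both lines compute per-user rating counts and give the same result, but they shuffle very different volumes of data.

```python
# groupByKey ships every rating record across the network, then counts;
# reduceByKey pre-aggregates inside each partition before the shuffle.
slow = ratings.groupByKey().mapValues(len)
fast = ratings.map(lambda kv: (kv[0], 1)).reduceByKey(lambda a, b: a + b)
```

Nothing in the results hints at the difference; the programmer has to know what each operator does “under the hood”.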
  
our contribution: efficient parallel speedup of ALS
recommendation in Spark
■ existing implementation of ALS in Spark (Chris Johnson, Spotify): minimize f = ||R – UᵀM||²′
– store both R and Rᵀ
– local LS problems: to update user factor i, need all movie factors j that i has rated (shuffle!) (efficient)
[figure: R partitioned by user block 0–3 and Rᵀ by movie block 0–3; to update user factor i in U, the movie factors j1, j2 that i has rated are shuffled from M's partitions to i's partition]
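A hedged pyspark sketch of that shuffle pattern (illustration only, not Spotify's code; reuses sc and the toy ratings RDD from the earlier sketch, and invents a tiny movie_factors RDD): ratings are re-keyed by movie, joined with the movie factors, and regrouped by user, so each user's local least-squares problem arrives with every movie factor it needs.

```python
import numpy as np

k = 5
movie_factors = sc.parallelize([(10, np.ones(k)), (11, np.ones(k))])

# (user, (movie, rating)) -> (movie, (user, rating)); the join is the
# shuffle that attaches movie factor m_j to every rating of movie j
per_user = (ratings.map(lambda kv: (kv[1][0], (kv[0], kv[1][1])))
                   .join(movie_factors)
                   .map(lambda kv: (kv[1][0][0], (kv[1][1], kv[1][0][1])))
                   .groupByKey())                  # regroup by user i

def solve_user(rows, lam=0.1):
    # assemble and solve user i's local k x k least-squares problem
    A = np.array([m for m, _ in rows]).T           # k x (#rated movies)
    r = np.array([rating for _, rating in rows])
    return np.linalg.solve(A @ A.T + lam * np.eye(k), A @ r)

new_user_factors = per_user.mapValues(solve_user)
print(new_user_factors.collect())
```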
our contribution: efficient parallel speedup of ALS
recommendation in Spark
■ our work: efficient parallel implementation of ALS-NCG in Spark: minimize f(x) = ||R – UᵀM||²′
– store our vectors x and g consistent with the ALS RDDs, and employ a similar efficient shuffling scheme for the gradient
– BLAS vector operations
– line search: f(x + αp) is a polynomial of degree 4 in the step size α: compute its coefficients once, in parallel (see the sketch below)
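Because f is quadratic in U for fixed M and quadratic in M for fixed U, f(x + αp) along a search direction p is a degree-4 polynomial in α. A minimal numpy sketch of the resulting exact line search (the paper computes the five coefficients directly and in parallel; fitting them from five samples, as here, is just the simplest way to show the idea):

```python
import numpy as np

def quartic_line_search(f, x, p):
    # f(x + a*p) is a degree-4 polynomial in a, so five samples
    # determine it exactly; then minimize the quartic in closed form
    a_pts = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
    coeffs = np.polyfit(a_pts, [f(x + a * p) for a in a_pts], 4)
    crit = np.roots(np.polyder(coeffs))      # roots of the cubic f'
    crit = crit[np.isreal(crit)].real
    cand = np.concatenate([crit, [0.0]])     # fall back to a = 0
    return cand[np.argmin(np.polyval(coeffs, cand))]
```

The coefficients are computed once, so the exact step size comes essentially for free compared with repeated distributed evaluations of f in a generic line search.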
  
our contribution: efficient parallel speedup of ALS
recommendation in Spark
■ performance: linear granularity scaling for ALS-NCG as for ALS
(no new parallel bottlenecks for the more advanced algorithm)
our contribution: efficient parallel speedup of ALS
recommendation in Spark
■ performance: ALS-NCG much faster than ALS (20M MovieLens
data, 8 nodes/128 cores)
our contribution: efficient parallel speedup of ALS
recommendation in Spark
■ performance: ALS-NCG speeds up ALS on 16 nodes/256 cores
in Spark for 800M ratings by a factor of about 5
(great speedup, in parallel, in Spark, for a large problem on 256 cores)
some general conclusions ...
■ Spark enables advanced algorithms for Big Data analytics
(linear algebra, optimization, machine learning, ...) (lots of
work: investigate algorithms, implementations, scalability, ...
in Spark)
■ Spark offers a suitable environment for compute-intensive
work!
■ slower than MPI/HPC, but data locality, fault-tolerance,
situated within Big Data “eco-system” (HDFS data, familiar
software stack, ...)
■ will HPC and Big Data hardware/software converge? (also
for “exascale” ...), and if so, which aspects of the Spark
(and others ...) or MPI/HPC approaches will prevail?
