R user group 2011 09

8/9/2013 © MapR Confidential 1
R, Hadoop and MapR

The bad old days (i.e. now)
• Hadoop is a silo
• HDFS isn’t a normal file system
• Hadoop doesn’t really like C++
• R is limited
• One machine, one memory space
• Isn’t there any way we can just get along?

The white knight
• MapR changes things
• Lots of new stuff like snapshots, NFS
• All you need to know, you already know
• NFS provides cluster wide file access
• Everything works the way you expect
• Performance high enough to use as a message bus

Example, out-of-core SVD
• SVD provides compressed matrix form
• Based on sum of rank-1 matrices
$$A = \sigma_1 u_1 v_1' + \sigma_2 u_2 v_2' + e$$
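The rank-1 sum can be checked numerically. The following is a minimal sketch on a hypothetical random test matrix; the helper `rank_k` is an illustrative name, not part of the talk.

```r
set.seed(1)
a <- matrix(rnorm(20 * 8), nrow = 20)   # hypothetical 20 x 8 test matrix
s <- svd(a)

# Sum of the first k rank-1 terms sigma_i * u_i * v_i'
rank_k <- function(s, k) {
  approx <- matrix(0, nrow(s$u), nrow(s$v))
  for (i in seq_len(k)) {
    approx <- approx + s$d[i] * s$u[, i] %o% s$v[, i]
  }
  approx
}

a2   <- rank_k(s, 2)               # two-term approximation
e    <- a - a2                     # residual e
full <- rank_k(s, length(s$d))     # all terms together recover A
```

The squared Frobenius norm of `e` should match the sum of the squared singular values that were left out.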

More on SVD
• SVD provides a very nice basis
$$Ax = A \sum_i a_i v_i = \left[ \sum_j \sigma_j u_j v_j' \right] \left[ \sum_i a_i v_i \right] = \sum_i a_i \sigma_i u_i$$

• And a nifty approximation property
$$Ax = \sigma_1 a_1 u_1 + \sigma_2 a_2 u_2 + \underbrace{\sum_{i>2} \sigma_i a_i u_i}_{e}$$
$$\|e\|^2 \le \sum_{i>2} \sigma_i^2$$
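Both properties can be tried out in a few lines. This sketch assumes a unit-length `x`, so that every coefficient has magnitude at most 1, which is what makes the bound hold; the coefficients are called `alpha` here only to avoid clashing with the matrix `a`.

```r
set.seed(2)
a <- matrix(rnorm(30 * 6), nrow = 30)   # hypothetical test matrix
s <- svd(a)

x <- rnorm(6)
x <- x / sqrt(sum(x^2))                 # unit vector, so every |alpha_i| <= 1
alpha <- as.vector(t(s$v) %*% x)        # coordinates of x in the basis v_i

ax_direct <- as.vector(a %*% x)
ax_basis  <- as.vector(s$u %*% (s$d * alpha))   # sum_i alpha_i sigma_i u_i

# Keep two terms; what is left over is the error e
ax_2  <- as.vector(s$u[, 1:2] %*% (s$d[1:2] * alpha[1:2]))
err2  <- sum((ax_direct - ax_2)^2)      # squared norm of e
bound <- sum(s$d[-(1:2)]^2)             # sum of the dropped sigma_i^2
```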

Also known as …
• Latent Semantic Indexing
• PCA
• Eigenvectors

An application, approximate translation
• Translation distributes over concatenation
• But counting turns concatenation into addition
• This means that translation is linear!
T(s1 | s2 )=T(s1)| T(s2 )
k(s1 | s2 )= k(s1) + k(s2 )
k(T(s1 | s2 )) = k(T(s1)) + k(T(s2 ))

ish
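The counting identity is easy to see on a toy example. In this sketch the vocabulary, the two sentences and the helper `k` are all hypothetical, and a sentence is simply a vector of words.

```r
# k(): count vector over a fixed vocabulary (hypothetical helper)
vocab <- c("the", "cat", "dog", "sat", "ran")
k <- function(words) as.vector(table(factor(words, levels = vocab)))

s1 <- c("the", "cat", "sat")
s2 <- c("the", "dog", "ran")

k_concat <- k(c(s1, s2))       # count after concatenating the sentences
k_sum    <- k(s1) + k(s2)      # add the separate counts
```

Both vectors come out the same, which is the sense in which counting turns concatenation into addition.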

Traditional computation
• Products of A are dominated by the large singular values and the corresponding vectors
• Subtracting these dominant singular values allows the next ones to appear
• Lanczos method, or more generally Krylov subspace methods
$$A (A'A)^n = U \Sigma^{2n+1} V'$$
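The idea of repeated products followed by subtraction can be sketched with plain power iteration. This is a simplified stand-in for Lanczos, not the Lanczos method itself; the test matrix with known singular values and the helper `top_singular` are assumptions made for illustration.

```r
set.seed(3)
# Hypothetical 40 x 10 test matrix with known, well separated singular values
u0 <- qr.Q(qr(matrix(rnorm(40 * 5), 40)))
v0 <- qr.Q(qr(matrix(rnorm(10 * 5), 10)))
a  <- u0 %*% diag(c(10, 5, 2, 1, 0.5)) %*% t(v0)

# Repeated products with A'A pull out the largest singular triple
top_singular <- function(a, iters = 100) {
  v <- rnorm(ncol(a))
  for (i in seq_len(iters)) {
    v <- as.vector(crossprod(a, a %*% v))   # one product with A'A
    v <- v / sqrt(sum(v^2))                 # renormalize
  }
  av    <- as.vector(a %*% v)
  sigma <- sqrt(sum(av^2))
  list(d = sigma, u = av / sigma, v = v)
}

first    <- top_singular(a)
deflated <- a - first$d * first$u %o% first$v   # subtract the dominant term
second   <- top_singular(deflated)              # the next one now appears
```

Every pass of the loop touches all of `a`, which is why this style of computation needs many sequential sweeps over the data.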

But …

The gotcha
• Iteration in Hadoop is death
• Huge process invocation costs
• Lose all memory residency of data
• Total lost cause

Randomness to the rescue
• To save the day, run all iterations at the same time
$$Y = A \Omega$$
$$QR = Y$$
$$B = Q'A$$
$$U \Sigma V' = B$$
$$(QU) \Sigma V' \approx A$$

In R
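lsa = function(a, k, p) {
  # a: n x m matrix, k: rank wanted, p: extra random columns
  m = dim(a)[2]
  # Y = A * Omega, with Omega an m x (k+p) Gaussian random matrix
  y = a %*% matrix(rnorm(m*(k+p)), nrow=m)
  y.qr = qr(y)
  # B = Q'A is only (k+p) x m
  b = t(qr.Q(y.qr)) %*% a
  # QR of B' so that the SVD is taken of a (k+p) x (k+p) matrix
  b.qr = qr(t(b))
  # named s to avoid shadowing the svd() function
  s = svd(t(qr.R(b.qr)))
  list(u=qr.Q(y.qr) %*% s$u[,1:k],
       d=s$d[1:k],
       v=qr.Q(b.qr) %*% s$v[,1:k])
}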

Not good enough yet
• Limited to memory size
• After memory limits, feature extraction dominates

Hybrid architecture
[Diagram. Boxes: Input; Side-data; Feature extraction and down sampling; Data join; Sequential SVD. Labels: Map-reduce; Via NFS]

Hybrid architecture
[Diagram. Boxes: Input; Side-data; Feature extraction and down sampling; Data join; Sequential SVD; R Visualization. Labels: Map-reduce; Via NFS]

Randomness to the rescue
• To save the day again, use blocks
$$Y_i = A_i \Omega$$
$$R'R = Y'Y = \sum_i Y_i' Y_i$$
$$B_j = \sum_i (A_i \Omega R^{-1})' A_{ij}$$
$$LL' = BB'$$
$$U \Sigma V' = L$$
$$(A \Omega R^{-1} U) \Sigma (V' L^{-1} B) \approx A$$
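The block-wise steps can be prototyped in memory before moving them to map-reduce. The sketch below is one possible reading of the equations, assuming row blocks only (no column blocking of B) and a hypothetical test matrix with three large singular values and a small tail; the block list `rows` stands in for the pieces that separate map tasks would read.

```r
set.seed(4)
n <- 60; m <- 12; k <- 3; p <- 5

# Hypothetical test matrix: three large singular values, small tail
u0 <- qr.Q(qr(matrix(rnorm(n * m), n)))
v0 <- qr.Q(qr(matrix(rnorm(m * m), m)))
a  <- u0 %*% diag(c(9, 4, 1, rep(1e-3, m - 3))) %*% t(v0)

# Row blocks A_i and the shared random matrix Omega
rows  <- split(seq_len(n), rep(1:3, each = n / 3))
omega <- matrix(rnorm(m * (k + p)), nrow = m)

# Pass 1: accumulate Y'Y = sum_i Y_i'Y_i, then R'R = Y'Y by Cholesky
yty <- Reduce(`+`, lapply(rows, function(i) crossprod(a[i, ] %*% omega)))
r   <- chol(yty)                         # upper triangular, t(r) %*% r == yty

# Pass 2: B = sum_i (A_i Omega R^-1)' A_i
b <- Reduce(`+`, lapply(rows, function(i) {
  q_i <- a[i, ] %*% omega %*% solve(r)   # row block of Q = Y R^-1
  crossprod(q_i, a[i, ])
}))

# Small in-memory step: LL' = BB', then SVD of L
l <- t(chol(tcrossprod(b)))              # lower triangular factor
s <- svd(l)

# Assemble the factors; u could also be produced one block at a time
u <- a %*% omega %*% solve(r) %*% s$u[, 1:k]
v <- t(solve(l, b)) %*% s$v[, 1:k]       # (V' L^-1 B)' restricted to k columns
d <- s$d[1:k]
```

Only sums of small matrices cross block boundaries, so each pass is a single map-reduce sweep with no iteration. The Cholesky steps assume `yty` and `tcrossprod(b)` are numerically positive definite, which may fail for a matrix of rank below k + p.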

Hybrid architecture
[Diagram. Boxes: Feature extraction and down sampling; Block-wise parallel SVD; R Visualization. Labels: Map-reduce (twice); Via NFS]

Conclusions
• Interoperability allows massive scalability
• Prototyping in R not wasted
• Map-reduce iteration not needed for SVD
• Feasible scale ~10^9 non-zeros or more
