Save time by parallelizing in R

              Maxime Tô

              June 12, 2012
Parallelizing in R



       Here we use the snow package:
       http://www.sfu.ca/~sblay/R/snow.html

This presentation is based on my own practice of R. I do not know
whether it is optimal, but it has saved me a lot of time...
Parallelizing in R
   How does parallel computing work?
      With the snow package, “we open as many R sessions as the
      number of nodes we choose”:
      library(snow)
      cl <- makeCluster(3, type = "SOCK")
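
       A small sketch (my addition, not from the original slides): the number
       of nodes is often chosen from the number of available cores, which can
       be detected with detectCores() from the parallel package, assuming it
       is available:

       ## pick the cluster size from the machine rather than hard-coding 3
       n.nodes <- max(1, parallel::detectCores() - 1)  # leave one core free
       cl <- makeCluster(n.nodes, type = "SOCK")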
Parallelizing in R

       The clusterEvalQ() function allows us to execute R code on all
       sessions:

    clusterEvalQ(cl, ls())
   > clusterEvalQ(cl, 1 + 1)
   [[1]]
   [1] 2
   [[2]]
   [1] 2
   [[3]]
   [1] 2
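
       A typical use (my example, not from the slides) is to load a package
       on every node, so that later parallel calls can rely on it:

       clusterEvalQ(cl, library(MASS))  # attaches MASS in each of the 3 sessions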
Parallelizing in R

       Nodes may be called independently:

   > clusterEvalQ(cl[1], a <- 1)
   > clusterEvalQ(cl[2], a <- 2)
   > clusterEvalQ(cl[3], a <- 3)
   > clusterEvalQ(cl, a)
   [[1]]
   [1] 1

   [[2]]
   [1] 2

   [[3]]
   [1] 3
Parallelizing in R


        The snow package comes with parallelized versions of many usual
        R functions, such as parLapply, parApply, etc., which are not
        always efficient:

    > a <- matrix(rnorm(10000000), ncol = 1000)
    > system.time(apply(a, 1, sum))
    utilisateur     système      écoulé
           0.27        0.02        0.28
    > system.time(parApply(cl, a, 1, sum))
    utilisateur     système      écoulé
           0.67        0.39        1.09
Parallelizing in R




   Using parallel code is not always efficient:
       It always takes some time to serialize and unserialize the data
       If the data set is huge, R may need some time to copy it...
Parallelizing in R

       One solution is to first export data to all nodes and then
       execute the code on each node:

   > #### First Export:
   > columns <- clusterSplit(cl, 1:10000)
   > for (cc in 1:3){
   + aa <- a[columns[[cc]],]
   + clusterExport(cl[cc], "aa")
   + }
   > #### Then execute
   >
    > system.time(do.call("c",
    +     clusterEvalQ(cl, apply(aa, 1, sum))))
    utilisateur     système      écoulé
           0.00        0.00        0.16
Parallelizing in R


   Of course, it is not necessarily optimal to always export the data
   first... but in many cases it can be useful:
       If one has many computations to do on a single dataset
       For any iterative method (see the sketch below):
            Bootstrap
            Iterative estimation: ML, GMM, etc.
       The idea is to first export the data and then execute the code on
       the different nodes
       Exporting the data is the costly step. Combining the results is
       often quite easy (sum, c, cbind, etc.)
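
   A minimal bootstrap sketch following the same pattern (my illustration,
   not part of the original slides): export the data once, then let each
   node draw its own share of bootstrap replicates.

   x <- rnorm(100000)                    # some data set
   clusterExport(cl, "x")                # the costly step, done only once
   B.per.node <- 500                     # 3 nodes x 500 = 1500 replicates
   clusterExport(cl, "B.per.node")
   boot.means <- do.call("c", clusterEvalQ(cl,
       replicate(B.per.node, mean(sample(x, replace = TRUE)))))
   ## In practice the nodes should also get independent RNG streams,
   ## e.g. with snow's clusterSetupRNG() (which needs the rlecuyer package).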
A simple problem




      We want to estimate a probit model
      ML estimation is iterative: you need the partial derivatives for
      the gradient and the Hessian matrix
      thus you need to evaluate the objective function many, many
      times to obtain the numerical derivatives (see the sketch below)
      Reducing the time of one evaluation therefore reduces the whole
      estimation time a lot...
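
   Sketch (my addition): with a forward-difference approximation, one
   gradient evaluation already costs k + 1 calls to the objective function,
   where k is the number of parameters, hence the interest of making each
   call fast.

   num.grad <- function(f, para, eps = 1e-6) {
     f0 <- f(para)                          # one call at the current point
     sapply(seq_along(para), function(j) {  # plus one call per parameter
       pj <- para
       pj[j] <- pj[j] + eps
       (f(pj) - f0) / eps
     })
   }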
The probit model



   The model is given by:

                    $$Y^{*} = X\beta + \varepsilon, \qquad Y = \mathbf{1}\{Y^{*} > 0\}$$

   The individual contribution to the likelihood is then:

                    $$L = \Phi(X\beta)^{Y}\,\Phi(-X\beta)^{1-Y}$$
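
   Taking logs (my addition, to connect the formula with the probit()
   function coded below), the sample log-likelihood is:

                    $$\log L(\beta) = \sum_{i=1}^{n} \big[\, Y_i \log \Phi(X_i\beta)
                                      + (1 - Y_i) \log \Phi(-X_i\beta) \,\big]$$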
A very simple problem

    > n       <- 5000000
    > param   <- c(1, 2, -.5)
    > X1      <- rnorm(n)
    > X2      <- rnorm(n, mean = 1, sd = 2)
    > Ys      <- param[1] + param[2] * X1 +
    +   param[3] * X2 + rnorm(n)
    > Y <- Ys > 0
    > probit <- function(para, y, x1, x2){
    +   mu <- para[1] + para[2] * x1 + para[3] * x2
    +   sum(pnorm(mu, log.p = TRUE) * y + pnorm(-mu, log.p = TRUE) * (1 - y))
    + }
    >
    > system.time(test1 <- probit(param, Y, X1, X2))
    utilisateur     système      écoulé
           1.72        0.08        1.80
Make a parallel version



   We build a parallel version of our program by following these
   steps:
    1. Create the cluster
    2. Divide the data over the nodes
    3. Write the likelihood
    4. Evaluate the likelihood on each node
    5. Collect the results
Divide data:
> nn <- clusterSplit(cl, 1:n)
> for (cc in 1:3){
+ YY <- Y[nn[[cc]]]
+ XX1 <- X1[nn[[cc]]]
+ XX2 <- X2[nn[[cc]]]
+ clusterExport(cl[cc], c("YY", "XX1", "XX2"))
+ }
> clusterExport(cl, "probit")
> clusterEvalQ(cl, ls())
[[1]]
[1] "probit" "XX1"    "XX2"    "YY"

[[2]]
[1] "probit" "XX1"   "XX2"    "YY"

[[3]]
[1] "probit" "XX1"   "XX2"    "YY"
Write a new version of the likelihood:
>   gets <- function(n, v) {
+     ## assign the value v to the name n in the node's global environment
+     assign(n, v, envir = .GlobalEnv); NULL
+   }
>   lik <- function(para){
+     ## send the current parameter value to every node...
+     clusterCall(cl, gets, "para", get("para"))
+     ## ...then sum the per-node log-likelihood contributions
+     do.call("sum",
+         clusterEvalQ(cl, probit(para, YY, XX1, XX2)))
+   }
Execute and compare the results:
> system.time(test2 <- lik(param)) ## 1.5 sec
utilisateur     système      écoulé
       0.00        0.00        0.78
> c(test1, test2) ## Same results
[1] -1432674 -1432674
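
A sketch (my addition, not in the original slides): the parallel
log-likelihood lik() can then be plugged into a standard optimizer for the
ML estimation itself; fnscale = -1 tells optim() to maximize, and the
starting values below are arbitrary.

start <- c(0, 0, 0)
fit <- optim(start, lik, method = "BFGS",
             control = list(fnscale = -1), hessian = TRUE)
fit$par   # should be close to param = c(1, 2, -0.5)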
Conclusion




      By using parallel versions of R code, one may save a lot of time...
      A careless use of these packages may also be costly...
      Of course, for the probit problem, just use glm() (see below)...
      Don’t forget to close the nodes:
      > stopCluster(cl)
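
      For reference (my addition): the standard, non-parallel way to fit
      the same probit model with base R, using the simulated Y, X1, X2
      from above:

      fit.glm <- glm(Y ~ X1 + X2, family = binomial(link = "probit"))
      coef(fit.glm)   # intercept and slope estimates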
