Save time by parallelizing in R

              Maxime Tô

              June 12, 2012
Parallelizing in R



       Here we use the snow package:
       http://www.sfu.ca/~sblay/R/snow.html

This presentation is based on my own practice of R. I do not know
whether it is optimal, but it has saved me a lot of time...
Parallelizing in R
   How does parallel computing work?
      With the snow package, “we open as many R sessions as the
      number of nodes we choose”:
      library(snow)
      cl <- makeCluster(3, type = "SOCK")
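
       A small sketch (my addition, not from the original slides): the number
       of nodes is often chosen from the number of available cores, which can
       be detected with detectCores() from the parallel package, assuming it
       is available:

       ## pick the cluster size from the machine rather than hard-coding 3
       n.nodes <- max(1, parallel::detectCores() - 1)  # leave one core free
       cl <- makeCluster(n.nodes, type = "SOCK")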
Parallelizing in R

       The clusterEvalQ() function allows us to execute R code on all
       sessions:

    clusterEvalQ(cl, ls())
   > clusterEvalQ(cl, 1 + 1)
   [[1]]
   [1] 2
   [[2]]
   [1] 2
   [[3]]
   [1] 2
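
       A typical use (my example, not from the slides) is to load a package
       on every node, so that later parallel calls can rely on it:

       clusterEvalQ(cl, library(MASS))  # attaches MASS in each of the 3 sessions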
Parallelizing in R

       Nodes may be called independently:

   > clusterEvalQ(cl[1], a <- 1)
   > clusterEvalQ(cl[2], a <- 2)
   > clusterEvalQ(cl[3], a <- 3)
   > clusterEvalQ(cl, a)
   [[1]]
   [1] 1

   [[2]]
   [1] 2

   [[3]]
   [1] 3
Parallelizing in R


        The snow package comes with parallelized versions of many usual
        R functions, such as parLapply, parApply, etc., which are not
        always efficient:

    > a <- matrix(rnorm(10000000), ncol = 1000)
    > system.time(apply(a, 1, sum))
    utilisateur     système      écoulé
           0.27        0.02        0.28
    > system.time(parApply(cl, a, 1, sum))
    utilisateur     système      écoulé
           0.67        0.39        1.09
Parallelizing in R




   Using parallel code is not always efficient:
       It always takes some time to serialize and unserialize the data
       If the data set is huge, R may need some time to copy it...
Parallelizing in R

       One solution is to first export data to all nodes and then
       execute the code on each node:

   > #### First Export:
   > columns <- clusterSplit(cl, 1:10000)
   > for (cc in 1:3){
   + aa <- a[columns[[cc]],]
   + clusterExport(cl[cc], "aa")
   + }
   > #### Then execute
   >
    > system.time(do.call("c",
    +     clusterEvalQ(cl, apply(aa, 1, sum))))
    utilisateur     système      écoulé
           0.00        0.00        0.16
Parallelizing in R


   Of course, it is not necessarily optimal to always export the data
   first... but in many cases it can be useful:
       If one has many computations to do on a single dataset
       For any iterative method (see the sketch below):
            Bootstrap
            Iterative estimation: ML, GMM, etc.
       The idea is to first export the data and then execute the code on
       the different nodes
       Exporting the data is the costly step. Combining the results is
       often quite easy (sum, c, cbind, etc.)
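
   A minimal bootstrap sketch following the same pattern (my illustration,
   not part of the original slides): export the data once, then let each
   node draw its own share of bootstrap replicates.

   x <- rnorm(100000)                    # some data set
   clusterExport(cl, "x")                # the costly step, done only once
   B.per.node <- 500                     # 3 nodes x 500 = 1500 replicates
   clusterExport(cl, "B.per.node")
   boot.means <- do.call("c", clusterEvalQ(cl,
       replicate(B.per.node, mean(sample(x, replace = TRUE)))))
   ## In practice the nodes should also get independent RNG streams,
   ## e.g. with snow's clusterSetupRNG() (which needs the rlecuyer package).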
A simple problem




      We want to estimate a probit model
      ML estimation is iterative: you need the partial derivatives for
      the gradient and the Hessian matrix
      thus you need to evaluate the objective function many, many
      times to obtain the numerical derivatives (see the sketch below)
      Reducing the time of one evaluation therefore reduces the whole
      estimation time a lot...
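
   Sketch (my addition): with a forward-difference approximation, one
   gradient evaluation already costs k + 1 calls to the objective function,
   where k is the number of parameters, hence the interest of making each
   call fast.

   num.grad <- function(f, para, eps = 1e-6) {
     f0 <- f(para)                          # one call at the current point
     sapply(seq_along(para), function(j) {  # plus one call per parameter
       pj <- para
       pj[j] <- pj[j] + eps
       (f(pj) - f0) / eps
     })
   }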
The probit model



   The model is given by:

                    $$Y^{*} = X\beta + \varepsilon, \qquad Y = \mathbf{1}\{Y^{*} > 0\}$$

   The individual contribution to the likelihood is then:

                    $$L = \Phi(X\beta)^{Y}\,\Phi(-X\beta)^{1-Y}$$
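
   Taking logs (my addition, to connect the formula with the probit()
   function coded below), the sample log-likelihood is:

                    $$\log L(\beta) = \sum_{i=1}^{n} \big[\, Y_i \log \Phi(X_i\beta)
                                      + (1 - Y_i) \log \Phi(-X_i\beta) \,\big]$$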
A very simple problem

    > n       <- 5000000
    > param   <- c(1, 2, -.5)
    > X1      <- rnorm(n)
    > X2      <- rnorm(n, mean = 1, sd = 2)
    > Ys      <- param[1] + param[2] * X1 +
    +   param[3] * X2 + rnorm(n)
    > Y <- Ys > 0
    > probit <- function(para, y, x1, x2){
    +   mu <- para[1] + para[2] * x1 + para[3] * x2
    +   sum(pnorm(mu, log.p = TRUE) * y + pnorm(-mu, log.p = TRUE) * (1 - y))
    + }
    >
    > system.time(test1 <- probit(param, Y, X1, X2))
    utilisateur     système      écoulé
           1.72        0.08        1.80
Make a parallel version



   We build a parallel version of our program by following these
   steps:
    1. Create the cluster
    2. Divide the data over the nodes
    3. Write the likelihood
    4. Evaluate the likelihood on each node
    5. Collect the results
Divide data:
> nn <- clusterSplit(cl, 1:n)
> for (cc in 1:3){
+ YY <- Y[nn[[cc]]]
+ XX1 <- X1[nn[[cc]]]
+ XX2 <- X2[nn[[cc]]]
+ clusterExport(cl[cc], c("YY", "XX1", "XX2"))
+ }
> clusterExport(cl, "probit")
> clusterEvalQ(cl, ls())
[[1]]
[1] "probit" "XX1"    "XX2"    "YY"

[[2]]
[1] "probit" "XX1"   "XX2"    "YY"

[[3]]
[1] "probit" "XX1"   "XX2"    "YY"
Write a new version of the likelihood:
>   gets <- function(n, v) {
+     ## assign the value v to the name n in the node's global environment
+     assign(n, v, envir = .GlobalEnv); NULL
+   }
>   lik <- function(para){
+     ## send the current parameter value to every node...
+     clusterCall(cl, gets, "para", get("para"))
+     ## ...then sum the per-node log-likelihood contributions
+     do.call("sum",
+         clusterEvalQ(cl, probit(para, YY, XX1, XX2)))
+   }
Execute and compare the results:
> system.time(test2 <- lik(param)) ## 1.5 sec
utilisateur     système      écoulé
       0.00        0.00        0.78
> c(test1, test2) ## Same results
[1] -1432674 -1432674
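
A sketch (my addition, not in the original slides): the parallel
log-likelihood lik() can then be plugged into a standard optimizer for the
ML estimation itself; fnscale = -1 tells optim() to maximize, and the
starting values below are arbitrary.

start <- c(0, 0, 0)
fit <- optim(start, lik, method = "BFGS",
             control = list(fnscale = -1), hessian = TRUE)
fit$par   # should be close to param = c(1, 2, -0.5)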
Conclusion




      By using parallel versions of R code, one may save a lot of time...
      A careless use of these packages may also be costly...
      Of course, for the probit problem, just use glm() (see below)...
      Don’t forget to close the nodes:
      > stopCluster(cl)
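
      For reference (my addition): the standard, non-parallel way to fit
      the same probit model with base R, using the simulated Y, X1, X2
      from above:

      fit.glm <- glm(Y ~ X1 + X2, family = binomial(link = "probit"))
      coef(fit.glm)   # intercept and slope estimates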
