Streaming Data,
Concurrency And R

     Rory Winston

   rory@theresearchkitchen.com
About Me




      Independent Software Consultant
      M.Sc. Applied Computing, 2000
      M.Sc. Finance, 2008
      Apache Committer
      Working in the financial sector for the last 7 years or so
      Interested in practical applications of functional languages and
      machine learning
      Relatively recent convert to R ( ≈ 2 years)
R - Pros and Cons


   Pro
         Designed by statisticians
         Can be extremely elegant
         Comprehensive extension library
         Open-source
         Huge parallelization effort
         Fantastic reporting capabilities
         Incredibly Popular

   Con
         Designed by statisticians
         Can be clunky (S4)
         Bewildering array of overlapping extensions
         Inherently single-threaded
         Incredibly Popular
Parallelization vs. Concurrency



        R interpreter is single threaded
        Some historical context for this (BLAS implementations)
        Not necessarily a limitation in the general context
        Multithreading can be complex and problematic
        Instead a focus on parallelization:
             Distributed computation: gridR, nws, snow
             Multicore/multi-cpu scaling: Rmpi, Romp, pnmath/pnmath0
             Interfaces to Pthreads/PBLAS/OpenMP/MPI/Globus/etc.
        Parallelization suits cpu-bound large data processing
        applications
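
A minimal sketch of the parallelization style described above. It uses the `parallel` package, which ships with current R and incorporates the snow API. The two-worker local cluster and the toy function `slow_square` are assumptions chosen purely for illustration.

```r
library(parallel)

# Toy stand-in for a CPU-bound task (hypothetical, for illustration only).
slow_square <- function(i) {
  s <- 0
  for (k in seq_len(10000)) s <- s + sqrt(k)  # burn some CPU
  i^2
}

# Start two worker processes on the local machine.
cl <- makeCluster(2)

# Each element of 1:4 is handed to a worker; results come back in input order.
squares <- unlist(parLapply(cl, 1:4, slow_square))

stopCluster(cl)
print(squares)
```

Each worker is a separate R process, so this scales batch computation but does not let one session react to events while it is busy.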
Other Scalability and Performance Work




        JIT/bytecode compilation (Ra)
        Implicit vectorization a la Matlab (code analysis)
        Large (≥ RAM) dataset handling (bigmemory,ff)
        Many incremental performance improvements (e.g. less
        internal copying)
        Next: GPU/massive multicore...?
What Benefit Concurrency?




       Real-time (streaming to be more precise) data analysis
       Growing interest in using R for streaming data, not just offline
       analysis
       GUI toolkit integration
       Fine-grained control over independent task execution
       "I believe that explicit concurrency management tools (i.e. a
       threads toolkit) are what we really need in R at this point." -
       Luke Tierney, 2001
Will There Be A Multithreaded R?


        Short answer is: probably not
        At least not in its current incarnation
        Internal workings of the interpreter not particularly amenable
        to concurrency:
            Functions can manipulate caller state (<<- vs. <-)
            Lazy evaluation machinery (promises)
            Dynamic State, garbage collection, etc.
            Scoping: global environments
            Management of resources: streams, I/O, connections, sinks
        Implications for current code
        Possibly in the next language evolution (cf. Ihaka?)
        Large amount of work (but potentially do-able)
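
Two of the points above can be seen in a few lines of R. This is a small sketch, with all names (`counter`, `bump_local`, `bump_outer`, `lazy`) invented for illustration: it shows `<<-` mutating state outside the function, and an argument that is only evaluated when it is forced.

```r
counter <- 0

# `<-` creates a new local variable; the outer `counter` is untouched.
bump_local <- function() {
  counter <- counter + 1
  invisible(NULL)
}

# `<<-` assigns into the enclosing environment, mutating shared state.
bump_outer <- function() {
  counter <<- counter + 1
  invisible(NULL)
}

bump_local()
after_local <- counter   # still 0
bump_outer()
after_outer <- counter   # now 1

# Arguments are promises: the expression runs only when the value is needed,
# and it runs in the caller's environment, where it can have side effects.
evaluated <- FALSE
lazy <- function(x) {
  before <- evaluated    # promise not yet evaluated
  force(x)               # evaluation happens here
  c(before = before, after = evaluated)
}
flags <- lazy({ evaluated <- TRUE; 42 })
print(flags)
```

With several threads, both the shared `counter` and the moment at which a promise fires would become race conditions, which is part of why retrofitting threads is hard.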
Example Application




        Based on work I did last year and presented at UseR! 2008
        Wrote a real-time and historical market data service from
        Reuters/R
        The real-time interface used the Reuters C++ API
        R extension in C++ that spawned listening thread and
        handled updates
Simplified Architecture




                                R


                         extension (C++)



                           realtime bus
Example Usage



          rsub <- function(duration, items, callback)


   The call rsub will subscribe to the specified rate(s) for the duration
   of time specified by duration (ms). When a tick arrives, the
   callback function callback is invoked, with a data frame
   containing the fields specified in items.

   Multiple market data items may be subscribed to, and any
   combination of fields may be specified.

   Uses the underlying RFA API, which provides a C++ interface to
   real-time market updates.
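
To make the callback contract concrete, here is a sketch of a stand-in with the same argument order. `mock_rsub` is hypothetical and not part of the extension: it makes no Reuters connection, ignores `duration`, and simply generates random values for each requested field. The `n_ticks` parameter is also an assumption added for the demonstration.

```r
# Hypothetical stand-in for rsub: simulated ticks only, no market data feed.
mock_rsub <- function(duration, items, callback, n_ticks = 2) {
  # `duration` mirrors the signature above but is not used in this mock.
  for (i in seq_len(n_ticks)) {
    for (item in items) {
      fields <- item[-(1:2)]   # drop the service name and instrument code
      values <- as.list(round(runif(length(fields), 1, 2), 4))
      names(values) <- fields
      # One-row data frame: instrument code plus one column per field.
      tick <- as.data.frame(c(list(item = item[2]), values),
                            stringsAsFactors = FALSE)
      callback(tick)
    }
  }
  invisible(NULL)
}

fields <- c("BID", "ASK")
items <- list(c("IDN_SELECTFEED", "EUR=", fields),
              c("IDN_SELECTFEED", "GBP=", fields))

# Collect every tick the callback receives.
received <- list()
mock_rsub(1000, items, function(df) received[[length(received) + 1]] <<- df)
print(length(received))
```

The call does not return until all ticks have been delivered, which mirrors the blocking behaviour of a subscription running inside the interpreter thread.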
Real-Time Example


   # Specify field names to retrieve
   fields <- c("BID","ASK","TIMCOR")

   # Subscribe to EUR/USD and GBP/USD ticks
   items <- list()
   items[[1]] <- c("IDN_SELECTFEED", "EUR=", fields)
   items[[2]] <- c("IDN_SELECTFEED", "GBP=", fields)

   # Simple Callback Function
   callback <- function(df) { print(paste("Received",df)) }

   # Subscribe for 1 hour
   ONE_HOUR <- 1000*(60)^2
   rsub(ONE_HOUR, items, callback)
Issues With This Approach




        As R interpreter is single threaded, cannot spawn thread for
        callbacks
        Thus, interpreter thread is locked for the duration of
        subscription
        Not a great user experience
        Need to find alternative mechanism
Alternative Approach



        If we cannot run subscriber threads in-process, need to
        decouple
        Standard approach: add an extra layer and use some form of
        IPC
        For instance, we could:
            Subscribe in a dedicated R process (A)
            Push incoming data onto a socket
            R process (B) reads from a listening socket
        Sockets could also be another IPC primitive, e.g. pipes
        Also note that R supports asynchronous I/O (?isIncomplete)
        Look at the IBrokers package for examples of this
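
The non-blocking read mentioned above can be sketched without a second process. In this example a temporary file stands in for the socket or pipe, and the message format (`"EUR= 1.1010"`) is invented for illustration; only `file`, `readLines` and `isIncomplete` are real R functions.

```r
# A temporary file stands in for the socket or pipe (assumption for illustration).
feed <- tempfile("feed")

# The producer has written one complete line and the start of a second.
cat("EUR= 1.1010\nGBP= 1.6", file = feed)

con <- file(feed, open = "r", blocking = FALSE)

first <- readLines(con)        # returns only the complete line
pending <- isIncomplete(con)   # did the last read stop on a partial line?

# The producer finishes the second line.
cat("020\n", file = feed, append = TRUE)

second <- readLines(con)       # the completed line is now returned whole
close(con)
unlink(feed)

print(first)
print(second)
```

Because the read returns immediately instead of waiting for the rest of the line, the consuming session stays free to do other work between polls.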
The bigmemoRy package



       From the description: "Use C++ to create, store,
       access, and manipulate massive matrices"
       Allows creation of large matrices
       These matrices can be mapped to files/shared memory
       It is the shared memory functionality that we will use
       The next version (3.0) will be unveiled at UseR! 2009

   big.matrix(nrow, ncol, type = "integer", ...)
   shared.big.matrix(nrow, ncol, type = "integer", ...)
   filebacked.big.matrix(nrow, ncol, type = "integer", ...)
Sample Usage




   > library(bigmemory) # Note: I'm using pre-release
   > X <- shared.big.matrix(type="double", ncol=1000, nrow=1000)
   > X
   An object of class “big.matrix”
   Slot "address":
   <pointer: 0x7378a0>
Create Shared Memory Descriptor

   > desc <- describe(X)
   > desc
   $sharedType
   [1] "SharedMemory"

   $sharedName
   [1] "53f14925-dca1-42a8-a547-e1bccae999ce"

   $nrow
   [1] 1000

   $ncol
   [1] 1000

   $rowNames
   NULL
Export the Descriptor




    In R session 1:

    > dput(desc, file="~/matrix.desc")

    In R session 2:

    > library(bigmemory)
    > desc <- dget("~/matrix.desc")
    > X <- attach.big.matrix(desc)

    Now R sessions 1 and 2 share the same big.matrix instance
Share Data Between Sessions




   R session 1:

   > X[1,1] <- 1.2345

   R session 2:

   > X[1,1]
   [1] 1.2345

   Thus, streaming data can be continuously fed into session 1
   and concurrently processed in session 2
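
One way the writer and reader sessions might coordinate is sketched below. The layout is an assumption made for this example, not something bigmemory prescribes: cell `[1, 1]` holds the number of ticks written so far and each later row holds one tick. The helpers `write_tick` and `read_new_ticks` are hypothetical. To keep the sketch runnable without bigmemory it is exercised on an ordinary matrix, and whether it carries over unchanged to a shared `big.matrix` is likewise an assumption.

```r
# Writer side: store the tick in the next free row, then publish the new count.
write_tick <- function(X, tick) {
  n <- X[1, 1]                          # ticks written so far
  stopifnot(n + 2 <= nrow(X))           # no wrap-around in this sketch
  X[n + 2, seq_along(tick)] <- tick
  X[1, 1] <- n + 1                      # update the counter last
  X                                     # returned because plain matrices are copied
}

# Reader side: return the rows written since `seen` ticks, or NULL if none.
read_new_ticks <- function(X, seen) {
  n <- X[1, 1]
  if (n <= seen) return(NULL)
  X[(seen + 2):(n + 1), , drop = FALSE]
}

X <- matrix(0, nrow = 10, ncol = 2)     # stand-in for the shared matrix
X <- write_tick(X, c(1.1010, 1.1012))   # bid, ask
X <- write_tick(X, c(1.1011, 1.1013))

new_ticks <- read_new_ticks(X, seen = 0)
print(new_ticks)
print(is.null(read_new_ticks(X, seen = 2)))
```

The reader would call `read_new_ticks` in a polling loop and remember how many ticks it has processed. There is no locking here, so a real deployment would need to consider a reader observing a partially written row.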
Summary




      Lack of threads not a barrier to concurrent analysis
      Packages like bigmemory, nws, etc. facilitate decoupling via
      IPC
      nws goes a step further, with a distributed workspace
      Many applications for streaming data:
          Data collection/monitoring
          Development of pricing/risk algorithms
          Low-frequency execution (??)
          ...
References




        http://cran.r-project.org/web/packages/bigmemory/
        http://www.cs.uiowa.edu/~luke/R/thrgui/
        http://www.milbo.users.sonic.net/ra/index.html
        http://www.cs.kent.ac.uk/projects/cxxr/
        http://www.theresearchkitchen.com/blog

More Related Content

PDF
Real-TIme Market Data in R
PDF
Introduction to kdb+
PDF
Streaming Data in R
PDF
Creating R Packages
PDF
Real time applications using the R Language
PDF
Using the R Language in BI and Real Time Applications (useR 2015)
PPTX
Big data real time R - useR! 2013 - David Smith
PDF
How to Apply Machine Learning with R, H20, Apache Spark MLlib or PMML to Real...
Real-TIme Market Data in R
Introduction to kdb+
Streaming Data in R
Creating R Packages
Real time applications using the R Language
Using the R Language in BI and Real Time Applications (useR 2015)
Big data real time R - useR! 2013 - David Smith
How to Apply Machine Learning with R, H20, Apache Spark MLlib or PMML to Real...

Viewers also liked (17)

PPT
Pti finish
PPSX
Yoleo
PDF
Kat01 2012
PPTX
Inventario
DOCX
conroling slides by sohar bakhsh
KEY
Equine Emergencies Part 4
PPT
7. susret 17.11.2011. konkretno lice boga oca
PPTX
Figurative Painter - Vicente Romero Redondo
PPTX
slideshow_funerals
PPTX
Best kitchen knives
PPTX
5 Worst States for Identity Theft
PPTX
Food Combining For Beginners.
PDF
Market advertizing
KEY
Unit 1-vocab jarod f
PDF
Safety Meeting Starter (SMS) Aug 2012
PPTX
Why Is Tympanometry Performed?
DOC
Unit fourteen will future
Pti finish
Yoleo
Kat01 2012
Inventario
conroling slides by sohar bakhsh
Equine Emergencies Part 4
7. susret 17.11.2011. konkretno lice boga oca
Figurative Painter - Vicente Romero Redondo
slideshow_funerals
Best kitchen knives
5 Worst States for Identity Theft
Food Combining For Beginners.
Market advertizing
Unit 1-vocab jarod f
Safety Meeting Starter (SMS) Aug 2012
Why Is Tympanometry Performed?
Unit fourteen will future
Ad

Similar to Streaming Data and Concurrency in R (20)

PPTX
Introduction to R
PPTX
DOC-20240829-WA0001 power point presentation
PPT
Basics of R-Programming with example.ppt
PPT
Basocs of statistics with R-Programming.ppt
PPT
R-Programming.ppt it is based on R programming language
PDF
effectivegraphsmro1
PPT
R programming by ganesh kavhar
PPT
R Programming for Statistical Applications
PPT
R-programming with example representation.ppt
PDF
R - the language
PDF
Intro to R for SAS and SPSS User Webinar
PDF
An Analytics Toolkit Tour
PDF
Introduction to R
PPTX
Big data analytics using R
PDF
a_very_brief_introduction_to_r.pdfhshkdjdn
PPTX
The Powerful Marriage of Hadoop and R (David Champagne)
PPTX
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
PDF
Open source analytics
PPTX
Big data analytics with R tool.pptx
PPTX
Analytics Beyond RAM Capacity using R
Introduction to R
DOC-20240829-WA0001 power point presentation
Basics of R-Programming with example.ppt
Basocs of statistics with R-Programming.ppt
R-Programming.ppt it is based on R programming language
effectivegraphsmro1
R programming by ganesh kavhar
R Programming for Statistical Applications
R-programming with example representation.ppt
R - the language
Intro to R for SAS and SPSS User Webinar
An Analytics Toolkit Tour
Introduction to R
Big data analytics using R
a_very_brief_introduction_to_r.pdfhshkdjdn
The Powerful Marriage of Hadoop and R (David Champagne)
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Open source analytics
Big data analytics with R tool.pptx
Analytics Beyond RAM Capacity using R
Ad

Recently uploaded (20)

PPTX
Spectroscopy.pptx food analysis technology
PDF
Mushroom cultivation and it's methods.pdf
PPTX
TLE Review Electricity (Electricity).pptx
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
NewMind AI Weekly Chronicles - August'25-Week II
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
Approach and Philosophy of On baking technology
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PDF
Empathic Computing: Creating Shared Understanding
PPT
Teaching material agriculture food technology
PPTX
A Presentation on Artificial Intelligence
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PDF
Video forgery: An extensive analysis of inter-and intra-frame manipulation al...
PDF
Encapsulation_ Review paper, used for researhc scholars
PDF
August Patch Tuesday
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PDF
Assigned Numbers - 2025 - Bluetooth® Document
Spectroscopy.pptx food analysis technology
Mushroom cultivation and it's methods.pdf
TLE Review Electricity (Electricity).pptx
Digital-Transformation-Roadmap-for-Companies.pptx
NewMind AI Weekly Chronicles - August'25-Week II
Mobile App Security Testing_ A Comprehensive Guide.pdf
Approach and Philosophy of On baking technology
Reach Out and Touch Someone: Haptics and Empathic Computing
Empathic Computing: Creating Shared Understanding
Teaching material agriculture food technology
A Presentation on Artificial Intelligence
Diabetes mellitus diagnosis method based random forest with bat algorithm
Video forgery: An extensive analysis of inter-and intra-frame manipulation al...
Encapsulation_ Review paper, used for researhc scholars
August Patch Tuesday
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
Per capita expenditure prediction using model stacking based on satellite ima...
Assigned Numbers - 2025 - Bluetooth® Document

Streaming Data and Concurrency in R

  • 1. Streaming Data, Concurrency And R Rory Winston rory@theresearchkitchen.com
  • 2. About Me Independent Software Consultant M.Sc. Applied Computing, 2000 M.Sc. Finance, 2008 Apache Committer Working in the financial sector for the last 7 years or so Interested in practical applications of functional languages and machine learning Relatively recent convert to R ( ≈ 2 years)
  • 3. R - Pros and Cons Pro Designed by statisticians Con Can be extremely elegant Designed by statisticians Comprehensive extension Can be clunky (S4) library Bewildering array of Open-source overlapping extensions Huge parallelization effort Inherently single-threaded Fantastic reporting Incredibly Popular capabilities Incredibly Popular
  • 18. Parallelization vs. Concurrency R interpreter is single threaded Some historical context for this (BLAS implementations) Not necessarily a limitation in the general context Multithreading can be complex and problematic Instead a focus on parallelization: Distributed computation: gridR, nws, snow Multicore/multi-cpu scaling: Rmpi, Romp, pnmath/pnmath0 Interfaces to Pthreads/PBLAS/OpenMP/MPI/Globus/etc. Parallelization suits cpu-bound large data processing applications
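The parallelization style described on this slide can be sketched as follows. This is a minimal illustration, assuming the base `parallel` package (which absorbed much of snow's interface after this talk and is not named in the slides); the worker function `simulate_mean` and the cluster size are made up for the example.

```r
library(parallel)

# Hypothetical CPU-bound task: each call is independent of the others,
# which is what makes it a good fit for parallelization.
simulate_mean <- function(seed) {
  set.seed(seed)
  mean(rnorm(1e5))
}

# Start two worker R processes (each one is still single-threaded).
cl <- makeCluster(2)
par_res <- parLapply(cl, 1:4, simulate_mean)
stopCluster(cl)

# The same computation run sequentially in the master process.
seq_res <- lapply(1:4, simulate_mean)
```

Note that the workers are separate processes, not threads: the work is split and gathered, but nothing runs concurrently inside the calling interpreter.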
  • 19. Other Scalability and Performance Work JIT/bytecode compilation (Ra) Implicit vectorization a la Matlab (code analysis) Large (≥ RAM) dataset handling (bigmemory,ff) Many incremental performance improvements (e.g. less internal copying) Next: GPU/massive multicore...?
  • 20. What Benefit Concurrency? Real-time (streaming to be more precise) data analysis Growing Interest in using R for streaming data, not just offline analysis GUI toolkit integration Fine-grained control over independent task execution "I believe that explicit concurrency management tools (i.e. a threads toolkit) are what we really need in R at this point." - Luke Tierney, 2001
  • 21. Will There Be A Multithreaded R? Short answer is: probably not At least not in its current incarnation Internal workings of the interpreter not particularly amenable to concurrency: Functions can manipulate caller state (<<- vs. <-) Lazy evaluation machinery (promises) Dynamic State, garbage collection, etc. Scoping: global environments Management of resources: streams, I/O, connections, sinks Implications for current code Possibly in the next language evolution (cf. Ihaka?) Large amount of work (but potentially do-able)
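Two of the obstacles listed on this slide can be seen in a few lines of base R. This is only an illustrative sketch; the names `make_counter`, `lazy`, and `evaluated` are invented for the example.

```r
# 1. `<<-` lets a function modify state outside its own frame.
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1   # updates 'count' in the enclosing environment
    count
  }
}
tick <- make_counter()
tick()
second <- tick()          # the closure's shared state has advanced to 2

# 2. Arguments are promises: the expression is evaluated only when first
#    used, and it runs in the caller's environment at that moment.
evaluated <- FALSE
lazy <- function(x) {
  before <- evaluated     # promise not forced yet
  force(x)                # evaluates the argument expression now
  c(before = before, after = evaluated)
}
res <- lazy({ evaluated <- TRUE; 1 })
```

With several threads, both the hidden update to `count` and the deferred side effect inside the promise would need synchronization that the interpreter does not provide.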
  • 33. Example Application Based on work I did last year and presented at UseR! 2008 Wrote a real-time and historical market data service from Reuters/R The real-time interface used the Reuters C++ API R extension in C++ that spawned listening thread and handled updates
  • 34. Simplified Architecture R extension (C++) realtime bus
  • 35. Example Usage rsub <- function(duration, items, callback) The call rsub will subscribe to the specified rate(s) for the duration of time specified by duration (ms). When a tick arrives, the callback function callback is invoked, with a data frame containing the fields specified in items. Multiple market data items may be subscribed to, and any combination of fields may be specified. Uses the underlying RFA API, which provides a C++ interface to real-time market updates.
  • 36. Real-Time Example
      # Specify field names to retrieve
      fields <- c("BID","ASK","TIMCOR")
      # Subscribe to EUR/USD and GBP/USD ticks
      items <- list()
      items[[1]] <- c("IDN_SELECTFEED", "EUR=", fields)
      items[[2]] <- c("IDN_SELECTFEED", "GBP=", fields)
      # Simple Callback Function
      callback <- function(df) { print(paste("Received",df)) }
      # Subscribe for 1 hour
      ONE_HOUR <- 1000*(60)^2
      rsub(ONE_HOUR, items, callback)
  • 37. Issues With This Approach As R interpreter is single threaded, cannot spawn thread for callbacks Thus, interpreter thread is locked for the duration of subscription Not a great user experience Need to find alternative mechanism
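The blocking behaviour can be imitated without the Reuters extension. The sketch below is a pure-R stand-in, not the author's `rsub`: `mock_rsub`, its `interval` parameter, and the random quotes it emits are all hypothetical and exist only to show that the calling session cannot do anything else until the subscription ends.

```r
# Hypothetical stand-in for rsub(): emits simulated ticks until
# `duration` milliseconds have elapsed, calling `callback` for each one.
mock_rsub <- function(duration, items, callback, interval = 50) {
  deadline <- Sys.time() + duration / 1000
  n <- 0L
  while (Sys.time() < deadline) {
    for (item in items) {
      # Simulated quote; real values would come from the data feed.
      tick <- data.frame(ITEM = item[2], BID = runif(1), ASK = runif(1))
      callback(tick)
      n <- n + 1L
    }
    Sys.sleep(interval / 1000)
  }
  invisible(n)            # number of ticks delivered
}

received <- list()
collect <- function(df) received[[length(received) + 1]] <<- df

# Control returns to the prompt only after the 200 ms have passed.
n_ticks <- mock_rsub(200, list(c("IDN_SELECTFEED", "EUR=")), collect)
```

For a one-hour subscription the prompt would be unavailable for the full hour, which is the user-experience problem the slide describes.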
  • 38. Alternative Approach If we cannot run subscriber threads in-process, need to decouple Standard approach: add an extra layer and use some form of IPC For instance, we could: Subscribe in a dedicated R process (A) Push incoming data onto a socket R process (B) reads from a listening socket Sockets could also be another IPC primitive, e.g. pipes Also note that R supports asynchronous I/O (?isIncomplete) Look at the IBrokers package for examples of this
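The socket hand-off between processes A and B might look roughly like this. To keep the sketch runnable in one session, both ends are opened in the same process; it assumes R 4.0.0 or later for `serverSocket()` and `socketAccept()`, and the port number and the comma-separated tick format are arbitrary choices, not part of the original design.

```r
port <- 6011L                      # arbitrary free port (assumption)

# Process B would own the listening socket.
srv <- serverSocket(port)

# Process A (the subscriber) connects and pushes ticks as text lines.
pub <- socketConnection("localhost", port = port, blocking = TRUE)

# Process B accepts the pending connection.
sub <- socketAccept(srv, blocking = TRUE)

# A: publish one made-up EUR= quote as "item,bid,ask".
writeLines("EUR=,1.3921,1.3923", pub)

# B: read the line and split it into fields.
tick <- readLines(sub, n = 1)
fields <- strsplit(tick, ",")[[1]]

close(pub); close(sub); close(srv)
```

In a real deployment the two halves would live in separate R sessions, and process B could open its connection with `blocking = FALSE` and poll, so that its prompt stays responsive between ticks.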
  • 39. The bigmemoRy package From the description: "Use C++ to create, store, access, and manipulate massive matrices" Allows creation of large matrices These matrices can be mapped to files/shared memory It is the shared memory functionality that we will use The next version (3.0) will be unveiled at UseR! 2009 big.matrix(nrow, ncol, type = "integer", ....) shared.big.matrix(nrow, ncol, type = "integer", ...) filebacked.big.matrix(nrow, ncol, type = "integer", ...)
  • 40. Sample Usage
      > library(bigmemory) # Note: I'm using pre-release
      > X <- shared.big.matrix(type="double", ncol=1000, nrow=1000)
      > X
      An object of class “big.matrix”
      Slot "address":
      <pointer: 0x7378a0>
  • 41. Create Shared Memory Descriptor
      > desc <- describe(X)
      > desc
      $sharedType
      [1] "SharedMemory"
      $sharedName
      [1] "53f14925-dca1-42a8-a547-e1bccae999ce"
      $nrow
      [1] 1000
      $ncol
      [1] 1000
      $rowNames
      NULL
  • 42. Export the Descriptor
      In R session 1:
      > dput(desc, file="~/matrix.desc")
      In R session 2:
      > library(bigmemory)
      > desc <- dget("~/matrix.desc")
      > X <- attach.big.matrix(desc)
      Now R sessions 1 and 2 share the same big.matrix instance
  • 43. Share Data Between Sessions
      R session 1:
      > X[1,1] <- 1.2345
      R session 2:
      > X[1,1]
      [1] 1.2345
      Thus, streaming data can be continuously fed into session 1
      And concurrently processed in session 2
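The slides use a pre-release API. As far as I know, released versions of bigmemory no longer export `shared.big.matrix()` and instead take a `shared` argument on `big.matrix()`; treat that as an assumption to check against your installed version. The sketch below compresses the two-session workflow into a single session: `Y` plays the role of the matrix attached in the second session, and the matrix size is arbitrary.

```r
library(bigmemory)   # assumes the bigmemory package is installed

# "Session 1": create a shared-memory matrix and describe it.
X <- big.matrix(nrow = 3, ncol = 2, type = "double", init = 0, shared = TRUE)
desc <- describe(X)

# "Session 2": attach to the same memory via the descriptor.
# Across real sessions the descriptor would travel via dput()/dget().
Y <- attach.big.matrix(desc)

# A write through one handle is visible through the other.
X[1, 1] <- 1.2345
shared_value <- Y[1, 1]
```

Nothing here coordinates readers and writers, so a consumer polling the matrix needs its own convention (for example a counter cell updated last) to know when a row is complete.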
  • 44. Summary Lack of threads not a barrier to concurrent analysis Packages like bigmemory, nws, etc. facilitate decoupling via IPC nws goes a step further, with a distributed workspace Many applications for streaming data: Data collection/monitoring Development of pricing/risk algorithms Low-frequency execution (??) ...
  • 45. References http://cran.r-project.org/web/packages/bigmemory/ http://www.cs.uiowa.edu/~luke/R/thrgui/ http://www.milbo.users.sonic.net/ra/index.html http://www.cs.kent.ac.uk/projects/cxxr/ http://www.theresearchkitchen.com/blog