R Analytics
in the Cloud
Introduction
• Radek Maciaszek
   • DataMine Lab (www.dataminelab.com) - Data mining, business intelligence and data warehouse consultancy.
   • MSc in Bioinformatics at Birkbeck, University of London.
   • Project at UCL Institute of Healthy Ageing under supervision of Dr Eugene Schuster.

Primer in Bioinformatics
• Bioinformatics - applying computer science to biology (DNA, proteins, drug discovery, etc.)
• Ageing strategy – solve the problem in a simple organism and apply the findings to more complex organisms (e.g. humans).
• Goal: find the genes responsible for ageing.
[Figure: Caenorhabditis elegans]

Central dogma of molecular biology
Genes are encoded by the DNA.
[Figure: microarray (100 x 100)]
• Database of 50 curated experiments.
• 10k genes compared with each other.

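The counts on this slide can be turned into a quick back-of-the-envelope estimate of the workload. This is only a sketch: it assumes exactly 10,000 probes (the 100 x 100 microarray) and 50 experiments, and the variable names are illustrative, not taken from the project.

```r
# Assumed sizes, read off the slide: 100 x 100 microarray, 50 experiments.
n.probes <- 100 * 100
n.experiments <- 50

# Comparing every probe with every probe (ordered pairs, self included).
n.ordered.pairs <- n.probes^2

# Unique unordered pairs, if each pair were tested only once.
n.unique.pairs <- choose(n.probes, 2)

cat("Values in the expression matrix:",
    format(n.probes * n.experiments, big.mark = ",", scientific = FALSE), "\n")
cat("All-against-all comparisons:",
    format(n.ordered.pairs, big.mark = ",", scientific = FALSE), "\n")
cat("Unique pairs:",
    format(n.unique.pairs, big.mark = ",", scientific = FALSE), "\n")
```

The input data is small, but the number of comparisons grows with the square of the number of probes, which is why the computation rather than the data is the bottleneck.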
Why R?
• Very popular in bioinformatics
• Functional scripting language
• Swiss-army knife for statisticians
• Designed by statisticians for statisticians
• Lots of ready-to-use packages (CRAN)

R limitations & Hadoop
• Data needs to fit in memory
• Single-threaded
• Hadoop integration:
   • Hadoop Streaming
   • Rhipe: http://ml.stat.purdue.edu/rhipe/
   • Segue: http://code.google.com/p/segue/

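The memory limit is easier to judge with a rough number. The sketch below assumes the result of an all-against-all comparison of 10,000 probes is kept as one dense matrix of doubles; both the size and the storage layout are assumptions made for illustration.

```r
n.probes <- 10000          # assumed: one row per probe on a 100 x 100 microarray
bytes.per.double <- 8      # R stores numeric values as 8-byte doubles

# Size of a dense probes x probes matrix of p-values.
result.bytes <- n.probes^2 * bytes.per.double
result.mib <- result.bytes / 1024^2
cat(sprintf("Full p-value matrix: about %.0f MiB\n", result.mib))

# Compare the estimate with the measured size of a small matrix.
small <- matrix(0, nrow = 100, ncol = 100)
print(object.size(small), units = "Kb")
```

A result of this size still fits in memory on a large machine, so for this workload the single-threaded execution is the more pressing limit.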
Segue
• Works with Amazon Elastic MapReduce.
• Creates a cluster for you.
• Designed for Big Computations (rather than Big Data).
• Implements a cloud version of the lapply() function.

Segue workflow (emrlapply)
[Figure labels: List (local), List (remote), Amazon AWS]

A very quick R example
m <- list(a = 1:10, b = exp(-3:3))
lapply(m, mean)
$a
[1] 5.5
$b
[1] 4.535125

lapply(X, FUN) returns a list of the same length as X,
each element of which is the result of applying FUN to
the corresponding element of X.
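As a further illustration, lapply() also accepts a user-defined function, and it is equivalent to an explicit loop over the list. The sketch below reuses the list m from above; the names spread and spread.loop are made up for this example.

```r
m <- list(a = 1:10, b = exp(-3:3))

# lapply with an anonymous function: max minus min of each element.
spread <- lapply(m, function(x) max(x) - min(x))

# The same computation written as an explicit loop, for comparison.
spread.loop <- vector("list", length(m))
names(spread.loop) <- names(m)
for (name in names(m)) {
  spread.loop[[name]] <- max(m[[name]]) - min(m[[name]])
}

# Both approaches should produce the same named list.
print(identical(spread, spread.loop))
```

Because each element is processed independently of the others, the work can be split across machines, which is the property that emrlapply relies on.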
Segue – large scale example
# For one probe, test its correlation against every probe in the matrix
# and collect the p-values.
AnalysePearsonCorelation <- function(probe) {
  A.vector <- experiments.matrix[probe,]
  p.values <- c()
  for(probe.name in rownames(experiments.matrix)) {
     B.vector <- experiments.matrix[probe.name,]
     p.values <- c(p.values, cor.test(A.vector, B.vector)$p.value)
  }
  return (p.values)
}

# probes: the RNA probes. Run the analysis locally, one probe at a time.
pearson.cor <- lapply(probes, AnalysePearsonCorelation)

Moving to the cloud in 3 lines of code!

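To try the local version without the real data, the function can be run on a small simulated matrix. This is a sketch under stated assumptions: the 5 x 50 random matrix is a hypothetical stand-in for experiments.matrix, and the loop is replaced by sapply(), which is a substitution made here rather than the code from the slide.

```r
set.seed(42)  # reproducible toy data

# Hypothetical stand-in for the real data: 5 probes x 50 experiments.
experiments.matrix <- matrix(rnorm(5 * 50), nrow = 5,
                             dimnames = list(paste0("probe", 1:5), NULL))
probes <- rownames(experiments.matrix)

AnalysePearsonCorelation <- function(probe) {
  A.vector <- experiments.matrix[probe, ]
  # One p-value per row name; sapply avoids growing a vector in a loop.
  sapply(rownames(experiments.matrix), function(probe.name) {
    cor.test(A.vector, experiments.matrix[probe.name, ])$p.value
  })
}

pearson.cor <- lapply(probes, AnalysePearsonCorelation)

# Stack the per-probe vectors into a probes x probes matrix of p-values.
p.matrix <- do.call(rbind, pearson.cor)
rownames(p.matrix) <- probes
print(dim(p.matrix))
print(round(p.matrix, 3))
```

Checking the function on a handful of probes like this is a cheap way to catch mistakes before paying for a cluster.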
Segue – large scale example
# Assumes the segue package is loaded and AWS credentials are configured.
AnalysePearsonCorelation <- function(probe) {
  A.vector <- experiments.matrix[probe,]
  p.values <- c()
  for(probe.name in rownames(experiments.matrix)) {
     B.vector <- experiments.matrix[probe.name,]
     p.values <- c(p.values, cor.test(A.vector, B.vector)$p.value)
  }
  return (p.values)
}

# probes: the RNA probes. The local call is replaced by the three Segue calls below:
# pearson.cor <- lapply(probes, AnalysePearsonCorelation)
# copy.image=TRUE copies the local workspace (including experiments.matrix) to the nodes.
myCluster <- createCluster(numInstances=5, masterBidPrice="0.68",
               slaveBidPrice="0.68", masterInstanceType="c1.xlarge",
               slaveInstanceType="c1.xlarge", copy.image=TRUE)
pearson.cor <- emrlapply(myCluster, probes, AnalysePearsonCorelation)
stopCluster(myCluster)

Discovering genes
[Figure: topomaps of clustered genes]
This work was based on a similar approach to:
A Gene Expression Map for Caenorhabditis elegans, Stuart K. Kim, et al., Science 293, 2087 (2001)

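The idea of grouping genes by how similarly they behave across experiments can be sketched with base R. Note the assumptions: this uses plain hierarchical clustering on a correlation distance, which is a stand-in technique and not the topomap method of the cited paper, and the six simulated genes are invented for the example.

```r
set.seed(1)  # reproducible toy data

# Two hypothetical expression patterns across 50 experiments.
pattern.a <- rnorm(50)
pattern.b <- rnorm(50)

# Six simulated genes: three follow each pattern, plus a little noise.
expr <- rbind(
  gene1 = pattern.a + rnorm(50, sd = 0.1),
  gene2 = pattern.a + rnorm(50, sd = 0.1),
  gene3 = pattern.a + rnorm(50, sd = 0.1),
  gene4 = pattern.b + rnorm(50, sd = 0.1),
  gene5 = pattern.b + rnorm(50, sd = 0.1),
  gene6 = pattern.b + rnorm(50, sd = 0.1)
)

# cor() works on columns, so transpose to correlate genes (rows).
gene.cor <- cor(t(expr))

# Turn correlation into a distance and cluster the genes.
gene.dist <- as.dist(1 - gene.cor)
tree <- hclust(gene.dist, method = "average")
clusters <- cutree(tree, k = 2)
print(clusters)
```

Genes that follow the same pattern end up in the same cluster, which is the kind of grouping a gene expression map makes visible.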
Conclusions
• R is great for statistics.
• It’s easy to scale up R using Segue.
• We are all going to live very long.

Thanks!
• Questions?

• References:
   http://code.google.com/r/radek-segue/
   http://www.dataminelab.com


Editor's Notes

  • #5: Check Segue, LISP, R, circle