© Hortonworks Inc. 2012
Enabling R on Hadoop
July 11, 2013
Your Presenters
Ravi Mutyala
Systems Architect
Paul Codding
Solutions Engineer
Agenda
• A Brief History of R
• How R is Typically Used
• How R is Used with Hadoop
• Getting Started
A Brief History of R
History of R
• S
– 1976: S created by John Chambers, implemented in Fortran
– 1988: S V3 written in C & statistical models included
– 1998: S V4
• R
– 1991: R created by Ross Ihaka & Robert Gentleman
– 1997: R Core Group formed
– 2000: R Version 1.0 released
How R is Typically Used
Main Uses of R
• Statistical Analysis & Modeling
– Classification
– Scoring
– Ranking
– Clustering
– Finding relationships
– Characterization
• Common Uses
– Interactive Data Analysis
– General Purpose Statistics
– Predictive Modeling
How R is Used with Hadoop
Hadoop Components
[Diagram: Hortonworks Data Platform (HDP) stack]
• Hadoop Core (distributed storage & processing): HDFS, YARN (in 2.0), WebHDFS, MapReduce
• Data Services (store, process and access data): HCatalog, Hive, Pig, HBase, Sqoop, Flume
• Operational Services (manage & operate at scale): Oozie, Ambari
• Platform Services (enterprise readiness): HA, DR, snapshots, security, …
• Deployment targets: OS, cloud, VM, appliance
Hadoop Components & R
Data Service Components
•  Hive
•  HBase
Hadoop Core
•  Map Reduce
•  HDFS
Options for R on Hadoop
• Options
– RODBC/RJDBC
– RHive
– RHadoop
• Analysis
– Focus
– Integration Ease
– Benefits
– Limitations
RODBC/RJDBC
• Focus
– SQL Access from R
• Integration Ease
– Install Hortonworks Hive ODBC Driver
– Install Hive libraries
• Benefits
– Low impact on existing R scripts leveraging other DB packages
– Not required to install Hadoop configuration/binaries on client machines
• Limitations
– Parallelism limited to Hive
– Result set size (the full result is returned to the R client)
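The SQL-access path can be sketched roughly as follows. This is a minimal illustration, assuming the Hive ODBC driver is installed and a data source is configured; the DSN name hive_dsn, the helper name query_hive, and the table sample_07 are placeholders chosen for this example and do not come from the slides.

```r
# Sketch only: run a HiveQL query from R through an ODBC data source.
# "hive_dsn" is a hypothetical DSN that would be configured for the
# Hive ODBC driver; nothing here needs Hadoop binaries on the client.
query_hive <- function(sql, dsn = "hive_dsn") {
  channel <- RODBC::odbcConnect(dsn)
  # odbcConnect() returns -1 (with a warning) when the connection fails.
  if (!inherits(channel, "RODBC")) stop("could not connect to DSN: ", dsn)
  on.exit(RODBC::odbcClose(channel))
  # The whole result set is pulled into a local data frame, so keep it small.
  RODBC::sqlQuery(channel, sql, stringsAsFactors = FALSE)
}

# Hypothetical usage; aggregate or LIMIT in Hive rather than in R:
# top_rows <- query_hive("SELECT * FROM sample_07 LIMIT 10")
```

Because the query runs in Hive and only the result travels back, any parallelism comes from Hive itself, which is the limitation the slide names.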
Deployment Considerations
[Diagram: worker nodes each run TaskTracker (TT) and DataNode (DN); master services are JobTracker (JT), NameNode (NN), and HiveServer (HS)]
RHive
• Focus
– Broad access to Hive and HDFS
• Integration Ease
– Requires Hadoop binaries, libraries, and configuration files on client machines
– Uses Java DFS Client and HiveServer
• Benefits
– Wide range of features expressed through HQL
– rhive-apply: R distributed apply function using HQL
• Limitations
– Requires heavy client deployment
– Dependent on HiveServer, and can’t be used with HiveServer2
Deployment Considerations
[Diagram: worker nodes each run TaskTracker (TT) and DataNode (DN); master services are JobTracker (JT), NameNode (NN), and HiveServer (HS); R runs on a separate R Edge Node]
RHadoop
• Focus
– Tight integration with core Hadoop components
• Benefits
– Ability to run R on a massively distributed system
– Ability to work with full data sets instead of sample sets
• Additional Information
– https://github.com/RevolutionAnalytics/RHadoop/wiki
RHadoop Architecture
[Diagram: an R client uses rhdfs to reach HDFS, rhbase to reach HBase through the HBase Thrift Gateway, and rmr2 to run MapReduce jobs through Hadoop Streaming, which launches R processes on the cluster]
rhdfs
• Access HDFS from R
• Read from HDFS to R dataframe
• Write from R dataframe to HDFS
• 1.0.6 adds support for Windows (using HDP)
rhdfs
• Hadoop CLI commands & rhdfs equivalents
• hadoop fs -ls /
– hdfs.ls("/")
• hadoop fs -mkdir /user/rhdfs/ppt
– hdfs.mkdir("/user/rhdfs/ppt")
• hadoop fs -put 1.txt /user/rhdfs/ppt/
– localData <- system.file(file.path("unitTestData", "1.txt"), package="rhdfs")
– hdfs.put(localData, "/user/rhdfs/ppt/1.txt")
• hadoop fs -get /user/rhdfs/ppt/1.txt 1.txt
– hdfs.get("/user/rhdfs/ppt/1.txt", "test")
• hadoop fs -rm /user/rhdfs/ppt/1.txt
– hdfs.delete("/user/rhdfs/ppt/1.txt")
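The individual calls above can be strung together into one session, sketched below. This assumes rhdfs is installed and that the HADOOP_CMD environment variable points at the hadoop executable, which rhdfs needs before hdfs.init() can succeed. The function name rhdfs_round_trip and its default directory are illustrative only.

```r
# Sketch only: the rhdfs calls from the slide, in the order a session would
# run them. Assumes rhdfs is installed and HADOOP_CMD is set; the function
# name and the default HDFS directory are illustrative.
rhdfs_round_trip <- function(local_file, hdfs_dir = "/user/rhdfs/ppt") {
  if (!nzchar(Sys.getenv("HADOOP_CMD"))) {
    stop("set HADOOP_CMD to the path of the hadoop executable first")
  }
  rhdfs::hdfs.init()                      # must run before other hdfs.* calls
  rhdfs::hdfs.mkdir(hdfs_dir)
  target <- paste(hdfs_dir, basename(local_file), sep = "/")
  rhdfs::hdfs.put(local_file, target)     # local -> HDFS
  listing <- rhdfs::hdfs.ls(hdfs_dir)     # directory listing
  copy <- tempfile()
  rhdfs::hdfs.get(target, copy)           # HDFS -> local
  rhdfs::hdfs.delete(target)              # remove the HDFS copy again
  list(listing = listing, local_copy = copy)
}

# Hypothetical usage on a machine with a configured Hadoop client:
# res <- rhdfs_round_trip("1.txt")
```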
rhbase
• Access and change data within HBase
• Uses Thrift API
• Command Examples
– hb.new.table
– hb.insert
– hb.scan.ex
– hb.scan
rmr2
• Enables writing MapReduce jobs using R
• Ability to parallelize algorithms
• Ability to use big data sets without needing to sample data
• mapreduce(input, output, map, reduce, …)
• Reduce takes a key and a collection of values, which could be a vector, list, data frame, or matrix
• 2.2.1 adds support for Windows (using HDP)
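To make the map and reduce contract concrete, here is a small base-R imitation of the flow: map emits key/value pairs, values are grouped by key, and reduce is called once per key with all of that key's values. It is not rmr2 and does not use Hadoop; simulate_mapreduce and its argument shapes are hypothetical names invented for this sketch.

```r
# Sketch only: a base-R imitation of the map -> group-by-key -> reduce flow.
# This is NOT rmr2; simulate_mapreduce and its arguments are hypothetical.
simulate_mapreduce <- function(input, map, reduce) {
  # Map phase: each call returns list(key = ..., val = ...) of equal length.
  mapped <- lapply(input, map)
  keys <- unlist(lapply(mapped, `[[`, "key"))
  vals <- unlist(lapply(mapped, `[[`, "val"))
  # Shuffle: collect every value emitted under the same key.
  grouped <- split(vals, keys)
  # Reduce phase: one call per key, receiving the key and all of its values.
  reduced <- mapply(reduce, names(grouped), grouped, SIMPLIFY = FALSE)
  unlist(reduced)
}

lines <- c("r on hadoop", "hadoop and r")
word_counts <- simulate_mapreduce(
  input  = lines,
  map    = function(line) {
    words <- strsplit(line, " ")[[1]]
    list(key = words, val = rep(1, length(words)))
  },
  reduce = function(word, counts) sum(counts)
)
print(word_counts)
```

In rmr2 the same roles are played by keyval() in the map function and by the (key, values) arguments of the reduce function, with Hadoop doing the grouping across nodes.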
Sample code - wordcount
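library(rmr2)

# wordcount() supplies input, output and the pattern used to split lines.
wordcount =
  function(input, output = NULL, pattern = " ") {
    # Map: split each line into words and emit (word, 1).
    wc.map =
      function(., lines) {
        keyval(
          unlist(
            strsplit(
              x = lines,
              split = pattern)),
          1)}
    # Reduce: sum the counts emitted for each word.
    wc.reduce =
      function(word, counts) {
        keyval(word, sum(counts))}
    mapreduce(
      input = input,
      output = output,
      input.format = "text",
      map = wc.map,
      reduce = wc.reduce,
      # The reducer also serves as the combiner.
      combine = TRUE)}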
More Sample Code
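# 50 draws from a Binomial(size = 32, prob = 0.4) distribution
groups = rbinom(32, n = 50, prob = 0.4)
# Count the occurrences of each value locally
tapply(groups, groups, length)
# Same count as a MapReduce job: write the vector to the DFS, emit (value, 1),
# then count the values collected for each key
groups = to.dfs(groups)
from.dfs(
  mapreduce(
    input = groups,
    map = function(., v) keyval(v, 1),
    reduce =
      function(k, vv)
        keyval(k, length(vv))))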
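One way to try this example without a cluster is rmr2's local backend, sketched below under the assumption that rmr2 is installed; when it is not, the sketch falls back to the local tally only. The seed value and the variable names are arbitrary choices for this illustration.

```r
# Sketch only: compare the local tally with the MapReduce tally. Assumes rmr2
# is installed; backend = "local" runs the job without a Hadoop cluster.
set.seed(1)                                       # arbitrary seed
groups <- rbinom(n = 50, size = 32, prob = 0.4)   # same call, arguments named
local_counts <- tapply(groups, groups, length)

if (requireNamespace("rmr2", quietly = TRUE)) {
  library(rmr2)
  rmr.options(backend = "local")
  result <- from.dfs(
    mapreduce(
      input  = to.dfs(groups),
      map    = function(., v) keyval(v, 1),
      reduce = function(k, vv) keyval(k, length(vv))))
  # Order the MapReduce output by key so it lines up with tapply().
  mr_counts <- values(result)[order(keys(result))]
  print(cbind(local = as.vector(local_counts), mapreduce = mr_counts))
} else {
  print(local_counts)
}
```

Switching the same script to a cluster is then a matter of setting the backend to "hadoop" on a machine with a configured Hadoop client.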
Deployment Considerations
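[Diagram: worker nodes each run TaskTracker (TT), DataNode (DN), RegionServer (RS), and R; master services are JobTracker (JT), NameNode (NN), and the HBase Thrift Gateway (HTG); R also runs on an R Edge Node]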
RHadoop
• Limitations
– Requires installation of R on all TaskTracker nodes
– Does not automatically parallelize algorithms
– Different slot/memory configuration recommended to leave memory and CPU resources for R
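The slot/memory point can be illustrated with some back-of-the-envelope arithmetic. Since rmr2 tasks run R through Hadoop Streaming, each task slot needs room for an R process in addition to the task JVM. The helper slots_per_node and every number below are hypothetical, chosen only to show the shape of the calculation.

```r
# Sketch only: rough per-node slot count when each task needs a JVM plus an
# R process. All figures are hypothetical; measure your own workloads.
slots_per_node <- function(node_mem_gb, reserved_gb, jvm_heap_gb, r_mem_gb = 0) {
  usable <- node_mem_gb - reserved_gb        # memory left after OS and daemons
  floor(usable / (jvm_heap_gb + r_mem_gb))   # each slot costs JVM heap + R
}

without_r <- slots_per_node(node_mem_gb = 48, reserved_gb = 8, jvm_heap_gb = 2)
with_r    <- slots_per_node(node_mem_gb = 48, reserved_gb = 8, jvm_heap_gb = 2,
                            r_mem_gb = 2)
print(c(without_r = without_r, with_r = with_r))
```

Under these assumed figures, reserving as much memory for R as for the JVM halves the number of slots a node can safely offer.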
[Diagram: node resources shared between the OS and MapReduce, compared with a node where the OS, MapReduce, and R share them]
Getting Started
Your Fastest On-ramp to Enterprise Hadoop™!
http://hortonworks.com/products/hortonworks-sandbox/
The Sandbox lets you experience Apache Hadoop from the convenience of your own
laptop – no data center, no cloud and no internet connection needed!
The Hortonworks Sandbox is:
•  A free download: http://hortonworks.com/products/hortonworks-sandbox/
•  A complete, self-contained virtual machine with Apache Hadoop pre-configured
•  A personal, portable and standalone Hadoop environment
•  A set of hands-on, step-by-step tutorials that allow you to learn and explore Hadoop
Installation
• Install R on all nodes
• Install dependent packages
– RJSONIO
– itertools
– digest
– Rcpp
– rJava
– functional
– RCurl
– httr
– plyr
• Download & Install RHadoop Packages
– rmr2
– rhdfs
– rhbase (requires Thrift)
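The installation steps might look roughly like this from an R prompt. The sketch only reports which of the listed dependencies are missing; the install commands are left commented out because they need network access and must be repeated on every node. The archive file names are placeholders for whichever RHadoop versions are downloaded.

```r
# Sketch only: report which of the dependencies listed above are missing.
deps <- c("RJSONIO", "itertools", "digest", "Rcpp", "rJava",
          "functional", "RCurl", "httr", "plyr")
missing_deps <- setdiff(deps, rownames(installed.packages()))
print(missing_deps)

# To install (run on every node; needs a reachable CRAN mirror):
# install.packages(missing_deps)
#
# The RHadoop packages are installed from downloaded source archives; the
# file names below are placeholders for the versions you download:
# install.packages("rmr2_<version>.tar.gz",  repos = NULL, type = "source")
# install.packages("rhdfs_<version>.tar.gz", repos = NULL, type = "source")
```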
Questions & Answers
• TRY: Download HDP at hortonworks.com
• LEARN: Applying Data Science using Apache Hadoop Training
• FOLLOW: Twitter: @hortonworks, Facebook: facebook.com/hortonworks
Further questions & comments:
paul@hortonworks.com
ravi@hortonworks.com
