Big Data Analysis With
RHadoop
David Chiu (Yu-Wei, Chiu)
@ML/DM Monday
2014/03/17
About Me
- Co-Founder of NumerInfo
- Ex-Trend Micro Engineer
- ywchiu-tw.appspot.com
R + Hadoop
Why Use RHadoop
- Scaling R: Hadoop enables R to do parallel computing
- No need to learn a new language: learning to use Java takes time
RHadoop Architecture
Three R packages bridge R and the Hadoop stack:
- rmr2 talks to MapReduce through the Hadoop Streaming API
- rhdfs talks to HDFS
- rhbase talks to HBase through the HBase Thrift gateway
Streaming vs. Native Java
- Streaming enables developers to write the mapper/reducer in any scripting language (R, Python, Perl)
- Mapper, reducer, and optional combiner processes are written to read from standard input and write to standard output
- A streaming job carries the additional overhead of starting a scripting VM
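To make the streaming contract concrete, below is a minimal hand-rolled wordcount mapper in R (a sketch of what a streaming mapper looks like without rmr2; the whitespace splitting is illustrative): it reads lines from standard input and writes tab-separated (word, 1) pairs to standard output.
#!/usr/bin/env Rscript
# Minimal streaming mapper: read stdin line by line,
# emit tab-separated (word, 1) pairs on stdout
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  for (word in unlist(strsplit(line, "\\s+"))) {
    if (nchar(word) > 0) cat(word, "\t1\n", sep = "")
  }
}
close(con)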
rmr2
- Writing MapReduce using R
- The mapreduce function:
  mapreduce(input, output, map, reduce, ...)
- Changelog
  rmr 3.0.0 (2014/02/10): 10x faster than rmr 2.3.0
  rmr 2.3.0 (2013/10/07): supports plyrmr
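A minimal sketch of a mapreduce call (assuming a working rmr2 setup; the sample data and the squaring map are illustrative only):
library(rmr2)
ints <- to.dfs(1:10)             # write sample data to HDFS
out  <- mapreduce(input = ints,  # map-only job: emit (v, v^2)
                  map = function(k, v) keyval(v, v^2))
from.dfs(out)                    # read the result back into R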
rhdfs
- Access HDFS from R
- Exchange data between R data frames and HDFS
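A short usage sketch (paths and file names are illustrative; hdfs.init() assumes HADOOP_CMD is set):
library(rhdfs)
hdfs.init()                                   # connect via HADOOP_CMD
hdfs.ls("/user/cloudera")                     # list an HDFS directory
hdfs.put("wc_input.txt", "/user/cloudera")    # local file -> HDFS
hdfs.get("/user/cloudera/wc_input.txt", ".")  # HDFS -> local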
rhbase
- Exchange data between R and HBase
- Uses the Thrift API
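A short usage sketch (assuming an HBase Thrift server is running on the default localhost:9090):
library(rhbase)
hb.init()          # connect to the HBase Thrift gateway
hb.list.tables()   # list the tables visible from R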
NEW! plyrmr
- Performs common data manipulation operations, as found in plyr and reshape2
- Provides a familiar plyr-like interface while hiding many of the MapReduce details
- plyr: tools for splitting, applying and combining data
RHadoop
Installation
Prerequisites
- R and related packages should be installed on each task node of the cluster
- A Hadoop cluster: CDH3 and higher, or Apache Hadoop 1.0.2 and higher (limited to MR1, not MR2); compatibility with MR2 starts from Apache Hadoop 2.2.0 or HDP 2
Getting Ready (Cloudera VM)
- Download: http://www.cloudera.com/content/cloudera-content/cloudera-docs/DemoVMs/Cloudera-QuickStart-VM/cloudera_quickstart_vm.html
- This VM runs CentOS 6.2, CDH4.4, R 3.0.1, and Java 1.6.0_32
Get RHadoop
- https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads
Installing rmr2 dependencies
- Make sure the packages are installed system-wide
$ sudo R
> install.packages(c("codetools", "Rcpp", "RJSONIO", "bitops", "digest",
    "functional", "stringr", "plyr", "reshape2", "rJava", "caTools"))
Install rmr2
$ wget --no-check-certificate https://raw.github.com/RevolutionAnalytics/rmr2/3.0.0/build/rmr2_3.0.0.tar.gz
$ sudo R CMD INSTALL rmr2_3.0.0.tar.gz
Downgrade Rcpp
- Older Rcpp releases are archived at http://cran.r-project.org/src/contrib/Archive/Rcpp/
Install Rcpp_0.11.0
$ wget --no-check-certificate http://cran.r-project.org/src/contrib/Archive/Rcpp/Rcpp_0.11.0.tar.gz
$ sudo R CMD INSTALL Rcpp_0.11.0.tar.gz
Install rmr2 again
$ sudo R CMD INSTALL rmr2_3.0.0.tar.gz
Install rhdfs
$ wget --no-check-certificate https://raw.github.com/RevolutionAnalytics/rhdfs/master/build/rhdfs_1.0.8.tar.gz
$ sudo HADOOP_CMD=/usr/bin/hadoop R CMD INSTALL rhdfs_1.0.8.tar.gz
Enable hdfs
> Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
> Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.4.0.jar")
> library(rmr2)
> library(rhdfs)
> hdfs.init()
Javareconf error
$ sudo R CMD javareconf
Rerun javareconf with the correct JAVA_HOME
$ echo $JAVA_HOME
$ sudo JAVA_HOME=/usr/java/jdk1.6.0_32 R CMD javareconf
MapReduce
With RHadoop
MapReduce
- mapreduce(input, output, map, reduce)
- Works like sapply, lapply, and tapply within R (see the sketch below)
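A hedged sketch of that analogy: a map-only job behaves like sapply, while map plus reduce behaves like tapply (grouped aggregation). The parity grouping here is illustrative only.
# sapply-like: transform every value
from.dfs(mapreduce(to.dfs(1:10),
                   map = function(k, v) keyval(v, v^2)))
# tapply-like: group values by parity, then sum each group
from.dfs(mapreduce(to.dfs(1:10),
                   map    = function(k, v) keyval(v %% 2, v),
                   reduce = function(k, vv) keyval(k, sum(vv))))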
Hello World – For Hadoop
http://www.rabidgremlin.com/data20/MapReduceWordCountOverview1.png
Move File Into HDFS
# Put data into HDFS
Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.4.0.jar")
library(rmr2)
library(rhdfs)
hdfs.init()
hdfs.mkdir("/user/cloudera/wordcount/data")
hdfs.put("wc_input.txt", "/user/cloudera/wordcount/data")
# Shell equivalent:
$ hadoop fs -mkdir /user/cloudera/wordcount/data
$ hadoop fs -put wc_input.txt /user/cloudera/wordcount/data
Wordcount Mapper
# rmr2 mapper in R: split each line into words, emit (word, 1)
map <- function(k, lines) {
  words.list <- strsplit(lines, '\\s')
  words <- unlist(words.list)
  return(keyval(words, 1))
}
# The equivalent native Java mapper:
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
Wordcount Reducer
# rmr2 reducer in R: sum the counts for each word
reduce <- function(word, counts) {
  keyval(word, sum(counts))
}
# The equivalent native Java reducer:
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Call Wordcount
hdfs.root <- 'wordcount'
hdfs.data <- file.path(hdfs.root, 'data')
hdfs.out <- file.path(hdfs.root, 'out')
wordcount <- function(input, output = NULL) {
  mapreduce(input = input, output = output,
            input.format = "text", map = map, reduce = reduce)
}
out <- wordcount(hdfs.data, hdfs.out)
Read data from HDFS
results <- from.dfs(out)
results$key[order(results$val, decreasing = TRUE)][1:10]
# Shell equivalent:
$ hadoop fs -cat /user/cloudera/wordcount/out/part-00000 | sort -k 2 -nr | head -n 10
MapReduce Benchmark
> a.time <- proc.time()
> small.ints2 = 1:100000
> result.normal = sapply(small.ints2, function(x) x^2)
> proc.time() - a.time
> b.time <- proc.time()
> small.ints = to.dfs(1:100000)
> result = mapreduce(input = small.ints, map = function(k,v) cbind(v, v^2))
> proc.time() - b.time
Result: sapply elapsed 0.982 seconds; mapreduce elapsed 102.755 seconds
Hadoop Latency
- HDFS stores your files as data chunks distributed across multiple datanodes
- M/R runs multiple programs, called mappers, on each of the data chunks or blocks; the (key, value) output of these mappers is combined into the result by reducers
- It takes time for mappers and reducers to be spawned on such a distributed system
Kmeans Clustering
Built-in machine learning methods cannot be applied directly in a MapReduce program:
kcluster <- kmeans(mydata, 4, iter.max = 10)
Kmeans in MapReduce Style
kmeans = function(points, ncenters, iterations = 10, distfun = NULL) {
  if (is.null(distfun))
    distfun = function(a, b) norm(as.matrix(a - b), type = 'F')
  newCenters = kmeans.iter(points, distfun, ncenters = ncenters)
  # iteratively choose new centers
  for (i in 1:iterations) {
    newCenters = kmeans.iter(points, distfun, centers = newCenters)
  }
  newCenters
}
kmeans.iter = function(points, distfun, ncenters = dim(centers)[1], centers = NULL) {
  from.dfs(
    mapreduce(input = points,
              map =
                if (is.null(centers)) {
                  # give random point as sample
                  function(k, v) keyval(sample(1:ncenters, 1), v)
                } else {
                  # find center of minimum distance
                  function(k, v) {
                    distances = apply(centers, 1, function(c) distfun(c, v))
                    keyval(centers[which.min(distances), ], v)
                  }
                },
              reduce = function(k, vv) keyval(NULL, apply(do.call(rbind, vv), 2, mean))),
    to.data.frame = T)
}
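A hypothetical invocation of the sketch above (this code follows rmr-1.x-era conventions, under which to.dfs can store a list of keyval pairs, each value being one point; the random data is illustrative only):
# 100 random 2-D points, 4 clusters, default 10 iterations
pts <- to.dfs(lapply(1:100, function(i) keyval(i, rnorm(2))))
centers <- kmeans(pts, ncenters = 4)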
One More Thing…
NEW! plyrmr
- Performs common data manipulation operations, as found in plyr and reshape2
- Provides a familiar plyr-like interface while hiding many of the MapReduce details
- plyr: tools for splitting, applying and combining data
Install plyrmr dependencies
$ sudo yum install libxml2-devel
$ sudo yum install curl-devel
$ sudo R
> install.packages(c("RCurl", "httr"), dependencies = TRUE)
> install.packages("devtools", dependencies = TRUE)
> library(devtools)
> install_github("pryr", "hadley")
> install.packages(c("R.methodsS3", "hydroPSO"), dependencies = TRUE)
Install plyrmr
$ wget --no-check-certificate https://raw.github.com/RevolutionAnalytics/plyrmr/master/build/plyrmr_0.1.0.tar.gz
$ sudo R CMD INSTALL plyrmr_0.1.0.tar.gz
Transform in plyrmr
> data(mtcars)
> head(mtcars)
> transform(mtcars, carb.per.cyl = carb/cyl)
> library(plyrmr)
> output(input(mtcars), "/tmp/mtcars")
> as.data.frame(transform(input("/tmp/mtcars"), carb.per.cyl = carb/cyl))
> output(transform(input("/tmp/mtcars"), carb.per.cyl = carb/cyl), "/tmp/mtcars.out")
select and where
where(
  select(
    mtcars,
    carb.per.cyl = carb/cyl,
    .replace = FALSE),
  carb.per.cyl >= 1)
https://github.com/RevolutionAnalytics/plyrmr/blob/master/docs/tutorial.md
Group by
as.data.frame(
  select(
    group(
      input("/tmp/mtcars"),
      cyl),
    mean.mpg = mean(mpg)))
https://github.com/RevolutionAnalytics/plyrmr/blob/master/docs/tutorial.md
Reference
- https://github.com/RevolutionAnalytics/RHadoop/wiki
- http://www.slideshare.net/RevolutionAnalytics/rhadoop-r-meets-hadoop
- http://www.slideshare.net/Hadoop_Summit/enabling-r-on-hadoop
Contact
- Website: ywchiu-tw.appspot.com
- Email: david@numerinfo.com, tr.ywchiu@gmail.com
- Company: numerinfo.com