Big Data Analysis With
RHadoop
David Chiu (Yu-Wei, Chiu)
@ML/DM Monday
2014/03/17
About Me
- Co-Founder of NumerInfo
- Ex-Trend Micro Engineer
- ywchiu-tw.appspot.com
R + Hadoop
Why Use RHadoop
- Scaling R: Hadoop enables R to do parallel computing
- No need to learn a new language: learning to use Java takes time
RHadoop Architecture
Three R packages bridge R and the Hadoop stack:
- rmr2 talks to MapReduce through the Hadoop Streaming API
- rhdfs talks to HDFS
- rhbase talks to HBase through the HBase Thrift gateway
Streaming vs. Native Java
- Streaming enables developers to write the mapper/reducer in any scripting language (R, Python, Perl)
- Mapper, reducer, and optional combiner processes are written to read from standard input and write to standard output
- A streaming job carries the additional overhead of starting a scripting VM
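To make the streaming contract concrete, below is a minimal hand-rolled wordcount mapper in R (a sketch of what a streaming mapper looks like without rmr2; the whitespace splitting is illustrative): it reads lines from standard input and writes tab-separated (word, 1) pairs to standard output.
#!/usr/bin/env Rscript
# Minimal streaming mapper: read stdin line by line,
# emit tab-separated (word, 1) pairs on stdout
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  for (word in unlist(strsplit(line, "\\s+"))) {
    if (nchar(word) > 0) cat(word, "\t1\n", sep = "")
  }
}
close(con)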
rmr2
- Writing MapReduce using R
- The mapreduce function:
  mapreduce(input, output, map, reduce, ...)
- Changelog
  rmr 3.0.0 (2014/02/10): 10x faster than rmr 2.3.0
  rmr 2.3.0 (2013/10/07): supports plyrmr
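A minimal sketch of a mapreduce call (assuming a working rmr2 setup; the sample data and the squaring map are illustrative only):
library(rmr2)
ints <- to.dfs(1:10)             # write sample data to HDFS
out  <- mapreduce(input = ints,  # map-only job: emit (v, v^2)
                  map = function(k, v) keyval(v, v^2))
from.dfs(out)                    # read the result back into R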
rhdfs
- Access HDFS from R
- Exchange data between R data frames and HDFS
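A short usage sketch (paths and file names are illustrative; hdfs.init() assumes HADOOP_CMD is set):
library(rhdfs)
hdfs.init()                                   # connect via HADOOP_CMD
hdfs.ls("/user/cloudera")                     # list an HDFS directory
hdfs.put("wc_input.txt", "/user/cloudera")    # local file -> HDFS
hdfs.get("/user/cloudera/wc_input.txt", ".")  # HDFS -> local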
rhbase
- Exchange data between R and HBase
- Uses the Thrift API
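A short usage sketch (assuming an HBase Thrift server is running on the default localhost:9090):
library(rhbase)
hb.init()          # connect to the HBase Thrift gateway
hb.list.tables()   # list the tables visible from R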
NEW! plyrmr
- Performs common data manipulation operations, as found in plyr and reshape2
- Provides a familiar plyr-like interface while hiding many of the MapReduce details
- plyr: tools for splitting, applying and combining data
RHadoop
Installation
Prerequisites
- R and related packages should be installed on each task node of the cluster
- A Hadoop cluster: CDH3 and higher, or Apache Hadoop 1.0.2 and higher (limited to MR1, not MR2); compatibility with MR2 starts from Apache Hadoop 2.2.0 or HDP 2
Getting Ready (Cloudera VM)
- Download: http://www.cloudera.com/content/cloudera-content/cloudera-docs/DemoVMs/Cloudera-QuickStart-VM/cloudera_quickstart_vm.html
- This VM runs CentOS 6.2, CDH4.4, R 3.0.1, and Java 1.6.0_32
Get RHadoop
- https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads
Installing rmr2 dependencies
- Make sure the packages are installed system-wide
$ sudo R
> install.packages(c("codetools", "Rcpp", "RJSONIO", "bitops", "digest",
    "functional", "stringr", "plyr", "reshape2", "rJava", "caTools"))
Install rmr2
$ wget --no-check-certificate https://raw.github.com/RevolutionAnalytics/rmr2/3.0.0/build/rmr2_3.0.0.tar.gz
$ sudo R CMD INSTALL rmr2_3.0.0.tar.gz
Downgrade Rcpp
- Older Rcpp releases are archived at http://cran.r-project.org/src/contrib/Archive/Rcpp/
Install Rcpp_0.11.0
$ wget --no-check-certificate http://cran.r-project.org/src/contrib/Archive/Rcpp/Rcpp_0.11.0.tar.gz
$ sudo R CMD INSTALL Rcpp_0.11.0.tar.gz
Install rmr2 again
$ sudo R CMD INSTALL rmr2_3.0.0.tar.gz
Install rhdfs
$ wget --no-check-certificate https://raw.github.com/RevolutionAnalytics/rhdfs/master/build/rhdfs_1.0.8.tar.gz
$ sudo HADOOP_CMD=/usr/bin/hadoop R CMD INSTALL rhdfs_1.0.8.tar.gz
Enable hdfs
> Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
> Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.4.0.jar")
> library(rmr2)
> library(rhdfs)
> hdfs.init()
Javareconf error
$ sudo R CMD javareconf
Rerun javareconf with the correct JAVA_HOME
$ echo $JAVA_HOME
$ sudo JAVA_HOME=/usr/java/jdk1.6.0_32 R CMD javareconf
MapReduce
With RHadoop
MapReduce
- mapreduce(input, output, map, reduce)
- Works like sapply, lapply, and tapply within R (see the sketch below)
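A hedged sketch of that analogy: a map-only job behaves like sapply, while map plus reduce behaves like tapply (grouped aggregation). The parity grouping here is illustrative only.
# sapply-like: transform every value
from.dfs(mapreduce(to.dfs(1:10),
                   map = function(k, v) keyval(v, v^2)))
# tapply-like: group values by parity, then sum each group
from.dfs(mapreduce(to.dfs(1:10),
                   map    = function(k, v) keyval(v %% 2, v),
                   reduce = function(k, vv) keyval(k, sum(vv))))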
Hello World – For Hadoop
http://www.rabidgremlin.com/data20/MapReduceWordCountOverview1.png
Move File Into HDFS
# Put data into HDFS
Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.4.0.jar")
library(rmr2)
library(rhdfs)
hdfs.init()
hdfs.mkdir("/user/cloudera/wordcount/data")
hdfs.put("wc_input.txt", "/user/cloudera/wordcount/data")
# Shell equivalent:
$ hadoop fs -mkdir /user/cloudera/wordcount/data
$ hadoop fs -put wc_input.txt /user/cloudera/wordcount/data
Wordcount Mapper
# rmr2 mapper in R: split each line into words, emit (word, 1)
map <- function(k, lines) {
  words.list <- strsplit(lines, '\\s')
  words <- unlist(words.list)
  return(keyval(words, 1))
}
# The equivalent native Java mapper:
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
Wordcount Reducer
# rmr2 reducer in R: sum the counts for each word
reduce <- function(word, counts) {
  keyval(word, sum(counts))
}
# The equivalent native Java reducer:
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Call Wordcount
hdfs.root <- 'wordcount'
hdfs.data <- file.path(hdfs.root, 'data')
hdfs.out <- file.path(hdfs.root, 'out')
wordcount <- function(input, output = NULL) {
  mapreduce(input = input, output = output,
            input.format = "text", map = map, reduce = reduce)
}
out <- wordcount(hdfs.data, hdfs.out)
Read data from HDFS
results <- from.dfs(out)
results$key[order(results$val, decreasing = TRUE)][1:10]
# Shell equivalent:
$ hadoop fs -cat /user/cloudera/wordcount/out/part-00000 | sort -k 2 -nr | head -n 10
MapReduce Benchmark
> a.time <- proc.time()
> small.ints2 = 1:100000
> result.normal = sapply(small.ints2, function(x) x^2)
> proc.time() - a.time
> b.time <- proc.time()
> small.ints = to.dfs(1:100000)
> result = mapreduce(input = small.ints, map = function(k,v) cbind(v, v^2))
> proc.time() - b.time
Result: sapply elapsed 0.982 seconds; mapreduce elapsed 102.755 seconds
Hadoop Latency
- HDFS stores your files as data chunks distributed across multiple datanodes
- M/R runs multiple programs, called mappers, on each of the data chunks or blocks; the (key, value) output of these mappers is combined into the result by reducers
- It takes time for mappers and reducers to be spawned on such a distributed system
Kmeans Clustering
Built-in machine learning methods cannot be applied directly in a MapReduce program:
kcluster <- kmeans(mydata, 4, iter.max = 10)
Kmeans in MapReduce Style
kmeans = function(points, ncenters, iterations = 10, distfun = NULL) {
  if (is.null(distfun))
    distfun = function(a, b) norm(as.matrix(a - b), type = 'F')
  newCenters = kmeans.iter(points, distfun, ncenters = ncenters)
  # iteratively choose new centers
  for (i in 1:iterations) {
    newCenters = kmeans.iter(points, distfun, centers = newCenters)
  }
  newCenters
}
kmeans.iter = function(points, distfun, ncenters = dim(centers)[1], centers = NULL) {
  from.dfs(
    mapreduce(input = points,
              map =
                if (is.null(centers)) {
                  # give random point as sample
                  function(k, v) keyval(sample(1:ncenters, 1), v)
                } else {
                  # find center of minimum distance
                  function(k, v) {
                    distances = apply(centers, 1, function(c) distfun(c, v))
                    keyval(centers[which.min(distances), ], v)
                  }
                },
              reduce = function(k, vv) keyval(NULL, apply(do.call(rbind, vv), 2, mean))),
    to.data.frame = T)
}
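A hypothetical invocation of the sketch above (this code follows rmr-1.x-era conventions, under which to.dfs can store a list of keyval pairs, each value being one point; the random data is illustrative only):
# 100 random 2-D points, 4 clusters, default 10 iterations
pts <- to.dfs(lapply(1:100, function(i) keyval(i, rnorm(2))))
centers <- kmeans(pts, ncenters = 4)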
One More Thing…
NEW! plyrmr
- Performs common data manipulation operations, as found in plyr and reshape2
- Provides a familiar plyr-like interface while hiding many of the MapReduce details
- plyr: tools for splitting, applying and combining data
Install plyrmr dependencies
$ sudo yum install libxml2-devel
$ sudo yum install curl-devel
$ sudo R
> install.packages(c("RCurl", "httr"), dependencies = TRUE)
> install.packages("devtools", dependencies = TRUE)
> library(devtools)
> install_github("pryr", "hadley")
> install.packages(c("R.methodsS3", "hydroPSO"), dependencies = TRUE)
Install plyrmr
$ wget --no-check-certificate https://raw.github.com/RevolutionAnalytics/plyrmr/master/build/plyrmr_0.1.0.tar.gz
$ sudo R CMD INSTALL plyrmr_0.1.0.tar.gz
Transform in plyrmr
> data(mtcars)
> head(mtcars)
> transform(mtcars, carb.per.cyl = carb/cyl)
> library(plyrmr)
> output(input(mtcars), "/tmp/mtcars")
> as.data.frame(transform(input("/tmp/mtcars"), carb.per.cyl = carb/cyl))
> output(transform(input("/tmp/mtcars"), carb.per.cyl = carb/cyl), "/tmp/mtcars.out")
select and where
where(
  select(
    mtcars,
    carb.per.cyl = carb/cyl,
    .replace = FALSE),
  carb.per.cyl >= 1)
https://github.com/RevolutionAnalytics/plyrmr/blob/master/docs/tutorial.md
Group by
as.data.frame(
  select(
    group(
      input("/tmp/mtcars"),
      cyl),
    mean.mpg = mean(mpg)))
https://github.com/RevolutionAnalytics/plyrmr/blob/master/docs/tutorial.md
Reference
- https://github.com/RevolutionAnalytics/RHadoop/wiki
- http://www.slideshare.net/RevolutionAnalytics/rhadoop-r-meets-hadoop
- http://www.slideshare.net/Hadoop_Summit/enabling-r-on-hadoop
Contact
- Website: ywchiu-tw.appspot.com
- Email: david@numerinfo.com, tr.ywchiu@gmail.com
- Company: numerinfo.com