Is Spark Replacing Hadoop

© 2016 MapR Technologies
Is Spark Replacing MapReduce? Hadoop?
Keys Botzum, Senior Principal Technologist
March 2016
Last update: March 29, 2016

Companies With Spark on MapR In Production
•  Fortune 500 Global Telecom
•  Fortune 500 Health Care
•  Global Financial Services

Cisco: Security Intelligence Operations
•  Sensor data lands in Hadoop
•  Streaming for real-time detection and threat alerts
•  Data next processed on GraphX and Mahout to build threat detection models and accelerated reporting
•  Additional SQL querying for end customer reporting and threat detection

Circa 2014 …

Next-Gen Genomics
•  Existing process takes several weeks to align chemical compounds with genes
•  ADAM on Spark allows realignment in a few hours
•  Geneticists can minimize engineering dependency

Is Spark replacing Hadoop?

How about the product manager’s favorite tool: a checkbox list!
•  DAG
•  Persistent Store
•  Machine Learning
•  Graph
•  Streaming
•  Batch SQL
•  Interactive SQL
•  Security
•  Resource Management
•  Multitenancy
•  Others

Wait. What’s Hadoop?
Pluggable data parallel framework
HDFS and HBase API based Persistent Store
•  Proven MapReduce, Hive, Pig
•  YARN introduces pluggability
•  Allows for multiple frameworks
•  Standard for scale-out big data store
•  Stores data as files and tables
•  Secure
•  Includes resource management

Spark and MapReduce are …
•  Scalable frameworks for executing custom code on a cluster
•  Nodes in the cluster work independently to process fragments of data and also combine those fragments together when appropriate to yield a final result
•  Can tolerate loss of a node during a computation
•  Require a distributed storage layer for common data view

What’s MapReduce?
•  Map
–  Loading of the data and defining a set of keys
•  Reduce
–  Collects the organized key-based data to process and output
•  Performance can be tweaked based on known details of your source files and cluster shape (node size, total number of nodes)

MapReduce Processing Model
•  Define mappers
•  Shuffling is automatic
•  Define reducers
•  For complex work, chain jobs together
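The four steps above can be sketched in plain Java. This is a single-process imitation for illustration only, not the Hadoop API: the class name MapReduceSketch and its methods are hypothetical, and the "shuffle" that Hadoop performs automatically across the cluster is written out by hand here so the data flow is visible.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a single-process imitation of map, shuffle, and reduce.
// Class and method names are hypothetical; this is not the Hadoop API.
class MapReduceSketch {

    // "Mapper": turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // "Shuffle": group all values by key. Hadoop does this step for you.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // "Reducer": collapse the values collected for one key into a single result.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Run the three phases in order over a small in-memory input.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(mapped).entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {be=2, not=1, or=1, to=2}
        System.out.println(wordCount(List.of("to be or not", "to be")));
    }
}
```

Chaining jobs, in this picture, means feeding the output of one wordCount-style pass in as the input lines of the next pass. In real MapReduce each such hand-off goes through files on the distributed store.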

MapReduce: The Good
•  Built-in fault tolerance
•  Optimized IO path
•  Scalable
•  Developer focuses on Map/Reduce, not infrastructure
•  Simple(?) API

MapReduce: The Bad
•  Batch oriented
•  Optimized for disk IO
–  Doesn’t leverage memory well
–  Iterative algorithms go through disk IO path again and again
•  Primitive API
–  Developers have to build on a very simple abstraction
–  Key/Value in/out
–  Even basic things like join require extensive code
•  Result is often many files that need to be combined appropriately

What’s Spark?
Batch Interactive Streaming Framework
•  Powerful API
•  Leverages memory aggressively
•  Batch and streaming
Pluggable Persistent Store
•  MapR-FS, HDFS
•  MapR-DB, HBase, Cassandra
•  MapR-Streams, Kafka
•  S3

Apache Spark
•  spark.apache.org
•  Originally developed in 2009 in UC Berkeley’s AMPLab
•  Fully open sourced in 2010; now at the Apache Software Foundation

Spark: Ease of Use and Performance
•  Easy to Develop
–  Rich APIs in Java, Scala, Python, R
–  Interactive shell
•  Fast to Run
–  General execution graphs
–  In-memory storage
Less code, simpler code

Resilient Distributed Datasets (RDD)
•  Spark revolves around RDDs
•  Fault-tolerant, read-only collection of elements that can be operated on in parallel
•  Cached in memory or on disk
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
A newer API is based around DataFrames, but for this presentation the difference isn’t important

RDD Operations - Expressive
•  Transformations
–  Creation of a new RDD dataset from an existing one
•  map, filter, distinct, union, sample, groupByKey, join, reduceByKey, etc…
•  Actions
–  Return a value after running a computation
•  collect, count, first, takeSample, reduce, foreach, etc…
Check the documentation for a complete list
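As a rough analogy for the split between transformations and actions, here is a sketch using the JDK's java.util.stream rather than Spark itself. Intermediate stream operations stand in for transformations (each returns a new pipeline), and terminal operations stand in for actions (each returns a value). The class and method names are hypothetical, and a stream runs in a single JVM with none of an RDD's distribution or fault tolerance.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Analogy only: java.util.stream is a single-JVM library, not Spark.
// Intermediate stream operations play the role of RDD transformations,
// terminal operations play the role of RDD actions.
class TransformationsVsActions {

    // "Transformations": each call returns a new Stream describing more work.
    static Stream<String> longDistinctWords(List<String> words, int minLength) {
        return words.stream()
                .map(String::toLowerCase)
                .filter(w -> w.length() >= minLength)
                .distinct();
    }

    // "Action" similar in spirit to count(): returns a value to the caller.
    static long countWords(List<String> words, int minLength) {
        return longDistinctWords(words, minLength).count();
    }

    // "Action" similar in spirit to collect(): materializes the elements.
    static List<String> collectWords(List<String> words, int minLength) {
        return longDistinctWords(words, minLength).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> words = List.of("Spark", "spark", "Hadoop", "Pig", "Hive");
        System.out.println(countWords(words, 4));   // prints 3
        System.out.println(collectWords(words, 4)); // prints [spark, hadoop, hive]
    }
}
```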

Easy: Example – Word Count
•  Hadoop MapReduce Java
// Imports assumed: java.io.IOException, java.util.Iterator, java.util.StringTokenizer,
// org.apache.hadoop.io.*, org.apache.hadoop.mapred.* (classic "mapred" API)
// Mapper: emits (word, 1) for every token of each input line
public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}
// Reducer: sums the counts collected for each word
public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
•  Spark Java
// Assumes sc is an existing JavaSparkContext (Spark 1.x Java API)
JavaRDD<String> textFile = sc.textFile("hdfs://...");
// Split each line into words
JavaRDD<String> words = textFile.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
});
// Pair each word with a count of 1
JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
  public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});
// Sum the counts for each word
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
  public Integer call(Integer a, Integer b) { return a + b; }
});
counts.saveAsTextFile("hdfs://...");
•  Spark Scala
val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Source: http://spark.apache.org/examples.html

Faster for Iterative: PageRank Performance
[Bar chart: iteration time (s) by number of machines]
•  Hadoop: 171 s on 30 machines, 80 s on 60 machines
•  Spark: 23 s on 30 machines, 14 s on 60 machines

Spark vs. MapReduce
•  Spark is faster than MR for iterative algorithms whose data fits in memory
•  Spark code is easier to write and easier to understand than MR
–  Your programming is closer to the correct abstraction
•  Spark supports both batch and streaming models
•  Advantage Spark
–  Caution: not all applications run faster on Spark, and Spark may have limitations for some scenarios

•  Is Spark replacing Hadoop?
•  Is Spark replacing MapReduce?
–  Quite possibly... with time... with caveats

Hadoop is more than MapReduce
•  Spark: Unified Easy Batch Interactive Streaming Framework
–  Needs a resource manager
•  Hadoop: Pluggable data parallel framework; HDFS and HBase Persistent Store
–  Includes a resource manager (YARN)

Hadoop Supports so Much
•  Alternative batch models: Pig, Cascading, Spark
•  Machine learning: Mahout, Spark MLlib
•  SQL: Hive, Drill, Hive on Tez, Impala, Spark SQL
•  Stream processing: Storm, Flink, Spark Streaming, DataTorrent
•  ETL: Sqoop, Flume
•  Storage: file (HDFS/MapR-FS), table (HBase/MapR-DB/Accumulo), messaging (Kafka/MapR-Streams)
•  Data exploration: Hue
•  And too many excellent commercial tools to list
•  Hypothesis:
–  Infrastructure and data tend to be sticky while execution frameworks evolve rapidly
–  Hadoop’s infrastructure and storage support a vigorous and growing ecosystem of “competing” execution engines

Perspective
Unified Easy Batch Interactive Streaming Framework
Pluggable data parallel framework
HDFS and HBase Persistent Store
•  Interactive SQL (Drill, Impala, Hive.next)
•  Streaming (Flink, Storm, DataTorrent)
•  RDBMS (e.g., SpliceMachine)
Ecosystem
•  SLA (YARN resource reservation, distro mgmt tools, Pepperdata, …)
•  Security (Drill Views, Ranger, Sentry, BlueTalon, …)
•  Data wrangling, discovery and governance (Trifacta, Paxata, Waterline, …)

Perspective
Unified Easy Batch Interactive Streaming Framework
•  NoSQL/Search (Cassandra, ES)
•  In Mem (SAP HANA, MemSQL)
•  RDBMS (MySQL, Oracle, etc.)
•  Hadoop (HBase, HDFS)
Ecosystem/Environments
•  Resource Management: YARN, Mesos, Kubernetes
•  Deployment: Private OpenStack, Public Cloud, Hybrid

Which is More Realistic?
•  Spark becomes primary execution framework
•  Hadoop remains primary storage and execution framework
What about classic applications and data sharing?

•  Is Spark replacing Hadoop?
–  Seems improbable
–  Hadoop grows to embrace new execution frameworks
•  Is Spark replacing MapReduce?
–  Quite possibly... with time... with caveats

MapR Platform Services: Open API Architecture Assures Interoperability, Avoids Lock-in
•  HDFS API
•  POSIX NFS
•  SQL, HBase API
•  JSON API
•  Kafka API

Q&A
maprtech
kbotzum@mapr.com
Engage with us!
MapR
maprtech
mapr-technologies

References
•  Spark vs. MapReduce:
–  https://www.mapr.com/blog/apache-spark-vs-mapreduce-whiteboard-walkthrough
–  http://www.vldb.org/pvldb/vol8/p2110-shi.pdf
–  http://aptuz.com/blog/is-apache-spark-going-to-replace-hadoop/
•  Spark: http://spark.apache.org/
•  Spark on MapR: http://maprdocs.mapr.com/51/index.html#Spark/Spark_26984599.html
