In Memory Analytics - Apache Spark
Ravi
Agenda
 Overview of Spark
 Spark with Hadoop MapReduce
 Spark Elements and Operations
 Spark Cluster Overview
 Spark Examples
 Spark Stack Extensions:
 Shark
 Streaming
 MLlib
 GraphX
In Memory Analytics
• In-memory analytics is an approach to querying data when it resides in a
computer’s random access memory (RAM), as opposed to querying data
that is stored on physical disks.
• This results in vastly shortened query response times, allowing business
intelligence (BI) and analytic applications to support faster business
decisions.
• As the cost of RAM declines, in-memory analytics is becoming feasible
for many businesses.
• BI and analytic applications have long supported caching data in RAM, but
older 32-bit operating systems provided only 4 GB of addressable memory.
• Newer 64-bit operating systems, with up to 1 terabyte (TB) of addressable memory (and perhaps more in the future), have made it possible to cache large volumes of data, potentially an entire data warehouse or data mart, in a computer’s RAM.
What is Spark
- Lightning-Fast Cluster Computing
 Not a modified version of Hadoop
 Separate, fast, MapReduce-like engine
 In-memory data storage for very fast iterative queries
 General execution graphs and powerful optimizations
 Up to 40x faster than Hadoop
 Spark beats Hadoop by providing primitives for in-memory cluster computing, thereby avoiding the I/O bottleneck between the individual jobs of an iterative MapReduce workflow that repeatedly performs computations on the same working set.
 Compatible with Hadoop’s storage APIs
 Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.
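To make the in-memory idea above concrete, here is a minimal sketch in Scala (the HDFS path and app name are placeholders; "local[4]" simply means four local threads). The error lines are read once, cached in cluster memory, and then queried repeatedly at RAM speed instead of being re-read from disk:

import org.apache.spark.SparkContext

object LogMining {
  def main(args: Array[String]) {
    val sc = new SparkContext("local[4]", "LogMining")
    // Load once, keep only the error lines, and pin them in memory
    val errors = sc.textFile("hdfs://.../logs").filter(_.contains("ERROR")).cache()
    println(errors.count())                                // first action materializes the cache
    println(errors.filter(_.contains("timeout")).count())  // later queries hit RAM, not disk
    sc.stop()
  }
}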
Quick Recap: Hadoop Ecosystem
Spark Programming Model
 Key idea: Resilient Distributed Dataset (RDD)
 Distributed collections of objects that can be cached in memory across cluster nodes
 Manipulated through various parallel operations
 Automatically rebuilt on failures
 Types of RDD:
 Parallelized collections: Take an existing Scala collection and run functions on it in
parallel
 scala> val distData = sc.parallelize(data)
 distData: spark.RDD[Int] = spark.ParallelCollection@10d13e3e
 Hadoop datasets : Run functions on each record of a file in Hadoop distributed file
system or any other storage system supported by Hadoop
 scala> val distFile = sc.textFile("data.txt")
 distFile: spark.RDD[String] = spark.HadoopRDD@1d4cee08
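As a brief sketch (the numbers and file name are illustrative), both kinds of RDD then support the same parallel operations, and either can be cached:

val distData = sc.parallelize(1 to 10000)        // parallelized collection
println(distData.reduce(_ + _))                  // parallel sum across the cluster

val distFile = sc.textFile("data.txt").cache()   // Hadoop dataset, cached in memory
println(distFile.map(_.length).reduce(_ + _))    // total characters; lost partitions are rebuilt from lineage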
Automatic Parallelization of Complex Flows
 When constructing a complex pipeline of MapReduce jobs, the task of correctly parallelizing the sequence of jobs is left to you. Thus, a scheduler tool such as Apache Oozie is often required to carefully construct this sequence.
 With Spark, a whole series of individual tasks is expressed as a single program flow that is lazily evaluated, so that the system has a complete picture of the execution graph.
 This approach allows the core scheduler to correctly map the dependencies across different stages in the application, and automatically parallelize the flow of operators without user intervention.
For example, consider the following (schematic) job:
rdd1.map(splitlines).filter("ERROR")
rdd2.map(splitlines).groupBy(key)
rdd2.join(rdd1, key).take(10)
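A runnable rendering of that schematic job might look like the sketch below (the paths, the tab-separated format, and the splitLines helper are assumptions for illustration). Nothing executes until take(10); the entire graph is built lazily first, which is what gives the scheduler its complete picture:

import org.apache.spark.SparkContext._   // pair-RDD operations such as join

// assumed helper: split a log line into a (key, rest-of-line) pair
def splitLines(line: String): (String, String) = {
  val parts = line.split("\t", 2)
  (parts(0), if (parts.length > 1) parts(1) else "")
}

val rdd1 = sc.textFile("hdfs://.../log1").map(splitLines).filter(_._2.contains("ERROR"))
val rdd2 = sc.textFile("hdfs://.../log2").map(splitLines)
rdd2.join(rdd1).take(10).foreach(println)   // one action; Spark schedules all stages together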
Spark vs Hadoop
Spark is a high-speed cluster computing system, compatible with Hadoop, that can outperform it by up to 100x thanks to its ability to perform computations in memory.
Transformations (e.g. map, filter, groupBy):
Create a new dataset from an existing one
Actions (e.g. count, collect, save):
Return a value to the driver program after running a computation on the dataset
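A small sketch of the distinction (file names are placeholders): the transformations below only build up lineage, and work happens when an action runs.

import org.apache.spark.SparkContext._

val lines  = sc.textFile("data.txt")       // nothing computed yet
val counts = lines.flatMap(_.split(" "))   // transformation: new RDD
                  .map(word => (word, 1))  // transformation
                  .reduceByKey(_ + _)      // transformation
println(counts.count())                    // action: triggers the whole computation
counts.saveAsTextFile("counts_out")        // action: writes the results out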
Spark Elements
 Application: User program built on Spark. Consists of a driver program and executors on the cluster.
 Driver program: The process running the main() function of the application and creating the SparkContext.
 Cluster manager: An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN).
 Worker node: Any node that can run application code in the cluster.
 Executor: A process launched for an application on a worker node that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
 Task: A unit of work that will be sent to one executor.
 Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs.
 Stage: Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.
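These elements show up even in the smallest application; a sketch (the master URL, app name, and path are placeholders):

import org.apache.spark.SparkContext

object MyApp {                                      // the application
  def main(args: Array[String]) {
    // the driver program: runs main() and creates the SparkContext, which
    // asks the cluster manager for executors on the worker nodes
    val sc = new SparkContext("spark://master:7077", "MyApp")
    val n = sc.textFile("hdfs://.../input").count() // one action => one job, divided into stages of tasks
    println(n)
    sc.stop()
  }
}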
Spark Cluster Overview
Cluster Manager Types
• Standalone – a simple cluster manager included with Spark that makes it
easy to set up a cluster.
• Apache Mesos – a general cluster manager that can also run Hadoop
MapReduce and service applications.
• Hadoop YARN – the resource manager in Hadoop 2.
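The cluster manager is selected through the master URL handed to the SparkContext; a sketch (host names are placeholders, and an application uses exactly one of these):

import org.apache.spark.SparkContext

val sc = new SparkContext("spark://master:7077", "app")  // Standalone
// "mesos://master:5050"  -> Apache Mesos
// "yarn-client"          -> Hadoop YARN (cluster located via the Hadoop config)
// "local[4]"             -> local testing on four threads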
Mesos (Dynamic Resource Sharing for Clusters) Run Modes
 Spark can run over Mesos in two modes: “fine-grained” and “coarse-grained”.
 In fine-grained mode, which is the default, each Spark task runs as a separate Mesos task.
 This allows multiple instances of Spark (and other frameworks) to share machines at
a very fine granularity, where each application gets more or fewer machines as it
ramps up, but it comes with an additional overhead in launching each task, which
may be inappropriate for low-latency applications (e.g. interactive queries or serving
web requests).
 Coarse-grained mode will instead launch only one long-running Spark task on each Mesos machine, and dynamically schedule its own “mini-tasks” within it.
 The benefit is much lower startup overhead, but at the cost of reserving the Mesos
resources for the complete duration of the application.
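In Spark releases of this era the mode is selected with the spark.mesos.coarse configuration property; a minimal sketch (the Mesos master URL is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("mesos://master:5050")
  .setAppName("app")
  .set("spark.mesos.coarse", "true")  // coarse-grained: one long-running Spark task per machine;
                                      // leave unset (the default) for fine-grained mode
val sc = new SparkContext(conf)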
Task Scheduler
• Runs general DAGs
• Pipelines functions within a stage
• Cache-aware data reuse & locality
• Partitioning-aware to avoid shuffles
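A sketch of what the cache- and partitioning-awareness buys (the input format and the parseKeyValue helper are assumptions for illustration):

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._

// assumed helper: parse a "key,value" line
def parseKeyValue(line: String): (String, Int) = {
  val Array(k, v) = line.split(",", 2)
  (k, v.trim.toInt)
}

val pairs = sc.textFile("hdfs://.../kv").map(parseKeyValue)
              .partitionBy(new HashPartitioner(8))  // fix the partitioning once...
              .cache()                              // ...and keep the partitions in memory
val sums = pairs.reduceByKey(_ + _)   // no shuffle: the existing partitioner already matches
val self = pairs.join(pairs)          // co-partitioned join also avoids a shuffle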
Spark Stack Extensions
Spark powers a stack of high-level tools, including:
 Shark for SQL
 MLlib for machine learning
 GraphX for graph processing
 Spark Streaming
You can combine these frameworks seamlessly in the same application.
Shark
Shark makes Hive faster and more powerful.
 Shark is a new data analysis system that marries query
processing with complex analytics on large clusters
 Shark is an open source distributed SQL query engine for
Hadoop data. It brings state-of-the-art performance and
advanced analytics to Hive users.
 Speed: Run Hive queries up to 100x faster in memory, or 10x on disk.
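A sketch of the mix of SQL and Spark code this enabled, assuming Shark's documented SharkContext API of the time (sql2rdd; treat the exact names as an assumption):

// assumes a shark.SharkContext named sharkCtx, created much like a SparkContext
sharkCtx.sql("CREATE TABLE logs (level STRING, msg STRING)")
val errors = sharkCtx.sql2rdd("SELECT msg FROM logs WHERE level = 'ERROR'")
println(errors.count())   // the query result is an ordinary RDD, so SQL mixes with Spark code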
Streaming
Spark Streaming makes it easy to build scalable fault-tolerant
streaming applications.
 Spark Streaming brings Spark's language-integrated API to stream processing, letting you write streaming applications the same way you write batch jobs.
 It supports both Java and Scala.
 Spark Streaming lets you reuse the same code for batch processing, join streams against historical data, or run ad-hoc queries on stream state.
 Spark Streaming can read data from HDFS, Flume, Kafka, Twitter and ZeroMQ.
 Since Spark Streaming is built on top of Spark, users can apply Spark's built-in machine learning (MLlib) and graph processing (GraphX) algorithms on data streams.
Counting tweets on a sliding window:
TwitterUtils.createStream(...)
  .filter(_.getText.contains("Spark"))
  .countByWindow(Seconds(5))

Find words with higher frequency than historic data:
stream.join(historicCounts).filter {
  case (word, (curCount, oldCount)) => curCount > oldCount
}
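The snippets above are fragments; a self-contained sketch of the same windowing pattern over a plain socket source (host, port, and window lengths are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(5))    // 5-second batches

val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" "))
                  .map(word => (word, 1))
                  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(5))
counts.print()                                      // 30-second window, sliding every 5 seconds

ssc.start()
ssc.awaitTermination()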
MLlib
MLlib is Apache Spark's scalable machine learning library.
 MLlib fits into Spark's APIs and interoperates with NumPy in
Python (starting in Spark 0.9). You can use any Hadoop data
source (e.g. HDFS, HBase, or local files), making it easy to plug
into Hadoop workflows.
Calling MLlib in Python (schematic):
points = spark.textFile("hdfs://...")
              .map(parsePoint)
model = KMeans.train(points)
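A corresponding Scala sketch, assuming the MLlib API of Spark 1.x (k, the iteration count, and the space-separated input format are illustrative):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.textFile("hdfs://.../points")
               .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
               .cache()
val model = KMeans.train(points, 10, 20)   // k = 10 clusters, 20 iterations
model.clusterCenters.foreach(println)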
GraphX
Unifying Graphs and Tables
 GraphX extends Spark's distributed, fault-tolerant collections API and interactive console with a new graph API that leverages recent advances in graph systems (e.g., GraphLab) to let users easily and interactively build, transform, and reason about graph-structured data at scale.
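A sketch of that unified view (the vertices and edges are invented for illustration):

import org.apache.spark.SparkContext._
import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
val graph    = Graph(vertices, edges)            // a graph built from two plain collections

val ranks = graph.pageRank(0.001).vertices       // graph computation yields a table...
ranks.join(vertices).collect().foreach(println)  // ...that joins back to the vertex names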
BDAS, the Berkeley Data Analytics Stack
https://amplab.cs.berkeley.edu/software/
BDAS is an open source software stack that integrates software components being built by the AMPLab to make sense of Big Data.
Software and Research Projects
 Shark - Hive and SQL on top of Spark
 MLbase - Machine Learning project on top of Spark
 BlinkDB - a massively parallel, approximate query engine built on top of Shark and Spark
 GraphX - a graph processing & analytics framework on top of Spark (GraphX has been merged into
Spark 0.9)
 Apache Mesos - Cluster management system that supports running Spark
 Tachyon - In memory storage system that supports running Spark
 Apache MRQL - A query processing and optimization system for large-scale, distributed data
analysis, built on top of Apache Hadoop, Hama, and Spark
 OpenDL - A deep learning algorithm library based on the Spark framework; newly started
 SparkR - R frontend for Spark
 Spark Job Server - REST interface for managing and submitting Spark jobs on the same cluster
Conclusion
 “Big data” is moving beyond one-pass batch jobs, to low-latency apps that need data sharing
 RDDs offer fault-tolerant sharing at memory speed
 Spark uses them to combine streaming, batch & interactive analytics in one system