Using Apache Spark
Bojan Babic
Apache Spark
spark.incubator.apache.org
github.com/apache/incubator-spark
user@spark.incubator.apache.org
The Spark Community
INTRODUCTION TO APACHE SPARK
What is Spark?
Fast and Expressive Cluster Computing System Compatible with Apache Hadoop
Efficient
• General execution graphs
• In-memory storage
Usable
• Rich APIs in Java, Scala, Python
• Interactive shell
Key Concepts
Resilient Distributed Datasets (RDDs)
• Collections of objects spread across a cluster, stored in RAM or on disk
• Built through parallel transformations
• Automatically rebuilt on failure
• In Hadoop terms an RDD corresponds to a partitioned dataset; elsewhere it is comparable to a dataframe
Operations (sketch below)
• Transformations (e.g. map, filter, groupBy)
• Actions (e.g. count, collect, save)
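A minimal sketch of the two kinds of operations, assuming a SparkContext sc as in the interactive shell (hypothetical data; transformations are lazy, only the action triggers computation):

  val nums  = sc.parallelize(1 to 1000)   // build an RDD through a parallel transformation
  val evens = nums.filter(_ % 2 == 0)     // transformation: recorded, not executed yet
  val strs  = evens.map(_.toString)       // transformation: still lazy
  println(strs.count())                   // action: runs the pipeline and returns 500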
Working With RDDs
[Diagram: transformations turn RDDs into new RDDs; an action finally returns a value]
textFile = sc.textFile("/var/log/hadoop/somelog")
linesWithSpark = textFile.filter(l => l.contains("Spark"))
linesWithSpark.count()
74
linesWithSpark.first()
# Apache Spark
Spark (batch) example
Load error messages from a log into memory, then interactively search for various patterns:
val sc = new SparkContext(config)
val lines = sc.textFile("hdfs://...")                    // base RDD
val messages = lines.filter(l => l.startsWith("ERROR"))  // transformed RDD
  .map(e => e.split("\t")(2))
messages.cache()                                         // keep the messages in memory
messages.filter(s => s.contains("mongo")).count()        // action
messages.filter(s => s.contains("500")).count()
[Chart: Scaling Down]
Spark Streaming example
val ssc = new StreamingContext(conf, Seconds(10))  // 10-second mini-batches
val kafkaParams = Map[String, String](...)
val messages = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc,
  kafkaParams,
  Map(topic -> threadsPerTopic),
  storageLevel = StorageLevel.MEMORY_ONLY_SER
).map(_._2)                                        // keep just the message value
val revenue = messages.filter(m => m.contains("[PURCHASE]"))
  .map(p => p.split("\t")(4).toDouble)             // field 4 holds the purchase amount
  .reduce(_ + _)                                   // total revenue per mini-batch
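The pipeline above only describes the computation; nothing runs until the streaming context is started. A short sketch of the final steps, assuming the ssc and revenue definitions above:

  revenue.print()         // emit each mini-batch's total to the driver log
  ssc.start()             // begin consuming from Kafka in 10-second batches
  ssc.awaitTermination()  // block the driver until the stream is stopped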
Fault Recovery
RDDs track lineage information that can be used to efficiently recompute lost data:
val msgs = textFile.filter(l => l.contains("ERROR"))
  .map(e => e.split("\t")(2))
[Lineage diagram: HDFS File → filter(func = startsWith(...)) → Filtered RDD → map(func = split(...)) → Mapped RDD]
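The lineage is inspectable from the shell; a small sketch, assuming the msgs RDD above:

  // toDebugString prints the chain of parent RDDs Spark would replay
  // to rebuild a lost partition
  println(msgs.toDebugString)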
JOB EXECUTION
Software Components
• Spark runs as a library in your program (1 instance per app)
• Runs tasks locally or on a cluster
– Mesos, YARN, or standalone mode
• Accesses storage systems via the Hadoop InputFormat API
– Can use HBase, HDFS, S3, …
[Architecture diagram: your application holds a SparkContext with local threads, talking to a cluster manager; each worker runs a Spark executor on top of HDFS or other storage]
Task Scheduler
• General task graphs
• Automatically pipelines functions
• Data locality aware
• Partitioning aware, to avoid shuffles (sketch after the diagram)
[DAG diagram: RDDs A–F connected by map, groupBy, filter, and join, split into Stages 1–3; cached partitions are skipped on recomputation]
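A sketch of partitioning awareness with hypothetical pair RDDs, assuming a shell SparkContext sc: pre-partitioning and caching one side of a join lets the scheduler avoid re-shuffling it.

  import org.apache.spark.HashPartitioner
  val users  = sc.parallelize(Seq((1, "Ana"), (2, "Bo")))
    .partitionBy(new HashPartitioner(8))
    .cache()                        // cached partitions are reused across jobs
  val events = sc.parallelize(Seq((1, "click"), (2, "view")))
  val joined = users.join(events)   // users keeps its partitioning; only events is shuffled
  joined.count()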
Advanced Features
• Controllable partitioning
– Speed up joins against a dataset
• Controllable storage formats
– Keep data serialized for efficiency, replicate to multiple nodes, cache on disk
• Shared variables: broadcasts, accumulators (sketch below)
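A hedged sketch of the shared variables above, with a hypothetical lookup table and data:

  val codes  = sc.broadcast(Map("500" -> "server error"))  // read-only copy, shipped once per node
  val errors = sc.accumulator(0)                           // counter, aggregated on the driver
  val logs   = sc.parallelize(Seq("500\tERROR", "200\tOK"))
  val labeled = logs.map { l =>
    if (l.contains("ERROR")) errors += 1                   // counted once an action runs
    (l, codes.value.getOrElse(l.split("\t")(0), "unknown"))
  }
  labeled.count()                                          // action: materializes the map
  println(errors.value)                                    // prints 1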
WORKING WITH SPARK
Using the Shell
Launching:
spark-shell
pyspark (IPYTHON=1)
Modes:
MASTER=local ./spark-shell              # local, 1 thread
MASTER=local[2] ./spark-shell           # local, 2 threads
MASTER=spark://host:port ./spark-shell  # cluster
SparkContext
• Main entry point to Spark functionality
• Available in shell as variable sc
• In standalone programs, you'd make your own (sketch below)
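A minimal sketch of a standalone entry point; the app name and master URL are placeholders:

  import org.apache.spark.{SparkConf, SparkContext}
  object MyApp {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf()
        .setAppName("MyApp")
        .setMaster("local[2]")       // or spark://host:port, yarn, mesos://...
      val sc = new SparkContext(conf)
      // ... build and run RDD pipelines with sc ...
      sc.stop()
    }
  }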
RDD Operators
• map
• filter
• groupBy
• sort
• union
• join
• leftOuterJoin
• rightOuterJoin
• reduce
• count
• fold
• reduceByKey
• groupByKey
• cogroup
• cross
• zip
• sample
• take
• first
• partitionBy
• mapWith
• pipe
• save ...
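A small sketch exercising a few of these operators on hypothetical pair RDDs, assuming a shell SparkContext sc:

  val clicks = sc.parallelize(Seq(("u1", 3), ("u2", 1), ("u1", 2)))
  val names  = sc.parallelize(Seq(("u1", "Ana"), ("u2", "Bo")))
  val totals = clicks.reduceByKey(_ + _)  // transformation: ("u1", 5), ("u2", 1)
  val joined = totals.join(names)         // transformation: ("u1", (5, "Ana")), ("u2", (1, "Bo"))
  joined.collect().foreach(println)       // collect is an action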
Add Spark to Your Project
• Scala: add the dependency to build.sbt
libraryDependencies ++= Seq(
  ("org.apache.spark" %% "spark-core" % "1.2.0").
    exclude("org.mortbay.jetty", "servlet-api")
  ...
)
• make sure you exclude overlapping libs
… and add Spark-Streaming
• Scala: add the dependency to build.sbt
libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % "1.2.0"
[Charts: PageRank performance, and time per iteration (s) for other iterative algorithms]
CONCLUSION
Conclusion
• Spark offers a rich API to make data analytics fast: both fast to write and fast to run
• Achieves up to 100x speedups in real applications
• Growing community with 25+ companies contributing
Thanks!!!
Disclosure: parts of this presentation were inspired by Matei Zaharia's AMP meetup talks in Berkeley.