INTRODUCTION TO 
APACHE SPARK 
Mohamed Hedi Abidi - Software Engineer @ebiznext 
@mh_abidi
CONTENT 
 Spark Introduction 
 Installation 
 Spark-Shell 
 SparkContext 
 RDD 
 Persistence 
 Simple Spark Apps 
 Deployment 
 Spark SQL 
 Spark GraphX 
 Spark MLlib 
 Spark Streaming 
 Spark & Elasticsearch
INTRODUCTION 
An open source cluster computing framework for 
large-scale data analytics 
In-memory data processing 
Up to 100x faster than Hadoop MapReduce for in-memory workloads 
Supports the MapReduce programming model
INTRODUCTION 
 Handles batch, interactive, and real-time within a single 
framework
INTRODUCTION 
 Programming at a higher level of abstraction : faster, 
easier development
INTRODUCTION 
 Highly accessible through standard APIs built in Java, 
Scala, Python, or SQL (for interactive queries), and a rich 
set of machine learning libraries 
 Compatibility with the existing Hadoop v1 (SIMR) and 
2.x (YARN) ecosystems so companies can leverage their 
existing infrastructure.
INSTALLATION 
 Install JDK 1.7+, Scala 2.10.x, sbt 0.13.7, Maven 3.0+ 
 Download and unzip the Apache Spark 1.1.0 sources 
Or clone the development version: 
git clone git://github.com/apache/spark.git 
 Run Maven to build Apache Spark 
mvn -DskipTests clean package 
 Launch Apache Spark standalone REPL 
[spark_home]/bin/spark-shell 
 Go to the Spark UI at 
http://localhost:4040
SPARK-SHELL 
 we’ll run Spark’s interactive shell… within the “spark” 
directory, run: 
./bin/spark-shell 
 then from the “scala>” REPL prompt, let’s create some 
data… 
scala> val data = 1 to 10000 
 create an RDD based on that data… 
scala> val distData = sc.parallelize(data) 
 then use a filter to select values less than 10… 
scala> distData.filter(_ < 10).collect()
SPARKCONTEXT 
 The first thing a Spark program must do is to create a 
SparkContext object, which tells Spark how to access a 
cluster. 
 In the shell for either Scala or Python, this is the sc 
variable, which is created automatically 
 Other programs must use a constructor to instantiate a 
new SparkContext 
val conf = new SparkConf().setAppName(appName).setMaster(master) 
new SparkContext(conf)
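For context, a minimal self-contained application built around that constructor might look like the sketch below (the object name, app name and master URL are illustrative assumptions, not taken from the slides): 

import org.apache.spark.{SparkConf, SparkContext} 

object SimpleApp { 
  def main(args: Array[String]): Unit = { 
    // Describe the application and where it runs; local[2] uses two worker threads. 
    val conf = new SparkConf().setAppName("SimpleApp").setMaster("local[2]") 
    val sc = new SparkContext(conf) 

    // Distribute a small collection and compute its sum on the cluster. 
    val total = sc.parallelize(1 to 100).reduce(_ + _) 
    println(s"Sum = $total") 

    sc.stop() // release the context when the job is done 
  } 
}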
RDDS 
 Resilient Distributed Datasets (RDD) are the primary 
abstraction in Spark – It is an immutable distributed 
collection of data, which is partitioned across machines 
in a cluster 
 There are currently two types: 
 parallelized collections : Take an existing Scala collection and 
run functions on it in parallel 
 External datasets : Spark can create distributed datasets from 
any storage source supported by Hadoop, including local file 
system, HDFS, Cassandra, HBase, Amazon S3, etc.
RDDS 
 Parallelized collections 
scala> val data = Array(1, 2, 3, 4, 5) 
data: Array[Int] = Array(1, 2, 3, 4, 5) 
scala> val distData = sc.parallelize(data) 
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at 
parallelize at <console>:14 
 External datasets 
scala> val distFile = sc.textFile("README.md") 
distFile: org.apache.spark.rdd.RDD[String] = README.md MappedRDD[7] at 
textFile at <console>:12
RDDS 
 Two types of operations on RDDs: 
transformations and actions 
 A transformation is a lazy (not computed immediately) 
operation on an RDD that yields another RDD 
 An action is an operation that triggers a computation, 
returns a value back to the Master, or writes to a stable 
storage system
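As a quick illustration of laziness, here is a sketch in the shell that uses only the operations introduced so far; nothing runs until the action is called (the values shown are what this example would produce): 

scala> val nums = sc.parallelize(1 to 1000000)  // RDD defined, nothing computed yet 
scala> val evens = nums.filter(_ % 2 == 0)      // transformation: still lazy 
scala> evens.count()                            // action: triggers the actual job 
res0: Long = 500000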
RDDS : COMMONLY USED TRANSFORMATIONS 

filter(func) 
Purpose: new RDD by selecting those data elements on which func returns true 
Example: 
scala> val rdd = sc.parallelize(List("ABC","BCD","DEF")) 
scala> val filtered = rdd.filter(_.contains("C")) 
scala> filtered.collect() 
Result: 
Array[String] = Array(ABC, BCD) 

map(func) 
Purpose: return a new RDD by applying func to each data element 
Example: 
scala> val rdd = sc.parallelize(List(1,2,3,4,5)) 
scala> val times2 = rdd.map(_*2) 
scala> times2.collect() 
Result: 
Array[Int] = Array(2, 4, 6, 8, 10) 

flatMap(func) 
Purpose: similar to map, but func returns a Seq instead of a single value. For example, mapping a sentence into a Seq of words 
Example: 
scala> val rdd = sc.parallelize(List("Spark is awesome","It is fun")) 
scala> val fm = rdd.flatMap(str => str.split(" ")) 
scala> fm.collect() 
Result: 
Array[String] = Array(Spark, is, awesome, It, is, fun)
RDDS : COMMONLY USED TRANSFORMATIONS 

reduceByKey(func, [numTasks]) 
Purpose: aggregate the values of a key using a function. "numTasks" is an optional parameter to specify the number of reduce tasks 
Example: 
scala> val word1 = fm.map(word => (word,1)) 
scala> val wrdCnt = word1.reduceByKey(_+_) 
scala> wrdCnt.collect() 
Result: 
Array[(String, Int)] = Array((is,2), (It,1), (awesome,1), (Spark,1), (fun,1)) 

groupByKey([numTasks]) 
Purpose: convert (K,V) to (K,Iterable<V>) 
Example: 
scala> val cntWrd = wrdCnt.map{ case (word, count) => (count, word) } 
scala> cntWrd.groupByKey().collect() 
Result: 
Array[(Int, Iterable[String])] = Array((1,ArrayBuffer(It, awesome, Spark, fun)), (2,ArrayBuffer(is))) 

distinct([numTasks]) 
Purpose: eliminate duplicates from an RDD 
Example: 
scala> fm.distinct().collect() 
Result: 
Array[String] = Array(is, It, awesome, Spark, fun)
RDDS : COMMONLY USED ACTIONS 

count() 
Purpose: get the number of data elements in the RDD 
Example: 
scala> val rdd = sc.parallelize(List('A','B','C')) 
scala> rdd.count() 
Result: 
Long = 3 

collect() 
Purpose: get all the data elements of an RDD as an Array 
Example: 
scala> val rdd = sc.parallelize(List('A','B','C')) 
scala> rdd.collect() 
Result: 
Array[Char] = Array(A, B, C) 

reduce(func) 
Purpose: aggregate the data elements of an RDD using a function that takes two arguments and returns one 
Example: 
scala> val rdd = sc.parallelize(List(1,2,3,4)) 
scala> rdd.reduce(_+_) 
Result: 
Int = 10 

take(n) 
Purpose: fetch the first n data elements of an RDD. Computed by the driver program. 
Example: 
scala> val rdd = sc.parallelize(List(1,2,3,4)) 
scala> rdd.take(2) 
Result: 
Array[Int] = Array(1, 2)
RDDS : COMMONLY USED ACTIONS 

foreach(func) 
Purpose: execute the function for each data element of the RDD. Usually used to update an accumulator (discussed later) or to interact with external systems. 
Example: 
scala> val rdd = sc.parallelize(List(1,2)) 
scala> rdd.foreach(x => println("%s*10=%s".format(x, x*10))) 
Result: 
1*10=10 
2*10=20 

first() 
Purpose: retrieve the first data element of the RDD. Similar to take(1) 
Example: 
scala> val rdd = sc.parallelize(List(1,2,3,4)) 
scala> rdd.first() 
Result: 
Int = 1 

saveAsTextFile(path) 
Purpose: write the content of the RDD as a text file or a set of text files to the local file system/HDFS 
Example: 
scala> val hamlet = sc.textFile("readme.txt") 
scala> hamlet.filter(_.contains("Spark")).saveAsTextFile("filtered") 
Result: 
…/filtered$ ls 
_SUCCESS part-00000 part-00001
RDDS 
 For a more detailed list of actions and transformations, 
please refer to: 
http://spark.apache.org/docs/latest/programming-guide.html#transformations 
http://spark.apache.org/docs/latest/programming-guide.html#actions
PERSISTENCE 
 Spark can persist (or cache) a dataset in memory across 
operations 
 Each node stores in memory any slices of it that it 
computes and reuses them in other actions on that 
dataset – often making future actions more than 10x 
faster 
 The cache is fault-tolerant: if any partition of an RDD is 
lost, it will automatically be recomputed using the 
transformations that originally created it
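A quick sketch of caching in the shell, assuming the same README.md file used earlier: 

scala> val lines = sc.textFile("README.md") 
scala> lines.cache()     // mark the RDD to be kept in memory 
scala> lines.count()     // first action: reads the file and fills the cache 
scala> lines.count()     // second action: served from memory, typically much faster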
PERSISTENCE : STORAGE LEVEL 

MEMORY_ONLY (default level) 
Store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level. 

MEMORY_AND_DISK 
Store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed. 

MEMORY_ONLY_SER 
Store the RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read. 

MEMORY_AND_DISK_SER 
Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed. 

DISK_ONLY 
Store the RDD partitions only on disk. 

MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. 
Same as the levels above, but replicate each partition on two cluster 
nodes.
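To request one of the non-default levels, pass it to persist() instead of calling cache(). A sketch, again using README.md as an assumed input: 

scala> import org.apache.spark.storage.StorageLevel 
scala> val lines = sc.textFile("README.md") 
scala> lines.persist(StorageLevel.MEMORY_AND_DISK_SER)  // serialize in memory, spill to disk if needed 
scala> lines.count()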
SIMPLE SPARK APPS : WORDCOUNT 
Download the project from GitHub: 
https://github.com/MohamedHedi/SparkSamples 
WordCount.scala: 
val logFile = args(0) 
val conf = new SparkConf().setAppName("WordCount") 
val sc = new SparkContext(conf) 
val logData = sc.textFile(logFile, 2).cache() 
val numApache = logData.filter(line => line.contains("apache")).count() 
val numSpark = logData.filter(line => line.contains("spark")).count() 
println("Lines with apache: %s, Lines with spark: %s".format(numApache, 
numSpark)) 
 sbt 
 compile 
 assembly
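The actual build definition lives in the GitHub project; as a rough sketch of what an sbt build with the sbt-assembly plugin typically looks like for such an application (names and versions below are assumptions, not taken from the repository): 

// build.sbt (illustrative) 
name := "SparkSamples" 

version := "1.0" 

scalaVersion := "2.10.4" 

// Spark itself is provided by the cluster at runtime, so it is excluded from the fat jar 
// that sbt assembly produces (the sbt-assembly plugin is declared in project/plugins.sbt). 
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0" % "provided"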
SPARK-SUBMIT 
./bin/spark-submit 
--class <main-class> 
--master <master-url> 
--deploy-mode <deploy-mode> 
--conf <key>=<value> 
... # other options 
<application-jar> 
[application-arguments]
SPARK-SUBMIT : LOCAL MODE 
./bin/spark-submit 
--class com.ebiznext.spark.examples.WordCount 
--master local[4] 
--deploy-mode client 
--conf <key>=<value> 
... # other options 
.\target\scala-2.10\SparkSamples-assembly-1.0.jar 
.\ressources\README.md
CLUSTER MANAGER TYPES 
 Spark supports three cluster managers: 
 Standalone – a simple cluster manager included with Spark 
that makes it easy to set up a cluster. 
 Apache Mesos – a general cluster manager that can also run 
Hadoop MapReduce and service applications. 
 Hadoop YARN – the resource manager in Hadoop 2.
MASTER URLS 
Master URL Meaning 
local Run Spark locally with one worker thread (no parallelism at all). 
local[K] Run Spark locally with K worker threads (ideally, set 
this to the number of cores on your machine). 
local[*] Run Spark locally with as many worker threads as 
logical cores on your machine. 
spark://HOST:PORT Connect to the given Spark standalone cluster master. 
Default master port : 7077 
mesos://HOST:PORT Connect to the given Mesos cluster. 
Default mesos port : 5050 
yarn-client Connect to a YARN cluster in client mode. The cluster 
location will be found based on the 
HADOOP_CONF_DIR variable. 
yarn-cluster Connect to a YARN cluster in cluster mode. The cluster 
location will be found based on HADOOP_CONF_DIR.
SPARK-SUBMIT : STANDALONE CLUSTER 
 ./sbin/start-master.sh 
(Windows users: spark-class.cmd org.apache.spark.deploy.master.Master) 
 Go to the master’s web UI
SPARK-SUBMIT : STANDALONE CLUSTER 
 Connect Workers to the Master 
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT 
 Go to the master’s web UI
SPARK-SUBMIT : STANDALONE CLUSTER 
./bin/spark-submit --class com.ebiznext.spark.examples.WordCount 
--master spark://localhost:7077 .\target\scala-2.10\SparkSamples-assembly-1.0.jar 
.\ressources\README.md
SPARK SQL 
 Shark is being migrated to Spark SQL 
 Spark SQL blurs the lines between RDDs and relational 
tables 
val conf = new SparkConf().setAppName("SparkSQL") 
val sc = new SparkContext(conf) 
val peopleFile = args(0) 
val sqlContext = new org.apache.spark.sql.SQLContext(sc) 
import sqlContext._ 
// Define the schema using a case class. 
case class Person(name: String, age: Int) 
// Create an RDD of Person objects and register it as a table. 
val people = sc.textFile(peopleFile).map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)) 
people.registerAsTable("people") 
// SQL statements can be run by using the sql methods provided by sqlContext. 
val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19") 
// The results of SQL queries are SchemaRDDs and support all the normal RDD operations. 
// The columns of a row in the result can be accessed by ordinal. 
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
SPARK GRAPHX 
 GraphX is the new (alpha) Spark API for graphs and graph-parallel 
computation. 
 GraphX extends the Spark RDD by introducing the Resilient Distributed 
Property Graph 
case class Peep(name: String, age: Int) 
val vertexArray = Array( 
(1L, Peep("Kim", 23)), (2L, Peep("Pat", 31)), 
(3L, Peep("Chris", 52)), (4L, Peep("Kelly", 39)), 
(5L, Peep("Leslie", 45))) 
val edgeArray = Array( 
Edge(2L, 1L, 7), Edge(2L, 4L, 2), 
Edge(3L, 2L, 4), Edge(3L, 5L, 3), 
Edge(4L, 1L, 1), Edge(5L, 3L, 9)) 
val conf = new SparkConf().setAppName("SparkGraphx") 
val sc = new SparkContext(conf) 
val vertexRDD: RDD[(Long, Peep)] = sc.parallelize(vertexArray) 
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray) 
val g: Graph[Peep, Int] = Graph(vertexRDD, edgeRDD) 
val results = g.triplets.filter(t => t.attr > 7) 
for (triplet <- results.collect) { 
println(s"${triplet.srcAttr.name} loves ${triplet.dstAttr.name}") 
}
SPARK MLLIB 
MLlib is Spark’s scalable machine learning library 
consisting of common learning algorithms and utilities. 
Use cases : 
 Recommendation Engine 
 Content classification 
 Ranking 
Algorithms : 
 Classification and regression : linear regression, decision 
trees, naive Bayes 
 Collaborative filtering : alternating least squares (ALS) 
 Clustering : k-means 
 …
SPARK MLLIB 
SparkKMeans.scala 
val sparkConf = new SparkConf().setAppName("SparkKMeans") 
val sc = new SparkContext(sparkConf) 
val lines = sc.textFile(args(0)) 
val data = lines.map(parseVector _).cache() 
val K = args(1).toInt 
val convergeDist = args(2).toDouble 
val kPoints = data.takeSample(withReplacement = false, K, 42).toArray 
var tempDist = 1.0 
while (tempDist > convergeDist) { 
val closest = data.map(p => (closestPoint(p, kPoints), (p, 1))) 
val pointStats = closest.reduceByKey { case ((x1, y1), (x2, y2)) => (x1 + x2, y1 + y2) } 
val newPoints = pointStats.map { pair => 
(pair._1, pair._2._1 * (1.0 / pair._2._2)) 
}.collectAsMap() 
tempDist = 0.0 
for (i <- 0 until K) { 
tempDist += squaredDistance(kPoints(i), newPoints(i)) 
} 
for (newP <- newPoints) yield { 
kPoints(newP._1) = newP._2 
} 
println("Finished iteration (delta = " + tempDist + ")") 
} 
println("Final centers:") 
kPoints.foreach(println) 
sc.stop()
SPARK STREAMING 
 Spark Streaming extends the core API to allow high-throughput, fault-tolerant 
stream processing of live data streams 
 Data can be ingested from many sources: Kafka, Flume, Twitter, 
ZeroMQ, TCP sockets… 
 Results can be pushed out to filesystems, databases, live dashboards… 
 Spark’s MLlib algorithms and graph processing algorithms can be 
applied to data streams
SPARK STREAMING 
val ssc = new StreamingContext(sparkConf, Seconds(10)) 
 Create a StreamingContext by providing the configuration and batch 
duration
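Putting the StreamingContext to work, a minimal word-count sketch over a TCP socket (the hostname, port and print() sink are illustrative assumptions): 

import org.apache.spark.SparkConf 
import org.apache.spark.streaming.{Seconds, StreamingContext} 

val sparkConf = new SparkConf().setAppName("StreamingWordCount") 
val ssc = new StreamingContext(sparkConf, Seconds(10)) 

// Every 10-second batch of lines received on localhost:9999 becomes an RDD in the DStream. 
val lines = ssc.socketTextStream("localhost", 9999) 
val wordCounts = lines.flatMap(_.split(" ")) 
  .map(word => (word, 1)) 
  .reduceByKey(_ + _) 

wordCounts.print()     // print the first elements of each batch to the console 

ssc.start()            // start receiving and processing 
ssc.awaitTermination() // block until the streaming job is stopped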
TWITTER - SPARK STREAMING - ELASTICSEARCH 
1. Twitter access 
val keys = ssc.sparkContext.textFile(args(0), 2).cache() 
val Array(consumerKey, consumerSecret, accessToken, accessTokenSecret) = keys.take(4) 
// Set the system properties so that Twitter4j library used by twitter stream 
// can use them to generate OAuth credentials 
System.setProperty("twitter4j.oauth.consumerKey", consumerKey) 
System.setProperty("twitter4j.oauth.consumerSecret", consumerSecret) 
System.setProperty("twitter4j.oauth.accessToken", accessToken) 
System.setProperty("twitter4j.oauth.accessTokenSecret", accessTokenSecret) 
2. Streaming from Twitter 
val sparkConf = new SparkConf().setAppName("TwitterPopularTags") 
sparkConf.set("es.index.auto.create", "true") 
val ssc = new StreamingContext(sparkConf, Seconds(10)) 
val keys = ssc.sparkContext.textFile(args(0), 2).cache() 
val stream = TwitterUtils.createStream(ssc, None) 
val hashTags = stream.flatMap(status => status.getText.split(" ").filter(_.startsWith("#"))) 
val topCounts10 = hashTags.map((_, 1)).reduceByKeyAndWindow(_ + _, Seconds(10)) 
.map { case (topic, count) => (count, topic) } 
.transform(_.sortByKey(false))
TWITTER - SPARK STREAMING - ELASTICSEARCH 
 Index in Elasticsearch 
 Add the elasticsearch-spark jar to build.sbt: 
libraryDependencies += "org.elasticsearch" % "elasticsearch-spark_2.10" % "2.1.0.Beta3" 
 Write an RDD to Elasticsearch: 
import org.elasticsearch.spark._ // brings the saveToEs implicit into scope 
val sparkConf = new SparkConf().setAppName(appName).setMaster(master) 
sparkConf.set("es.index.auto.create", "true") 
val apache = Map("hashtag" -> "#Apache", "count" -> 10) 
val spark = Map("hashtag" -> "#Spark", "count" -> 15) 
val rdd = ssc.sparkContext.makeRDD(Seq(apache, spark)) 
rdd.saveToEs("spark/hashtag")
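To tie the two previous slides together, the hashtag counts computed on the Twitter DStream can be indexed on every batch, a hedged sketch reusing topCounts10 and the spark/hashtag index from above (the document field names are illustrative): 

import org.elasticsearch.spark._ // saveToEs on RDDs 

// For each 10-second batch, turn (count, hashtag) pairs into documents and index them. 
topCounts10.foreachRDD { rdd => 
  val docs = rdd.map { case (count, topic) => Map("hashtag" -> topic, "count" -> count) } 
  docs.saveToEs("spark/hashtag") 
} 

ssc.start() 
ssc.awaitTermination()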

Editor's Notes

  • #4: Hadoop is a Java framework that makes it easier to build scalable distributed applications. It lets applications work across thousands of nodes and petabytes of data. MapReduce is an architectural design pattern invented by Google, composed of: a Map phase (computation), where the Map processing is applied to each data set; an intermediate phase, where the data is sorted and related data is grouped so it can be processed by the same node; and a Reduce phase (aggregation), where the data is aggregated and the results from each node are combined to compute the final result.