INTRODUCTION TO 
APACHE SPARK 
Mohamed Hedi Abidi - Software Engineer @ebiznext 
@mh_abidi
CONTENT 
 Spark Introduction 
 Installation 
 Spark-Shell 
 SparkContext 
 RDD 
 Persistence 
 Simple Spark Apps 
 Deployment 
 Spark SQL 
 Spark GraphX 
 Spark MLlib 
 Spark Streaming 
 Spark & Elasticsearch
INTRODUCTION 
An open source cluster computing framework for 
large-scale data analytics 
In-memory data processing 
Up to 100x faster than Hadoop MapReduce for in-memory workloads 
Supports the MapReduce programming model
INTRODUCTION 
 Handles batch, interactive, and real-time within a single 
framework
INTRODUCTION 
 Programming at a higher level of abstraction : faster, 
easier development
INTRODUCTION 
 Highly accessible through standard APIs built in Java, 
Scala, Python, or SQL (for interactive queries), and a rich 
set of machine learning libraries 
 Compatibility with the existing Hadoop v1 (SIMR) and 
2.x (YARN) ecosystems so companies can leverage their 
existing infrastructure.
INSTALLATION 
 Install JDK 1.7+, Scala 2.10.x, sbt 0.13.7, Maven 3.0+ 
 Download and unzip the Apache Spark 1.1.0 sources 
Or clone the development version: 
git clone git://github.com/apache/spark.git 
 Run Maven to build Apache Spark 
mvn -DskipTests clean package 
 Launch Apache Spark standalone REPL 
[spark_home]/bin/spark-shell 
 Go to the Spark UI at 
http://localhost:4040
SPARK-SHELL 
 we’ll run Spark’s interactive shell… within the “spark” 
directory, run: 
./bin/spark-shell 
 then from the “scala>” REPL prompt, let’s create some 
data… 
scala> val data = 1 to 10000 
 create an RDD based on that data… 
scala> val distData = sc.parallelize(data) 
 then use a filter to select values less than 10… 
scala> distData.filter(_ < 10).collect()
SPARKCONTEXT 
 The first thing a Spark program must do is to create a 
SparkContext object, which tells Spark how to access a 
cluster. 
 In the shell for either Scala or Python, this is the sc 
variable, which is created automatically 
 Other programs must use a constructor to instantiate a 
new SparkContext 
val conf = new SparkConf().setAppName(appName).setMaster(master) 
new SparkContext(conf)
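For context, a minimal self-contained application built around that constructor might look like the sketch below (the object name, app name and master URL are illustrative assumptions, not taken from the slides): 

import org.apache.spark.{SparkConf, SparkContext} 

object SimpleApp { 
  def main(args: Array[String]): Unit = { 
    // Describe the application and where it runs; local[2] uses two worker threads. 
    val conf = new SparkConf().setAppName("SimpleApp").setMaster("local[2]") 
    val sc = new SparkContext(conf) 

    // Distribute a small collection and compute its sum on the cluster. 
    val total = sc.parallelize(1 to 100).reduce(_ + _) 
    println(s"Sum = $total") 

    sc.stop() // release the context when the job is done 
  } 
}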
RDDS 
 Resilient Distributed Datasets (RDD) are the primary 
abstraction in Spark – It is an immutable distributed 
collection of data, which is partitioned across machines 
in a cluster 
 There are currently two types: 
 parallelized collections : Take an existing Scala collection and 
run functions on it in parallel 
 External datasets : Spark can create distributed datasets from 
any storage source supported by Hadoop, including local file 
system, HDFS, Cassandra, HBase, Amazon S3, etc.
RDDS 
 Parallelized collections 
scala> val data = Array(1, 2, 3, 4, 5) 
data: Array[Int] = Array(1, 2, 3, 4, 5) 
scala> val distData = sc.parallelize(data) 
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at 
parallelize at <console>:14 
 External datasets 
scala> val distFile = sc.textFile("README.md") 
distFile: org.apache.spark.rdd.RDD[String] = README.md MappedRDD[7] at 
textFile at <console>:12
RDDS 
 Two types of operations on RDDs: 
transformations and actions 
 A transformation is a lazy (not computed immediately) 
operation on an RDD that yields another RDD 
 An action is an operation that triggers a computation, 
returns a value back to the Master, or writes to a stable 
storage system
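As a quick illustration of laziness, here is a sketch in the shell that uses only the operations introduced so far; nothing runs until the action is called (the values shown are what this example would produce): 

scala> val nums = sc.parallelize(1 to 1000000)  // RDD defined, nothing computed yet 
scala> val evens = nums.filter(_ % 2 == 0)      // transformation: still lazy 
scala> evens.count()                            // action: triggers the actual job 
res0: Long = 500000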
RDDS : COMMONLY USED TRANSFORMATIONS 

filter(func) 
Purpose: new RDD by selecting those data elements on which func returns true 
Example: 
scala> val rdd = sc.parallelize(List("ABC","BCD","DEF")) 
scala> val filtered = rdd.filter(_.contains("C")) 
scala> filtered.collect() 
Result: 
Array[String] = Array(ABC, BCD) 

map(func) 
Purpose: return a new RDD by applying func to each data element 
Example: 
scala> val rdd = sc.parallelize(List(1,2,3,4,5)) 
scala> val times2 = rdd.map(_*2) 
scala> times2.collect() 
Result: 
Array[Int] = Array(2, 4, 6, 8, 10) 

flatMap(func) 
Purpose: similar to map, but func returns a Seq instead of a single value. For example, mapping a sentence into a Seq of words 
Example: 
scala> val rdd = sc.parallelize(List("Spark is awesome","It is fun")) 
scala> val fm = rdd.flatMap(str => str.split(" ")) 
scala> fm.collect() 
Result: 
Array[String] = Array(Spark, is, awesome, It, is, fun)
RDDS : COMMONLY USED TRANSFORMATIONS 

reduceByKey(func, [numTasks]) 
Purpose: aggregate the values of a key using a function. "numTasks" is an optional parameter to specify the number of reduce tasks 
Example: 
scala> val word1 = fm.map(word => (word,1)) 
scala> val wrdCnt = word1.reduceByKey(_+_) 
scala> wrdCnt.collect() 
Result: 
Array[(String, Int)] = Array((is,2), (It,1), (awesome,1), (Spark,1), (fun,1)) 

groupByKey([numTasks]) 
Purpose: convert (K,V) to (K,Iterable<V>) 
Example: 
scala> val cntWrd = wrdCnt.map{ case (word, count) => (count, word) } 
scala> cntWrd.groupByKey().collect() 
Result: 
Array[(Int, Iterable[String])] = Array((1,ArrayBuffer(It, awesome, Spark, fun)), (2,ArrayBuffer(is))) 

distinct([numTasks]) 
Purpose: eliminate duplicates from an RDD 
Example: 
scala> fm.distinct().collect() 
Result: 
Array[String] = Array(is, It, awesome, Spark, fun)
RDDS : COMMONLY USED ACTIONS 

count() 
Purpose: get the number of data elements in the RDD 
Example: 
scala> val rdd = sc.parallelize(List('A','B','C')) 
scala> rdd.count() 
Result: 
Long = 3 

collect() 
Purpose: get all the data elements of an RDD as an Array 
Example: 
scala> val rdd = sc.parallelize(List('A','B','C')) 
scala> rdd.collect() 
Result: 
Array[Char] = Array(A, B, C) 

reduce(func) 
Purpose: aggregate the data elements of an RDD using a function that takes two arguments and returns one 
Example: 
scala> val rdd = sc.parallelize(List(1,2,3,4)) 
scala> rdd.reduce(_+_) 
Result: 
Int = 10 

take(n) 
Purpose: fetch the first n data elements of an RDD. Computed by the driver program. 
Example: 
scala> val rdd = sc.parallelize(List(1,2,3,4)) 
scala> rdd.take(2) 
Result: 
Array[Int] = Array(1, 2)
RDDS : COMMONLY USED ACTIONS 

foreach(func) 
Purpose: execute the function for each data element of the RDD. Usually used to update an accumulator (discussed later) or to interact with external systems. 
Example: 
scala> val rdd = sc.parallelize(List(1,2)) 
scala> rdd.foreach(x => println("%s*10=%s".format(x, x*10))) 
Result: 
1*10=10 
2*10=20 

first() 
Purpose: retrieve the first data element of the RDD. Similar to take(1) 
Example: 
scala> val rdd = sc.parallelize(List(1,2,3,4)) 
scala> rdd.first() 
Result: 
Int = 1 

saveAsTextFile(path) 
Purpose: write the content of the RDD as a text file or a set of text files to the local file system/HDFS 
Example: 
scala> val hamlet = sc.textFile("readme.txt") 
scala> hamlet.filter(_.contains("Spark")).saveAsTextFile("filtered") 
Result: 
…/filtered$ ls 
_SUCCESS part-00000 part-00001
RDDS 
 For a more detailed list of actions and transformations, 
please refer to: 
http://spark.apache.org/docs/latest/programming-guide.html#transformations 
http://spark.apache.org/docs/latest/programming-guide.html#actions
PERSISTENCE 
 Spark can persist (or cache) a dataset in memory across 
operations 
 Each node stores in memory any slices of it that it 
computes and reuses them in other actions on that 
dataset – often making future actions more than 10x 
faster 
 The cache is fault-tolerant: if any partition of an RDD is 
lost, it will automatically be recomputed using the 
transformations that originally created it
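A quick sketch of caching in the shell, assuming the same README.md file used earlier: 

scala> val lines = sc.textFile("README.md") 
scala> lines.cache()     // mark the RDD to be kept in memory 
scala> lines.count()     // first action: reads the file and fills the cache 
scala> lines.count()     // second action: served from memory, typically much faster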
PERSISTENCE : STORAGE LEVEL 

MEMORY_ONLY (default level) 
Store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level. 

MEMORY_AND_DISK 
Store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed. 

MEMORY_ONLY_SER 
Store the RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read. 

MEMORY_AND_DISK_SER 
Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed. 

DISK_ONLY 
Store the RDD partitions only on disk. 

MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. 
Same as the levels above, but replicate each partition on two cluster 
nodes.
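To request one of the non-default levels, pass it to persist() instead of calling cache(). A sketch, again using README.md as an assumed input: 

scala> import org.apache.spark.storage.StorageLevel 
scala> val lines = sc.textFile("README.md") 
scala> lines.persist(StorageLevel.MEMORY_AND_DISK_SER)  // serialize in memory, spill to disk if needed 
scala> lines.count()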
SIMPLE SPARK APPS : WORDCOUNT 
Download the project from GitHub: 
https://github.com/MohamedHedi/SparkSamples 
WordCount.scala: 
val logFile = args(0) 
val conf = new SparkConf().setAppName("WordCount") 
val sc = new SparkContext(conf) 
val logData = sc.textFile(logFile, 2).cache() 
val numApache = logData.filter(line => line.contains("apache")).count() 
val numSpark = logData.filter(line => line.contains("spark")).count() 
println("Lines with apache: %s, Lines with spark: %s".format(numApache, 
numSpark)) 
 sbt 
 compile 
 assembly
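The actual build definition lives in the GitHub project; as a rough sketch of what an sbt build with the sbt-assembly plugin typically looks like for such an application (names and versions below are assumptions, not taken from the repository): 

// build.sbt (illustrative) 
name := "SparkSamples" 

version := "1.0" 

scalaVersion := "2.10.4" 

// Spark itself is provided by the cluster at runtime, so it is excluded from the fat jar 
// that sbt assembly produces (the sbt-assembly plugin is declared in project/plugins.sbt). 
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0" % "provided"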
SPARK-SUBMIT 
./bin/spark-submit 
--class <main-class> 
--master <master-url> 
--deploy-mode <deploy-mode> 
--conf <key>=<value> 
... # other options 
<application-jar> 
[application-arguments]
SPARK-SUBMIT : LOCAL MODE 
./bin/spark-submit 
--class com.ebiznext.spark.examples.WordCount 
--master local[4] 
--deploy-mode client 
--conf <key>=<value> 
... # other options 
.\target\scala-2.10\SparkSamples-assembly-1.0.jar 
.\ressources\README.md
CLUSTER MANAGER TYPES 
 Spark supports three cluster managers: 
 Standalone – a simple cluster manager included with Spark 
that makes it easy to set up a cluster. 
 Apache Mesos – a general cluster manager that can also run 
Hadoop MapReduce and service applications. 
 Hadoop YARN – the resource manager in Hadoop 2.
MASTER URLS 
Master URL Meaning 
local Run Spark locally with one worker thread (no parallelism at all). 
local[K] Run Spark locally with K worker threads (ideally, set 
this to the number of cores on your machine). 
local[*] Run Spark locally with as many worker threads as 
logical cores on your machine. 
spark://HOST:PORT Connect to the given Spark standalone cluster master. 
Default master port : 7077 
mesos://HOST:PORT Connect to the given Mesos cluster. 
Default mesos port : 5050 
yarn-client Connect to a YARN cluster in client mode. The cluster 
location will be found based on the 
HADOOP_CONF_DIR variable. 
yarn-cluster Connect to a YARN cluster in cluster mode. The cluster 
location will be found based on HADOOP_CONF_DIR.
SPARK-SUBMIT : STANDALONE CLUSTER 
 ./sbin/start-master.sh 
(Windows users: spark-class.cmd org.apache.spark.deploy.master.Master) 
 Go to the master’s web UI
SPARK-SUBMIT : STANDALONE CLUSTER 
 Connect Workers to the Master 
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT 
 Go to the master’s web UI
SPARK-SUBMIT : STANDALONE CLUSTER 
./bin/spark-submit --class com.ebiznext.spark.examples.WordCount 
--master spark://localhost:7077 .\target\scala-2.10\SparkSamples-assembly-1.0.jar 
.\ressources\README.md
SPARK SQL 
 Shark is being migrated to Spark SQL 
 Spark SQL blurs the lines between RDDs and relational 
tables 
val conf = new SparkConf().setAppName("SparkSQL") 
val sc = new SparkContext(conf) 
val peopleFile = args(0) 
val sqlContext = new org.apache.spark.sql.SQLContext(sc) 
import sqlContext._ 
// Define the schema using a case class. 
case class Person(name: String, age: Int) 
// Create an RDD of Person objects and register it as a table. 
val people = sc.textFile(peopleFile).map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)) 
people.registerAsTable("people") 
// SQL statements can be run by using the sql methods provided by sqlContext. 
val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19") 
// The results of SQL queries are SchemaRDDs and support all the normal RDD operations. 
// The columns of a row in the result can be accessed by ordinal. 
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
SPARK GRAPHX 
 GraphX is the new (alpha) Spark API for graphs and graph-parallel 
computation. 
 GraphX extends the Spark RDD by introducing the Resilient Distributed 
Property Graph 
case class Peep(name: String, age: Int) 
val vertexArray = Array( 
(1L, Peep("Kim", 23)), (2L, Peep("Pat", 31)), 
(3L, Peep("Chris", 52)), (4L, Peep("Kelly", 39)), 
(5L, Peep("Leslie", 45))) 
val edgeArray = Array( 
Edge(2L, 1L, 7), Edge(2L, 4L, 2), 
Edge(3L, 2L, 4), Edge(3L, 5L, 3), 
Edge(4L, 1L, 1), Edge(5L, 3L, 9)) 
val conf = new SparkConf().setAppName("SparkGraphx") 
val sc = new SparkContext(conf) 
val vertexRDD: RDD[(Long, Peep)] = sc.parallelize(vertexArray) 
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray) 
val g: Graph[Peep, Int] = Graph(vertexRDD, edgeRDD) 
val results = g.triplets.filter(t => t.attr > 7) 
for (triplet <- results.collect) { 
println(s"${triplet.srcAttr.name} loves ${triplet.dstAttr.name}") 
}
SPARK MLLIB 
MLlib is Spark’s scalable machine learning library 
consisting of common learning algorithms and utilities. 
Use cases : 
 Recommendation Engine 
 Content classification 
 Ranking 
Algorithms : 
 Classification and regression : linear regression, decision 
trees, naive Bayes 
 Collaborative filtering : alternating least squares (ALS) 
 Clustering : k-means 
 …
SPARK MLLIB 
SparkKMeans.scala 
val sparkConf = new SparkConf().setAppName("SparkKMeans") 
val sc = new SparkContext(sparkConf) 
val lines = sc.textFile(args(0)) 
val data = lines.map(parseVector _).cache() 
val K = args(1).toInt 
val convergeDist = args(2).toDouble 
val kPoints = data.takeSample(withReplacement = false, K, 42).toArray 
var tempDist = 1.0 
while (tempDist > convergeDist) { 
val closest = data.map(p => (closestPoint(p, kPoints), (p, 1))) 
val pointStats = closest.reduceByKey { case ((x1, y1), (x2, y2)) => (x1 + x2, y1 + y2) } 
val newPoints = pointStats.map { pair => 
(pair._1, pair._2._1 * (1.0 / pair._2._2)) 
}.collectAsMap() 
tempDist = 0.0 
for (i <- 0 until K) { 
tempDist += squaredDistance(kPoints(i), newPoints(i)) 
} 
for (newP <- newPoints) yield { 
kPoints(newP._1) = newP._2 
} 
println("Finished iteration (delta = " + tempDist + ")") 
} 
println("Final centers:") 
kPoints.foreach(println) 
sc.stop()
SPARK STREAMING 
 Spark Streaming extends the core API to allow high-throughput, fault-tolerant 
stream processing of live data streams 
 Data can be ingested from many sources: Kafka, Flume, Twitter, 
ZeroMQ, TCP sockets… 
 Results can be pushed out to filesystems, databases, live dashboards… 
 Spark’s MLlib algorithms and graph processing algorithms can be 
applied to data streams
SPARK STREAMING 
val ssc = new StreamingContext(sparkConf, Seconds(10)) 
 Create a StreamingContext by providing the configuration and batch 
duration
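Putting the StreamingContext to work, a minimal word-count sketch over a TCP socket (the hostname, port and print() sink are illustrative assumptions): 

import org.apache.spark.SparkConf 
import org.apache.spark.streaming.{Seconds, StreamingContext} 

val sparkConf = new SparkConf().setAppName("StreamingWordCount") 
val ssc = new StreamingContext(sparkConf, Seconds(10)) 

// Every 10-second batch of lines received on localhost:9999 becomes an RDD in the DStream. 
val lines = ssc.socketTextStream("localhost", 9999) 
val wordCounts = lines.flatMap(_.split(" ")) 
  .map(word => (word, 1)) 
  .reduceByKey(_ + _) 

wordCounts.print()     // print the first elements of each batch to the console 

ssc.start()            // start receiving and processing 
ssc.awaitTermination() // block until the streaming job is stopped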
TWITTER - SPARK STREAMING - ELASTICSEARCH 
1. Twitter access 
val keys = ssc.sparkContext.textFile(args(0), 2).cache() 
val Array(consumerKey, consumerSecret, accessToken, accessTokenSecret) = keys.take(4) 
// Set the system properties so that Twitter4j library used by twitter stream 
// can use them to generate OAuth credentials 
System.setProperty("twitter4j.oauth.consumerKey", consumerKey) 
System.setProperty("twitter4j.oauth.consumerSecret", consumerSecret) 
System.setProperty("twitter4j.oauth.accessToken", accessToken) 
System.setProperty("twitter4j.oauth.accessTokenSecret", accessTokenSecret) 
2. Streaming from Twitter 
val sparkConf = new SparkConf().setAppName("TwitterPopularTags") 
sparkConf.set("es.index.auto.create", "true") 
val ssc = new StreamingContext(sparkConf, Seconds(10)) 
val keys = ssc.sparkContext.textFile(args(0), 2).cache() 
val stream = TwitterUtils.createStream(ssc, None) 
val hashTags = stream.flatMap(status => status.getText.split(" ").filter(_.startsWith("#"))) 
val topCounts10 = hashTags.map((_, 1)).reduceByKeyAndWindow(_ + _, Seconds(10)) 
.map { case (topic, count) => (count, topic) } 
.transform(_.sortByKey(false))
TWITTER - SPARK STREAMING - ELASTICSEARCH 
 Index in Elasticsearch 
 Add the elasticsearch-spark jar to build.sbt: 
libraryDependencies += "org.elasticsearch" % "elasticsearch-spark_2.10" % "2.1.0.Beta3" 
 Write an RDD to Elasticsearch: 
import org.elasticsearch.spark._ // brings the saveToEs implicit into scope 
val sparkConf = new SparkConf().setAppName(appName).setMaster(master) 
sparkConf.set("es.index.auto.create", "true") 
val apache = Map("hashtag" -> "#Apache", "count" -> 10) 
val spark = Map("hashtag" -> "#Spark", "count" -> 15) 
val rdd = ssc.sparkContext.makeRDD(Seq(apache, spark)) 
rdd.saveToEs("spark/hashtag")
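To tie the two previous slides together, the hashtag counts computed on the Twitter DStream can be indexed on every batch, a hedged sketch reusing topCounts10 and the spark/hashtag index from above (the document field names are illustrative): 

import org.elasticsearch.spark._ // saveToEs on RDDs 

// For each 10-second batch, turn (count, hashtag) pairs into documents and index them. 
topCounts10.foreachRDD { rdd => 
  val docs = rdd.map { case (count, topic) => Map("hashtag" -> topic, "count" -> count) } 
  docs.saveToEs("spark/hashtag") 
} 

ssc.start() 
ssc.awaitTermination()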

Editor's Notes

  • #4: Hadoop is a Java framework that makes it easier to build scalable distributed applications. It lets applications work across thousands of nodes and petabytes of data. MapReduce is an architectural design pattern invented by Google, composed of: a Map phase (computation), where the Map processing is applied to each data set; an intermediate phase, where the data is sorted and related data is grouped so it can be processed by the same node; and a Reduce phase (aggregation), where the data is aggregated and the results from each node are combined to compute the final result.