Introduction to Spark 
Wisely Chen (aka thegiive) 
Sr. Engineer at Yahoo
Agenda 
• What is Spark? ( Easy ) 
• Spark Concept ( Middle ) 
• Break : 10min 
• Spark EcoSystem ( Easy ) 
• Spark Future ( Middle ) 
• Q&A
Who am I? 
• Wisely Chen ( thegiive@gmail.com ) 
• Sr. Engineer on the Yahoo! Taiwan data team 
• Loves to promote open source tech 
• Hadoop Summit 2013 San Jose 
• Jenkins Conf 2013 Palo Alto 
• Spark Summit 2014 San Francisco 
• Coscup 2006, 2012, 2013, OSDC 2007, Webconf 2013, 
PHPConf 2012, RubyConf 2012
Taiwan Data Team 
• Data Highway 
• BI Report 
• Serving API 
• Data Mart 
• ETL / Forecast 
• Machine Learning
Forecast 
Recommendation
HADOOP
Opinion from Cloudera 
• The leading candidate for “successor to MapReduce” today is Apache Spark 
• No vendor — no new project — is likely to catch up. Chasing Spark would be a waste of time, and would delay availability of real-time analytic and processing services for no good reason. 
• From http://0rz.tw/y3OfM
What is Spark 
• From UC Berkeley AMP Lab 
• The most active big data open source project since Hadoop
Community
Where is Spark?
Hadoop 2.0 
• Applications: MapReduce, Storm, HBase, others 
• Resource management: YARN 
• Storage: HDFS
Hadoop Architecture 
• Hive: SQL 
• MapReduce: computing engine 
• YARN: resource management 
• HDFS: storage
Hadoop vs Spark 
• SQL layer: Hive vs Shark/SparkSQL 
• Computing engine: MapReduce vs Spark 
• Shared underneath: YARN and HDFS
Spark vs Hadoop 
• Spark runs on YARN, Mesos, or in standalone mode 
• Spark’s main concept is based on MapReduce 
• Spark can read from (see the sketch below) 
• HDFS: data locality 
• HBase 
• Cassandra
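As a quick illustration, here is a minimal Scala sketch of reading from HDFS with a SparkContext. The namenode URI and file path are hypothetical, and HBase or Cassandra would need their own input formats or connectors: 

import org.apache.spark.{SparkConf, SparkContext}

object ReadFromHdfs {
  def main(args: Array[String]): Unit = {
    // Standalone, YARN, or Mesos is chosen via the master URL;
    // "local[*]" here just runs everything in-process for the demo.
    val conf = new SparkConf().setAppName("read-hdfs").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Reading straight from HDFS; Spark schedules each task near its block,
    // which is the "data locality" bullet above.
    val lines = sc.textFile("hdfs://namenode:9000/logs/2014/*.log")
    println(s"line count = ${lines.count()}")

    sc.stop()
  }
}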
More than MapReduce 
• Shark: Hive 
• GraphX: Pregel 
• MLlib: Mahout 
• Streaming: Storm 
• Spark Core: MapReduce 
All on HDFS and a resource management system (YARN, Mesos)
Why Spark?
“In martial arts, every defense can be broken; only speed cannot.” (天下武功,無堅不破,惟快不破)
Running time (s), MR vs Spark (from Matei’s paper: http://0rz.tw/VVqgP): 
• Logistic regression: MR 76 vs Spark 3 
• KMeans: MR 106 vs Spark 33 
• PageRank: MR 171 vs Spark 23 
3X~25X faster than the MapReduce framework
What is Spark 
• Apache Spark™ is a very fast and general 
engine for large-scale data processing
Language Support 
• Python 
• Java 
• Scala
Python Word Count 
file = spark.textFile("hdfs://...") 
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b) 
counts.saveAsTextFile("hdfs://...") 
Access data via Spark API 
Process via Python
What is Spark 
• Apache Spark™ is a very fast and general 
engine for large-scale data processing
Why is Spark so fast?
Most machine learning 
algorithms need iterative computing
PageRank 
[Diagram: ranks of nodes a, b, c, d over three iterations, with a Rank and a Tmp Result stage between iterations; e.g. node a goes 1.0 → 1.85 → 1.31 as the ranks converge]
HDFS is 100x slower than memory 
• MapReduce: Input (HDFS) → Iter 1 → Tmp (HDFS) → Iter 2 → Tmp (HDFS) → … → Iter N 
• Spark: Input (HDFS) → Iter 1 → Tmp (Mem) → Iter 2 → Tmp (Mem) → … → Iter N
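A hedged Scala sketch of the Spark row of the diagram above: cache the input once, and every later iteration reads from RAM instead of re-reading HDFS. The file layout and the toy gradient step are assumptions for illustration: 

// Each input line is assumed to be "x1 x2 x3 label", whitespace-separated.
val points = sc.textFile("hdfs://.../points.txt")
               .map(_.split(" ").map(_.toDouble))
               .cache()                          // keep partitions in memory

var w = Array.fill(3)(0.0)                       // toy linear-model weights
for (i <- 1 to 10) {                             // 10 iterations, HDFS read once
  val grad = points.map { p =>
    val err = p.init.zip(w).map { case (x, wi) => x * wi }.sum - p.last
    p.init.map(_ * err)                          // per-point gradient
  }.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
  w = w.zip(grad).map { case (wi, g) => wi - 0.1 * g }
}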
PageRank algorithm on 1 billion URL records: 
• First iteration (HDFS): takes 200 sec 
• 2nd iteration (mem): takes 7.4 sec 
• 3rd iteration (mem): takes 7.7 sec
Spark Concept
Map Reduce 
Shuffle
DAG Engine
RDD 
• Resilient Distributed Dataset 
• Collections of objects spread across a cluster, 
stored in RAM or on Disk 
• Built through parallel transformations (see the sketch below)
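A small spark-shell sketch of those three properties (the numbers are arbitrary): 

val nums = sc.parallelize(1 to 1000000, 8)   // spread across 8 partitions of the cluster
val evens = nums.filter(_ % 2 == 0)          // built through a transformation
evens.persist()                              // stored in RAM (spilling to disk is configurable)
// "Resilient": a lost partition is recomputed from the lineage
// (parallelize -> filter) rather than restored from replicas.
evens.count()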
Fault Tolerance 
“In martial arts, every defense can be broken; only speed cannot.” (天下武功,無堅不破,惟快不破)
RDD 
val a = sc.textFile("hdfs://....") 
val b = a.filter( line => line.contains("Spark") ) 
val c = b.count() 
RDD a → (transformation) → RDD b → (action) → Value c (laziness sketch below)
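One consequence worth showing: transformations are lazy, so they only record lineage, and nothing executes until an action runs. A small sketch (the path is hypothetical): 

val bad = sc.textFile("hdfs://no/such/file")   // returns immediately: no job, no error
val words = bad.flatMap(_.split(" "))          // still nothing has run
// words.count()                               // only this action would launch tasks
                                               // (and surface the missing-file error)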
Log mining 
val a = sc.textFile("hdfs://aaa.com/a.txt")
val err = a.filter( t => t.contains("ERROR") )
           .filter( t => t.contains("2014") )

err.cache()
err.count()

val m = err.filter( t => t.contains("MYSQL") ).count()
val a = err.filter( t => t.contains("APACHE") ).count()

The Driver dispatches a Task to each of the three Workers.
Log mining 
(same code as above) 
Each Worker now holds one partition of RDD a (Block1, Block2, Block3).
Log mining 
(same code as above) 
The two filters produce RDD err, partitioned across the Workers (Block1, Block2, Block3).
Log mining 
(same code as above) 
err.count() runs tasks over those partitions and returns the count to the Driver.
Log mining 
(same code as above) 
Because of err.cache(), the partitions of RDD err are now kept in memory (Cache1, Cache2, Cache3).
Log mining 
(same code as above) 
The MYSQL filter builds RDD m directly from the cached partitions (Cache1, Cache2, Cache3), with no HDFS re-read.
Log mining 
(same code as above) 
The APACHE filter likewise builds the new RDD a from the cached partitions (Cache1, Cache2, Cache3).
RDD Cache 
• 1st iteration (no cache): takes the same time as before 
• Later iterations (with cache): take 7 sec
RDD Cache 
• Data locality 
• Cache 
Self join of 5 billion records: the big shuffle takes 20 min; after cache, only 265 ms (sketch below).
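A hedged spark-shell sketch of that experiment: key the records, cache them, and the passes after the first read from memory. The record layout (tab-separated, keyed on the first column) is an assumption: 

val records = sc.textFile("hdfs://.../records.tsv")
                .map(line => (line.split("\t")(0), line))  // key by first column
                .cache()

records.count()                     // 1st pass: reads HDFS and fills the cache
val joined = records.join(records)  // self join: big shuffle, but inputs come from RAM
joined.count()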
Scala Word Count 
val file = spark.textFile("hdfs://...") 
val counts = file.flatMap(line => line.split(" ")) 
                 .map(word => (word, 1)) 
                 .reduceByKey(_ + _) 
counts.saveAsTextFile("hdfs://...")
Step by Step 
• file.flatMap(line => line.split(" ")) => (aaa,bb,cc) 
• .map(word => (word, 1)) => ((aaa,1),(bb,1)..) 
• .reduceByKey(_ + _) => ((aaa,123),(bb,23)…)
Java Wordcount 
JavaRDD<String> file = spark.textFile("hdfs://..."); 
JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() { 
  public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); } 
}); 
JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() { 
  public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); } 
}); 
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() { 
  public Integer call(Integer a, Integer b) { return a + b; } 
}); 
counts.saveAsTextFile("hdfs://...");
Java vs Scala 
• Scala: file.flatMap(line => line.split(" ")) 
• Java version: 
JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() { 
  public Iterable<String> call(String s) { 
    return Arrays.asList(s.split(" ")); } 
});
Python 
file = spark.textFile("hdfs://...") 
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b) 
counts.saveAsTextFile("hdfs://...")
Highly Recommend 
• Scala: latest API features, stable 
• Python 
• very familiar language 
• native libs: NumPy, SciPy
How to use it? 
• 1. Go to https://spark.apache.org/ 
• 2. Download and unzip it 
• 3. ./sbin/start-all.sh or ./bin/spark-shell (example session below)
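A first spark-shell session might look like this (the shell already provides sc; the REPL output is simplified): 

scala> val nums = sc.parallelize(1 to 100)
scala> nums.filter(_ % 3 == 0).sum()
res0: Double = 1683.0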
DEMO
EcoSystem/Future
Hadoop EcoSystem
Spark EcoSystem 
• SparkSQL: Hive 
• GraphX: Pregel 
• MLlib: Mahout 
• Streaming: Storm 
• Spark Core: MapReduce 
All on HDFS and a resource management system (YARN, Mesos)
Unified Platform
Detail 
• Use cases: Streaming, BI, ETL 
• Engine: Spark, with SparkSQL and MLlib 
• Data sources: Hive, HDFS, Cassandra, RDBMS
Complexity
Performance
Write once, run any use case
Spark 
• BI (SparkSQL) 
• Streaming (Spark Streaming) 
• Machine Learning (MLlib)
Spark bridges people together
Data Analyst 
Data Engineer Data Scientist
Bridge people together 
• Scala : Engineer 
• Java : Engineer 
• Python : Data Scientist , Engineer 
• R : Data Scientist , Data Analyst 
• SQL : Data Analyst
Yahoo EC team 
[Diagram] Data Platform: Traffic Data and Transaction Data → Filtered Data (HDFS) → Data Mart (Oracle), ML Model (Spark), BI Report (MSTR); queried via Shark.
Data Analyst
Data Analyst 
• 350 TB data 
• Select tweet from tweets_data where similarity(tweet, "FIFA") > 0.01 
• = Machine Learning inside SQL 
• http://youtu.be/lO7LhVZrNwA?list=PL-x35fyliRwiST9gF7Z8Nu3LgJDFRuwfr 
• https://www.youtube.com/watch?v=lO7LhVZrNwA&list=PL-x35fyliRwiST9gF7Z8Nu3LgJDFRuwfr#t=2900
Data Scientist 
http://goo.gl/q5CAx8 
http://research.janelia.org/zebrafish/
Spark 
• SQL (Data Analyst) 
• Cloud Computing (Data Engineer) 
• Machine Learning (Data Scientist)
Databricks Cloud 
DEMO
Spark 
• BI (SparkSQL) 
• Streaming (Spark Streaming) 
• Machine Learning (MLlib)
Instant BI Report 
http://youtu.be/dJQ5lV5Tldw?t=30m30s
Spark 
• BI (SparkSQL) 
• Streaming (Spark Streaming) 
• Machine Learning (MLlib)
Background Knowledge 
• Real-time tweet data is stored in a SQL database 
• Spark MLlib uses Wikipedia data to train a TF-IDF model 
• SparkSQL selects tweets and filters them with the TF-IDF model 
• Generates a live BI report
Code 
val wiki = sql("select text from wiki") 
val model = new TFIDF() 
model.train(wiki) 
registerFunction("similarity", model.similarity _) 
select tweet from tweet where similarity(tweet, "$search") > 0.01
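The TFIDF class above is demo pseudocode. A rough equivalent using the actual MLlib primitives (HashingTF and IDF, available from Spark 1.1) might look like the following sketch; the table name, the registered sqlContext, and the whitespace tokenization are assumptions: 

import org.apache.spark.mllib.feature.{HashingTF, IDF}

// Wikipedia text, tokenized naively on whitespace (assumption).
val wikiText = sqlContext.sql("select text from wiki").map(_.getString(0))
val docs = wikiText.map(_.split(" ").toSeq)

val tf = new HashingTF().transform(docs)   // term-frequency vectors
tf.cache()
val idfModel = new IDF().fit(tf)           // the trained "TF-IDF model"
val tfidf = idfModel.transform(tf)         // IDF-weighted vectors
// A "similarity" UDF would embed the tweet the same way and compare vectors.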
DEMO 
http://youtu.be/dJQ5lV5Tldw?t=39m30s
Q & A
