Lightning Fast Big Data Analytics with Apache Spark 
Andy Petrella (@noootsab), Big Data Hacker 
Gerard Maas (@maasg), Data Processing Team Lead 
#devoxx #sparkvoxx @noootsab @maasg
Agenda 
What is Spark? 
Spark Foundation: The RDD 
Demo 
Ecosystem 
Examples 
Resources 
#devoxx #sparkvoxx @noootsab @maasg
Memory, Network, CPUs 
(and don’t forget to throw some disks in the mix) 
#devoxx #sparkvoxx @noootsab @maasg
What is Spark? 
Spark is a fast and general engine for large-scale distributed data processing. 
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Fast, Functional, Growing Ecosystem 
#devoxx #sparkvoxx @noootsab @maasg
Spark: A Strong Open Source Project 
27/02: Apache top-level project 
30/05: Spark 1.0.0 released 
11/09: Spark 1.1.0 released 
42 contributors → 118 contributors → 176 contributors 
[chart: #commits over time; src: github.com/apache/spark] 
#devoxx #sparkvoxx @noootsab @maasg
Compared to Map-Reduce 
public class WordCount {
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Spark 
#devoxx #sparkvoxx @noootsab @maasg
The Big Idea... 
Express computations in terms of operations on a data set. 
Spark Core Concept: RDD => Resilient Distributed Dataset 
Think of an RDD as an immutable, distributed collection of objects 
• Resilient => Can be reconstructed in case of failure 
• Distributed => Transformations are parallelizable operations 
• Dataset => Data loaded and partitioned across cluster nodes (executors) 
RDDs are memory-intensive. Caching behavior is controllable. 
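A minimal sketch of that control, assuming a SparkContext bound to sc (path illustrative):

// Minimal sketch: building an RDD and controlling its caching (assumes a SparkContext named sc)
import org.apache.spark.storage.StorageLevel

val lines  = sc.textFile("hdfs://...")                                   // an RDD[String], partitioned across executors
val cached = lines.filter(_.nonEmpty).persist(StorageLevel.MEMORY_ONLY)  // keep it in memory
cached.count()                                                           // the first action materializes and caches the partitions
cached.unpersist()                                                       // release the memory when it is no longer needed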
#devoxx #sparkvoxx @noootsab @maasg
RDDs 
Executors 
Spark Cluster 
HDFS 
#devoxx #sparkvoxx @noootsab @maasg
RDDs 
.textFile("...") RDD 
Partitions 
#devoxx #sparkvoxx @noootsab @maasg
RDDs 
.textFile("...").flatMap(l => l.split(" ")) 
#devoxx #sparkvoxx @noootsab @maasg
RDDs 
.textFile("...").flatMap(l => l.split(" ")).map(w => (w,1)) 
[diagram: each partition now holds (word, 1) pairs] 
#devoxx #sparkvoxx @noootsab @maasg
RDDs 
.textFile("...").flatMap(l => l.split(" ")).map(w => (w,1)) 
[diagram: partitions of (word, 1) pairs] 
.reduceByKey(_ + _) 
[diagram: partitions of (word, count) pairs] 
#devoxx #sparkvoxx @noootsab @maasg
RDDs 
.textFile("...").flatMap(l => l.split(" ")).map(w => (w,1)) 
[diagram: partitions of (word, 1) pairs] 
.reduceByKey(_ + _) 
[diagram: partitions of (word, count) pairs, partial results being combined across partitions] 
#devoxx #sparkvoxx @noootsab @maasg
RDDs 
.textFile("...").flatMap(l => l.split(" ")).map(w => (w,1)) 
[diagram: partitions of (word, 1) pairs] 
.reduceByKey(_ + _) 
[diagram: (word, count) pairs combined across partitions into the final counts] 
#devoxx #sparkvoxx @noootsab @maasg
The Spark Lingo 
.textFile("...").flatMap(l => l.split(" ")).map(w => (w,1)) 
[diagram: the same word-count pipeline, annotated with the terms below] 
Job, Cluster, Executor, RDD, Partition, Stage, Task 
#devoxx #sparkvoxx @noootsab @maasg
Spark: RDD Operations 
[diagram: the SparkContext reads INPUT DATA (HDFS, text/sequence files) into RDDs, transforms them, and writes OUTPUT DATA (HDFS, text/sequence files, Cassandra)] 
#devoxx #sparkvoxx @noootsab @maasg
Transformations 
Inner Manipulations 
> map, flatMap, filter, distinct 
Cross RDD 
> union, subtract, intersection, join, cartesian 
Structural reorganization (Expensive) 
> groupBy, aggregate, sort 
Tuning 
> coalesce, repartition 
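Putting a few of the groups above together in a shell-style sketch (path and data illustrative; transformations stay lazy until an action runs):

// A few transformations, sketched shell-style (lazy: nothing runs until an action is called)
val words    = sc.textFile("hdfs://...").flatMap(_.split(" ")).filter(_.nonEmpty)  // inner manipulations
val distinct = words.distinct()                                                    // inner manipulation
val extended = distinct.union(sc.parallelize(Seq("devoxx", "spark")))              // cross-RDD
val byLetter = words.map(w => (w.head, w)).groupByKey()                            // structural reorganization (shuffle!)
val tuned    = byLetter.repartition(8)                                             // tuning the partitioning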
#devoxx #sparkvoxx @noootsab @maasg
Actions 
Fetch Data 
> collect, take, first, takeSample 
Aggregate Results 
> reduce, count, countByKey 
Output 
> foreach, foreachPartition, save* 
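And the matching sketch for actions, each of which triggers a job on the cluster (paths illustrative):

// A few actions, each of which ships tasks to the executors and brings results back
val counts = sc.textFile("hdfs://...").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.take(5)                                // fetch a small sample of results to the driver
counts.count()                                // aggregate: how many distinct words?
counts.saveAsTextFile("hdfs://...")           // output: write the results back to storage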
#devoxx #sparkvoxx @noootsab @maasg
RDD Lineage 
Each RDD keeps track of its parent. 
This is the basis for DAG scheduling 
and fault recovery 
val file = spark.textFile("hdfs://...")
val wordsRDD = file.flatMap(line => line.split(" "))
                   .map(word => (word, 1))
                   .reduceByKey(_ + _)
val scoreRDD = wordsRDD.map{ case (k, v) => (v, k) }
Lineage: HadoopRDD → MappedRDD → FlatMappedRDD → MappedRDD → MapPartitionsRDD → ShuffledRDD → MapPartitionsRDD (wordsRDD) → MappedRDD (scoreRDD) 
rdd.toDebugString is your friend 
#devoxx #sparkvoxx @noootsab @maasg
Spark has Support for... 
[diagram: language support matrix: Scala, Java, Python and R, each with API / Shell / Notebook availability] 
The Spark Shell is the best way to start exploring Spark 
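A hedged first session in the shell, where a SparkContext is already bound to sc (path illustrative):

// ./bin/spark-shell starts a Scala REPL with a SparkContext already bound to `sc`
val lines = sc.textFile("hdfs://...")
lines.count()                                                // how many lines?
lines.filter(_.contains("Spark")).take(10).foreach(println)  // peek at a few matching lines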
#devoxx #sparkvoxx @noootsab @maasg
Demo 
Exploring and 
transforming data with 
the Spark Shell 
Acknowledgments: 
Book data provided by Project Gutenberg (http://www.gutenberg.org/) 
through https://www.opensciencedatacloud.org/ 
Cluster computing resources provided by http://www.virdata.com 
#devoxx #sparkvoxx @noootsab @maasg
#devoxx #sparkvoxx @noootsab @maasg
Agenda 
What is Spark? 
Spark Foundation: The RDD 
Demo 
Ecosystem 
Examples 
Resources 
#devoxx #sparkvoxx @noootsab @maasg
Ecosystem 
Now we know what Spark is! 
At least, we know its Core; let’s call it the SDK. 
Thanks to its great and enthusiastic community, 
Spark Core has been used in an ever-growing number of fields. 
Hence the ecosystem is evolving fast. 
#devoxx #sparkvoxx @noootsab @maasg
Higher level primitives ... 
… or APIs 
… or the rise of the popolo 
If Spark Core is the fold of distributed computing, 
then we’re going to look at the map, filter, countBy, groupBy, ... 
#devoxx #sparkvoxx @noootsab @maasg
Spark Streaming 
When you have big fat streams behaving as one single collection 
t 
DStream[T] 
RDD[T] RDD[T] RDD[T] RDD[T] RDD[T] 
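A minimal sketch of a streaming word count over a socket source (host, port and batch interval illustrative):

// Hedged sketch: a streaming word count over a socket source
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

val ssc    = new StreamingContext(sc, Seconds(5))         // the DStream yields one RDD per 5-second batch
val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                             // runs on every micro-batch RDD
ssc.start()
ssc.awaitTermination()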
#devoxx #sparkvoxx @noootsab @maasg
Spark Streaming 
#devoxx #sparkvoxx @noootsab @maasg
Spark SQL 
From SQL to NoSQL to SQL … to NoSQL 
Structured Query Language 
We’re not really querying, we’re processing 
SQL provides the mathematical (abstract) structures to manipulate data 
We can optimize: Spark has Catalyst 
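A hedged sketch of the 1.1-era API (SQLContext and SchemaRDD; names and threshold illustrative):

// Hedged sketch of Spark SQL circa 1.1 (SQLContext / SchemaRDD); names illustrative
import org.apache.spark.sql.SQLContext

case class WordCount(word: String, total: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD                 // implicit: RDD of case classes => SchemaRDD

val counts = sc.textFile("hdfs://...").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
val table  = counts.map { case (w, n) => WordCount(w, n) }
table.registerTempTable("word_counts")
sqlContext.sql("SELECT word, total FROM word_counts WHERE total > 10 ORDER BY total DESC").take(5)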
#devoxx #sparkvoxx @noootsab @maasg
Spark SQL 
#devoxx #sparkvoxx @noootsab @maasg
MLlib 
“The library to teach them all” 
SciPy, scikit-learn, R, MATLAB and co. → learning on one machine 
(sadly often, one core) 
SVM, lm, NaiveBayes, PCA, K-Means, ALS, SVD 
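A minimal MLlib sketch, clustering numeric features with K-Means (input format, k and iterations are illustrative assumptions):

// Hedged sketch: K-Means with MLlib (assumes each input line is a comma-separated vector of doubles)
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.textFile("hdfs://...")
               .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
               .cache()                              // iterative algorithms re-read the data many times

val model = KMeans.train(points, 5, 20)              // k = 5 clusters, 20 iterations (illustrative values)
model.clusterCenters.foreach(println)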
#devoxx #sparkvoxx @noootsab @maasg
GraphX 
Connecting the dots 
Graph processing at scale. 
> Take edges 
> Add some nodes 
> Combine = send messages (Pregel) 
#devoxx #sparkvoxx @noootsab @maasg
GraphX 
Connecting the dots 
Graph processing at scale. 
> Take edges 
> Link nodes 
> Combine/Send messages 
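A hedged GraphX sketch, building a tiny graph and letting the built-in algorithms do the message passing (data illustrative):

// Hedged sketch: building a small graph and running built-in algorithms
import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize(Seq((1L, "Andy"), (2L, "Gerard"), (3L, "Devoxx")))
val edges    = sc.parallelize(Seq(Edge(1L, 3L, "speaks-at"), Edge(2L, 3L, "speaks-at"), Edge(1L, 2L, "pairs-with")))

val graph = Graph(vertices, edges)
graph.inDegrees.collect().foreach(println)           // combine messages along edges: in-degree per node
val ranks = graph.pageRank(0.001).vertices           // Pregel-style iteration under the hood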
#devoxx #sparkvoxx @noootsab @maasg
ADAM 
The new kid on the block in the Spark community (with the recently uncovered Thunder) 
A game-changing library for processing DNA, genotypes, variants and co. 
Comes with the right stack for processing … 
… huge amounts of legacy, vital data 
#devoxx #sparkvoxx @noootsab @maasg
Tooling (NoIDE) 
Besides the classical Eclipse, IntelliJ IDEA, NetBeans, Sublime Text and family! 
An IDE is not enough, because we are not only crafting software or services. 
Spark is for data analysis, and data scientists need 
> interactivity (exploration) 
> reproducibility (environment, data and logic) 
> shareability (results) 
#devoxx #sparkvoxx @noootsab @maasg
ISpark 
Spark-Shell backend for IPython (Worksheet for data analysts) 
#devoxx #sparkvoxx @noootsab @maasg
Zeppelin 
A well-shaped notebook based on Kibana, offering Spark-dedicated features 
> Multiple languages (Scala, SQL, Markdown, shell) 
> Dynamic forms (generating inputs) 
> Data visualization (and export) 
Check the website! 
#devoxx #sparkvoxx @noootsab @maasg
Spark Notebook 
Scala-Notebook fork, enhanced for Spark peculiarities. 
Full Scala, Akka and RxScala. 
Features including: 
> Multiple languages (Scala, SQL, Markdown, JavaScript) 
> Data visualization 
> Spark work tracking 
Try it: 
curl https://raw.githubusercontent.com/andypetrella/spark-notebook/spark/run.sh | bash -s dev 
#devoxx #sparkvoxx @noootsab @maasg
Databricks Cloud 
The amazing product crafted by the company behind Spark! 
We cannot say much more than that this product will be amazing. 
Fully collaborative, with dashboard creation and publication. 
Register for a beta account (still eagerly waiting for mine!) 
Go there 
#devoxx #sparkvoxx @noootsab @maasg
Examples 
#devoxx #sparkvoxx @noootsab @maasg
Mining DNA 
#devoxx #sparkvoxx @noootsab @maasg
#devoxx #sparkvoxx @noootsab @maasg
Mining Geodata 
#devoxx #sparkvoxx @noootsab @maasg
Dallas vs. Seattle: 
divergence of 18.4 
#devoxx #sparkvoxx @noootsab @maasg
Mining Texts 
#devoxx #sparkvoxx @noootsab @maasg
A small project, just for fun 
Process a Wikipedia XML dump stored in HDFS 
Convert XML (multi-line records) to CSV 
Push to S3 
Sampling 
#devoxx #sparkvoxx @noootsab @maasg
A small project, just for fun 
Compute some stats: TF-IDF 
Train a NaiveBayes classifier 
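A hedged sketch of that pipeline with MLlib, assuming docs is an RDD of (label, tokenized article) pairs (the name docs and the labels are illustrative):

// Hedged sketch: TF-IDF features + NaiveBayes with MLlib
// (assumes docs: RDD[(Double, Seq[String])] of (label, tokenized article) pairs; the name is illustrative)
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.regression.LabeledPoint

val labels = docs.map(_._1)
val tf     = new HashingTF().transform(docs.map(_._2))   // hashed term frequencies
tf.cache()
val tfidf  = new IDF().fit(tf).transform(tf)             // down-weight terms that appear everywhere

val training = labels.zip(tfidf).map { case (label, vector) => LabeledPoint(label, vector) }
val model    = NaiveBayes.train(training)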
#devoxx #sparkvoxx @noootsab @maasg
A small project, just for fun 
See what the machine can say 
#devoxx #sparkvoxx @noootsab @maasg
A small project, just for fun 
But… quite some data 
#devoxx #sparkvoxx @noootsab @maasg
A Word of Advice 
Spark’s beautiful simplicity is often overshadowed by the complexity of building 
and maintaining a working distributed system. 
Sharpen up your Ops skills… 
… or ooops 
#devoxx #sparkvoxx @noootsab @maasg
Resources 
Project website: http://spark.apache.org/ 
Spark presentations: http://spark-summit.org/2014 
Starting Questions: http://stackoverflow.com/questions/tagged/apache-spark 
More Advanced Questions: user@spark.apache.org 
Source Code: https://github.com/apache/spark 
Getting involved: http://spark.apache.org/community.html 
#devoxx #sparkvoxx @noootsab @maasg
Acknowledgments 
Devoxx! 
Virdata → Shell Demo cluster 
NextLab → Wikipedia ML Cluster 
Rand Hindi (Snips) → Geodata example 
Xavier Tordoir (SilicoCloud) → DNA example 
#devoxx #sparkvoxx @noootsab @maasg
Answers! 
#devoxx #sparkvoxx @noootsab @maasg
