Lightning Fast Big Data Analytics with Apache Spark 
Andy Petrella (@noootsab), Big Data Hacker 
Gerard Maas (@maasg), Data Processing Team Lead 
#devoxx #sparkvoxx @noootsab @maasg
Agenda 
What is Spark? 
Spark Foundation: The RDD 
Demo 
Ecosystem 
Examples 
Resources 
#devoxx #sparkvoxx @noootsab @maasg
Memory, Network, CPUs 
(and don’t forget to throw some disks in the mix) 
#devoxx #sparkvoxx @noootsab @maasg
What is Spark? 
Spark is a fast and general engine for large-scale distributed data processing. 
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Fast, Functional, Growing Ecosystem 
#devoxx #sparkvoxx @noootsab @maasg
Spark: A Strong Open Source Project 
27/02: Apache top-level project 
30/05: Spark 1.0.0 released 
11/09: Spark 1.1.0 released 
42 contributors → 118 contributors → 176 contributors 
[chart: #commits over time; src: github.com/apache/spark] 
#devoxx #sparkvoxx @noootsab @maasg
Compared to Map-Reduce 
public class WordCount {
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Spark 
#devoxx #sparkvoxx @noootsab @maasg
The Big Idea... 
Express computations in terms of operations on a data set. 
Spark Core Concept: RDD => Resilient Distributed Dataset 
Think of an RDD as an immutable, distributed collection of objects 
• Resilient => Can be reconstructed in case of failure 
• Distributed => Transformations are parallelizable operations 
• Dataset => Data loaded and partitioned across cluster nodes (executors) 
RDDs are memory-intensive. Caching behavior is controllable. 
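A minimal sketch of that control, assuming a SparkContext bound to sc (path illustrative):

// Minimal sketch: building an RDD and controlling its caching (assumes a SparkContext named sc)
import org.apache.spark.storage.StorageLevel

val lines  = sc.textFile("hdfs://...")                                   // an RDD[String], partitioned across executors
val cached = lines.filter(_.nonEmpty).persist(StorageLevel.MEMORY_ONLY)  // keep it in memory
cached.count()                                                           // the first action materializes and caches the partitions
cached.unpersist()                                                       // release the memory when it is no longer needed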
#devoxx #sparkvoxx @noootsab @maasg
RDDs 
Executors 
Spark Cluster 
HDFS 
#devoxx #sparkvoxx @noootsab @maasg
RDDs 
.textFile("...") RDD 
Partitions 
#devoxx #sparkvoxx @noootsab @maasg
RDDs 
.textFile("...").flatMap(l => l.split(" ")) 
#devoxx #sparkvoxx @noootsab @maasg
RDDs 
.textFile("...").flatMap(l => l.split(" ")).map(w => (w,1)) 
[diagram: each partition now holds (word, 1) pairs] 
#devoxx #sparkvoxx @noootsab @maasg
RDDs 
.textFile("...").flatMap(l => l.split(" ")).map(w => (w,1)) 
[diagram: partitions of (word, 1) pairs] 
.reduceByKey(_ + _) 
[diagram: partitions of (word, count) pairs] 
#devoxx #sparkvoxx @noootsab @maasg
RDDs 
.textFile("...").flatMap(l => l.split(" ")).map(w => (w,1)) 
[diagram: partitions of (word, 1) pairs] 
.reduceByKey(_ + _) 
[diagram: partitions of (word, count) pairs, partial results being combined across partitions] 
#devoxx #sparkvoxx @noootsab @maasg
RDDs 
.textFile("...").flatMap(l => l.split(" ")).map(w => (w,1)) 
[diagram: partitions of (word, 1) pairs] 
.reduceByKey(_ + _) 
[diagram: (word, count) pairs combined across partitions into the final counts] 
#devoxx #sparkvoxx @noootsab @maasg
The Spark Lingo 
.textFile("...").flatMap(l => l.split(" ")).map(w => (w,1)) 
[diagram: the same word-count pipeline, annotated with the terms below] 
Job, Cluster, Executor, RDD, Partition, Stage, Task 
#devoxx #sparkvoxx @noootsab @maasg
Spark: RDD Operations 
[diagram: the SparkContext reads INPUT DATA (HDFS, text/sequence files) into RDDs, transforms them, and writes OUTPUT DATA (HDFS, text/sequence files, Cassandra)] 
#devoxx #sparkvoxx @noootsab @maasg
Transformations 
Inner Manipulations 
> map, flatMap, filter, distinct 
Cross RDD 
> union, subtract, intersection, join, cartesian 
Structural reorganization (Expensive) 
> groupBy, aggregate, sort 
Tuning 
> coalesce, repartition 
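Putting a few of the groups above together in a shell-style sketch (path and data illustrative; transformations stay lazy until an action runs):

// A few transformations, sketched shell-style (lazy: nothing runs until an action is called)
val words    = sc.textFile("hdfs://...").flatMap(_.split(" ")).filter(_.nonEmpty)  // inner manipulations
val distinct = words.distinct()                                                    // inner manipulation
val extended = distinct.union(sc.parallelize(Seq("devoxx", "spark")))              // cross-RDD
val byLetter = words.map(w => (w.head, w)).groupByKey()                            // structural reorganization (shuffle!)
val tuned    = byLetter.repartition(8)                                             // tuning the partitioning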
#devoxx #sparkvoxx @noootsab @maasg
Actions 
Fetch Data 
> collect, take, first, takeSample 
Aggregate Results 
> reduce, count, countByKey 
Output 
> foreach, foreachPartition, save* 
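And the matching sketch for actions, each of which triggers a job on the cluster (paths illustrative):

// A few actions, each of which ships tasks to the executors and brings results back
val counts = sc.textFile("hdfs://...").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.take(5)                                // fetch a small sample of results to the driver
counts.count()                                // aggregate: how many distinct words?
counts.saveAsTextFile("hdfs://...")           // output: write the results back to storage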
#devoxx #sparkvoxx @noootsab @maasg
RDD Lineage 
Each RDD keeps track of its parent. 
This is the basis for DAG scheduling 
and fault recovery 
val file = spark.textFile("hdfs://...")
val wordsRDD = file.flatMap(line => line.split(" "))
                   .map(word => (word, 1))
                   .reduceByKey(_ + _)
val scoreRDD = wordsRDD.map{ case (k, v) => (v, k) }
Lineage: HadoopRDD → MappedRDD → FlatMappedRDD → MappedRDD → MapPartitionsRDD → ShuffledRDD → MapPartitionsRDD (wordsRDD) → MappedRDD (scoreRDD) 
rdd.toDebugString is your friend 
#devoxx #sparkvoxx @noootsab @maasg
Spark has Support for... 
[diagram: language support matrix: Scala, Java, Python and R, each with API / Shell / Notebook availability] 
The Spark Shell is the best way to start exploring Spark 
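A hedged first session in the shell, where a SparkContext is already bound to sc (path illustrative):

// ./bin/spark-shell starts a Scala REPL with a SparkContext already bound to `sc`
val lines = sc.textFile("hdfs://...")
lines.count()                                                // how many lines?
lines.filter(_.contains("Spark")).take(10).foreach(println)  // peek at a few matching lines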
#devoxx #sparkvoxx @noootsab @maasg
Demo 
Exploring and 
transforming data with 
the Spark Shell 
Acknowledgments: 
Book data provided by Project Gutenberg (http://www.gutenberg.org/) 
through https://www.opensciencedatacloud.org/ 
Cluster computing resources provided by http://www.virdata.com 
#devoxx #sparkvoxx @noootsab @maasg
#devoxx #sparkvoxx @noootsab @maasg
Agenda 
What is Spark? 
Spark Foundation: The RDD 
Demo 
Ecosystem 
Examples 
Resources 
#devoxx #sparkvoxx @noootsab @maasg
Ecosystem 
Now we know what Spark is! 
At least, we know its Core; let’s call it the SDK. 
Thanks to its great and enthusiastic community, 
Spark Core has been used in an ever-growing number of fields. 
Hence the ecosystem is evolving fast. 
#devoxx #sparkvoxx @noootsab @maasg
Higher level primitives ... 
… or APIs 
… or the rise of the popolo 
If Spark Core is the fold of distributed computing, 
then we’re going to look at the map, filter, countBy, groupBy, ... 
#devoxx #sparkvoxx @noootsab @maasg
Spark Streaming 
When you have big fat streams behaving as one single collection 
t 
DStream[T] 
RDD[T] RDD[T] RDD[T] RDD[T] RDD[T] 
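A minimal sketch of a streaming word count over a socket source (host, port and batch interval illustrative):

// Hedged sketch: a streaming word count over a socket source
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

val ssc    = new StreamingContext(sc, Seconds(5))         // the DStream yields one RDD per 5-second batch
val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                             // runs on every micro-batch RDD
ssc.start()
ssc.awaitTermination()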
#devoxx #sparkvoxx @noootsab @maasg
Spark Streaming 
#devoxx #sparkvoxx @noootsab @maasg
Spark SQL 
From SQL to NoSQL to SQL … to NoSQL 
Structured Query Language 
We’re not really querying, we’re processing 
SQL provides the mathematical (abstract) structures to manipulate data 
We can optimize: Spark has Catalyst 
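A hedged sketch of the 1.1-era API (SQLContext and SchemaRDD; names and threshold illustrative):

// Hedged sketch of Spark SQL circa 1.1 (SQLContext / SchemaRDD); names illustrative
import org.apache.spark.sql.SQLContext

case class WordCount(word: String, total: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD                 // implicit: RDD of case classes => SchemaRDD

val counts = sc.textFile("hdfs://...").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
val table  = counts.map { case (w, n) => WordCount(w, n) }
table.registerTempTable("word_counts")
sqlContext.sql("SELECT word, total FROM word_counts WHERE total > 10 ORDER BY total DESC").take(5)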
#devoxx #sparkvoxx @noootsab @maasg
Spark SQL 
#devoxx #sparkvoxx @noootsab @maasg
MLlib 
“The library to teach them all” 
SciPy, scikit-learn, R, MATLAB and co. → learning on one machine 
(sadly often, one core) 
SVM, lm, NaiveBayes, PCA, K-Means, ALS, SVD 
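A minimal MLlib sketch, clustering numeric features with K-Means (input format, k and iterations are illustrative assumptions):

// Hedged sketch: K-Means with MLlib (assumes each input line is a comma-separated vector of doubles)
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.textFile("hdfs://...")
               .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
               .cache()                              // iterative algorithms re-read the data many times

val model = KMeans.train(points, 5, 20)              // k = 5 clusters, 20 iterations (illustrative values)
model.clusterCenters.foreach(println)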
#devoxx #sparkvoxx @noootsab @maasg
GraphX 
Connecting the dots 
Graph processing at scale. 
> Take edges 
> Add some nodes 
> Combine = send messages (Pregel) 
#devoxx #sparkvoxx @noootsab @maasg
GraphX 
Connecting the dots 
Graph processing at scale. 
> Take edges 
> Link nodes 
> Combine/Send messages 
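A hedged GraphX sketch, building a tiny graph and letting the built-in algorithms do the message passing (data illustrative):

// Hedged sketch: building a small graph and running built-in algorithms
import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize(Seq((1L, "Andy"), (2L, "Gerard"), (3L, "Devoxx")))
val edges    = sc.parallelize(Seq(Edge(1L, 3L, "speaks-at"), Edge(2L, 3L, "speaks-at"), Edge(1L, 2L, "pairs-with")))

val graph = Graph(vertices, edges)
graph.inDegrees.collect().foreach(println)           // combine messages along edges: in-degree per node
val ranks = graph.pageRank(0.001).vertices           // Pregel-style iteration under the hood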
#devoxx #sparkvoxx @noootsab @maasg
ADAM 
The new kid on the block in the Spark community (with the recently uncovered Thunder) 
A game-changing library for processing DNA, genotypes, variants and co. 
Comes with the right stack for processing … 
… huge amounts of legacy, vital data 
#devoxx #sparkvoxx @noootsab @maasg
Tooling (NoIDE) 
Besides the classical Eclipse, IntelliJ IDEA, NetBeans, Sublime Text and family! 
An IDE is not enough, because we are not only crafting software or services. 
Spark is for data analysis, and data scientists need 
> interactivity (exploration) 
> reproducibility (environment, data and logic) 
> shareability (results) 
#devoxx #sparkvoxx @noootsab @maasg
ISpark 
Spark-Shell backend for IPython (Worksheet for data analysts) 
#devoxx #sparkvoxx @noootsab @maasg
Zeppelin 
A well-shaped notebook based on Kibana, offering Spark-dedicated features 
> Multiple languages (Scala, SQL, Markdown, shell) 
> Dynamic forms (generating inputs) 
> Data visualization (and export) 
Check the website! 
#devoxx #sparkvoxx @noootsab @maasg
Spark Notebook 
Scala-Notebook fork, enhanced for Spark peculiarities. 
Full Scala, Akka and RxScala. 
Features including: 
> Multiple languages (Scala, SQL, Markdown, JavaScript) 
> Data visualization 
> Spark work tracking 
Try it: 
curl https://raw.githubusercontent.com/andypetrella/spark-notebook/spark/run.sh | bash -s dev 
#devoxx #sparkvoxx @noootsab @maasg
Databricks Cloud 
The amazing product crafted by the company behind Spark! 
We cannot say much more than that this product will be amazing. 
Fully collaborative, with dashboard creation and publication. 
Register for a beta account (still eagerly waiting for mine!) 
Go there 
#devoxx #sparkvoxx @noootsab @maasg
Examples 
#devoxx #sparkvoxx @noootsab @maasg
Mining DNA 
#devoxx #sparkvoxx @noootsab @maasg
#devoxx #sparkvoxx @noootsab @maasg
Mining Geodata 
#devoxx #sparkvoxx @noootsab @maasg
Dallas vs. Seattle: 
divergence of 18.4 
#devoxx #sparkvoxx @noootsab @maasg
Mining Texts 
#devoxx #sparkvoxx @noootsab @maasg
A small project, just for fun 
Process a Wikipedia XML dump stored in HDFS 
Convert XML (multi-line records) to CSV 
Push to S3 
Sampling 
#devoxx #sparkvoxx @noootsab @maasg
A small project, just for fun 
Compute some stats: TF-IDF 
Train a NaiveBayes classifier 
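A hedged sketch of that pipeline with MLlib, assuming docs is an RDD of (label, tokenized article) pairs (the name docs and the labels are illustrative):

// Hedged sketch: TF-IDF features + NaiveBayes with MLlib
// (assumes docs: RDD[(Double, Seq[String])] of (label, tokenized article) pairs; the name is illustrative)
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.regression.LabeledPoint

val labels = docs.map(_._1)
val tf     = new HashingTF().transform(docs.map(_._2))   // hashed term frequencies
tf.cache()
val tfidf  = new IDF().fit(tf).transform(tf)             // down-weight terms that appear everywhere

val training = labels.zip(tfidf).map { case (label, vector) => LabeledPoint(label, vector) }
val model    = NaiveBayes.train(training)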
#devoxx #sparkvoxx @noootsab @maasg
A small project, just for fun 
See what the machine can say 
#devoxx #sparkvoxx @noootsab @maasg
A small project, just for fun 
But… quite some data 
#devoxx #sparkvoxx @noootsab @maasg
A Word of Advice 
Spark’s beautiful simplicity is often overshadowed by the complexity of building 
and maintaining a working distributed system. 
Sharpen up your Ops skills… 
… or ooops 
#devoxx #sparkvoxx @noootsab @maasg
Resources 
Project website: http://spark.apache.org/ 
Spark presentations: http://spark-summit.org/2014 
Starting Questions: http://stackoverflow.com/questions/tagged/apache-spark 
More Advanced Questions: user@spark.apache.org 
Source Code: https://github.com/apache/spark 
Getting involved: http://spark.apache.org/community.html 
#devoxx #sparkvoxx @noootsab @maasg
Acknowledgments 
Devoxx! 
Virdata → Shell Demo cluster 
NextLab → Wikipedia ML Cluster 
Rand Hindi (Snips) → Geodata example 
Xavier Tordoir (SilicoCloud) → DNA example 
#devoxx #sparkvoxx @noootsab @maasg
Answers! 
#devoxx #sparkvoxx @noootsab @maasg
