Spark Application Development Made Fast and Easy
Shivnath Babu
Lance Co Ting Keh
Lance Co Ting Keh
Machine Learning @ Box
Distributed ML Infrastructure
Go Blue Devils!
Shivnath Babu
Associate Professor @ Duke
Chief Scientist at Unravel Data Sys.
R&D in Management of Data Systems
Spark @Box
What’s so great about Spark?
From:
https://weminoredinfilm.files.wordpress.com/2014/04/59911951.jpg
Complexity
Spark Execution
[Diagram: RDD0 -> map -> RDD1 -> filter -> RDD2 -> reduceByKey -> RDD3 -> map -> RDD4 -> saveAsTextFile -> HDFS; each column has partitions part-0 to part-3]
sc.textFile(hdfsPath)
.map(parseInput)
.filter(subThreshold)
.reduceByKey(tallyCount)
.map(formatOutput)
.saveAsTextFile(outPath)
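The chain above names four helper functions, but the slides do not show their bodies. As a minimal local sketch in plain Python (no Spark required), assuming hypothetical bodies for parseInput, subThreshold, tallyCount and formatOutput and a hypothetical "key,value" line format, the same data flow looks roughly like this:

```python
from functools import reduce
from itertools import groupby

# Hypothetical helper bodies; the slides only give the names.
def parseInput(line):
    key, value = line.split(",")
    return key, int(value)

def subThreshold(pair, threshold=100):
    # Keep only records whose value is below the (assumed) threshold.
    return pair[1] < threshold

def tallyCount(a, b):
    return a + b

def formatOutput(pair):
    return f"{pair[0]}\t{pair[1]}"

def run_locally(lines):
    """Single-process stand-in for the Spark chain, for reasoning only."""
    parsed = map(parseInput, lines)                 # .map(parseInput)
    kept = filter(subThreshold, parsed)             # .filter(subThreshold)
    # .reduceByKey(tallyCount): group records by key, then fold each group.
    by_key = sorted(kept, key=lambda kv: kv[0])
    reduced = [
        (key, reduce(tallyCount, (v for _, v in group)))
        for key, group in groupby(by_key, key=lambda kv: kv[0])
    ]
    return [formatOutput(kv) for kv in reduced]     # .map(formatOutput)

result = run_locally(["a,1", "b,5", "a,2", "b,500"])
print(result)
```

The sorted output order is a side effect of this local sketch. Spark's reduceByKey makes no promise about the order of keys in its result.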
Spark Execution
[Diagram: the same pipeline and code as above, now divided into Stage 0 and Stage 1]
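In Spark, an operation that needs a shuffle, such as reduceByKey, ends one stage and starts the next, while map and filter stay within a stage. A rough sketch of that rule in plain Python follows. The WIDE_OPS set and the split_into_stages helper are illustrative names, not Spark APIs, and the sketch places reduceByKey entirely in the later stage even though its map-side combining actually runs before the shuffle.

```python
# Assumption: a simplified classification of the operations on the slide.
# Wide operations need a shuffle; narrow ones work partition by partition.
WIDE_OPS = {"reduceByKey"}

def split_into_stages(ops):
    """Cut the operation list into stages, starting a new stage at each wide op."""
    stages = [[]]
    for op in ops:
        if op in WIDE_OPS and stages[-1]:
            stages.append([])
        stages[-1].append(op)
    return stages

pipeline = ["textFile", "map", "filter", "reduceByKey", "map", "saveAsTextFile"]
stages = split_into_stages(pipeline)
for i, stage in enumerate(stages):
    print(f"Stage {i}: {' -> '.join(stage)}")
```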
Spark Execution
[Diagram: the same pipeline and code, with Stage 0, Stage 1, and executors Exec0, Exec1, Exec2, Exec3 labeled]
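Spark runs one task per partition in each stage, and those tasks are placed on executors. The sketch below enumerates the tasks for the four partitions and two stages on the slide. The round-robin placement is an assumption made for illustration; the real scheduler also weighs data locality and free cores. assign_tasks is a hypothetical helper, not a Spark API.

```python
from collections import defaultdict

NUM_PARTITIONS = 4                      # part-0 .. part-3 on the slide
EXECUTORS = ["Exec0", "Exec1", "Exec2", "Exec3"]
STAGES = ["Stage 0", "Stage 1"]

def assign_tasks(stages, num_partitions, executors):
    """One task per (stage, partition); round-robin placement is an assumption."""
    placement = defaultdict(list)
    for stage in stages:
        for part in range(num_partitions):
            executor = executors[part % len(executors)]
            placement[executor].append((stage, f"part-{part}"))
    return dict(placement)

placement = assign_tasks(STAGES, NUM_PARTITIONS, EXECUTORS)
for executor, tasks in placement.items():
    print(executor, tasks)
```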
Anything that can go wrong,
will go wrong (at some point)
What can go wrong?
•  Failures
    •  My query failed after 6 hours!
    •  What does this exception mean?
•  Wrong results
    •  Result of my job looks wrong
•  Bad performance
    •  My app is very slow
    •  Pipeline is not meeting the 4hr SLA
•  Poor scalability
    •  Oh, but it worked on the dev cluster!
•  Bad App(le)s
    •  Tom’s query brought the cluster down
And Why?
•  Application Problems
    •  Poor choice of transformations
    •  Ineffective caching
    •  Bloated data structures
•  Data/Storage Problems
    •  Skewed data, load imbalance
    •  Small files, poor data partitioning
•  Spark Problems
    •  Shuffle
    •  Lazy evaluation causes confusion
•  Resource Problems
    •  Resource contention
    •  Performance degradation
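One of the causes listed above, skewed data, can be made concrete with a simple number. A stage finishes only when its slowest task does, so one oversized partition can hold up the whole stage. The sketch below compares the largest partition to the mean. skew_ratio and the sample record counts are hypothetical, chosen only to illustrate the idea.

```python
def skew_ratio(partition_sizes):
    """Largest partition divided by the mean partition size (1.0 = perfectly even)."""
    if not partition_sizes:
        raise ValueError("need at least one partition")
    mean = sum(partition_sizes) / len(partition_sizes)
    if mean == 0:
        return 1.0
    return max(partition_sizes) / mean

# Hypothetical record counts for part-0 .. part-3.
balanced = [250, 250, 250, 250]
skewed = [40, 30, 900, 30]

print(skew_ratio(balanced))
print(skew_ratio(skewed))
```

In PySpark, per-partition record counts for a small RDD could be collected with rdd.glom().map(len).collect() and passed to a function like this one.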
How do application developers
detect & fix these problems today?
SparkApplicationDevMadeEasy_Spark_Summit_2015
Too much to look at!
Look at Logs?
Logs in distributed systems are spread out, incomplete,
& usually very difficult to understand
There has to be a better way for
application developers to
detect & fix problems
Visualize:
Show me all relevant data in one place
Optimize:
Analyze the data for me and give me
diagnoses and fixes
Strategize:
Help me prevent the problems from
happening and meet my goals
Demo
Visualize:
Show me all relevant data in one place
Optimize:
Analyze the data for me and give me diagnoses
and fixes
Strategize:
Help me prevent the problems from happening
and meet my goals
We are hiring: jobs@unraveldata.com
Sign up for early access at:
bit.ly/unravelspark
