Apache: Big Data - Starting with Apache Spark, Best Practices
Starting with Apache Spark, Best Practices and Learning from the Field
Felix Cheung, Principal Engineer + Spark Committer
Spark@Microsoft
Best Practices
Enterprise Solutions
Resilient - Fault tolerant
19,500+ commits
Tungsten
AMPLab becoming RISELab
• Drizzle – low latency execution, 3.5x lower latency than Spark Streaming
• Ernest – performance prediction, automatically chooses the optimal resource config on the cloud
Deployment
Scheduler
Resource Manager (aka Cluster Manager)
- Spark History Server, Spark UI
Spark Core
Parallelization, Partition, Transformation, Action, Shuffle

Parallelization – doing multiple things at the same time
Partition – a unit of parallelization
Transformation – manipulating data, immutable; "Narrow" or "Wide"
Shuffle
Processing: sorting, serialize/deserialize, compression
Transfer: disk IO, network bandwidth/latency
Takes up memory, or spills to disk for intermediate results ("shuffle file")
Action
Materialize results
Execute the chain of transformations that leads to output – lazy evaluation
count, collect -> take, write
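To make this concrete, a minimal PySpark sketch (path and column names are illustrative): transformations only build up the plan; the action at the end triggers execution.

df = spark.read.json("/data/events")        # nothing runs yet
errors = df.where(df.status == "error")     # "narrow" transformation
counts = errors.groupBy("host").count()     # "wide" transformation (shuffle)
counts.show()                               # action: executes the whole chain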
DataFrame, Dataset
Data source
Execution engine – Catalyst
SQL, Execution Plan, Predicate Pushdown

Dataset – strong typing, optimized execution
DataFrame = Dataset[Row]
Partition = set of Rows
Data source "format" – Parquet, CSV, JSON, or Cassandra, HBase
Predicate Pushdown
The ability to process expressions as early in the plan as possible
spark.read.jdbc(jdbcUrl, "food", connectionProperties)

// with pushdown
spark.read.jdbc(jdbcUrl, "food", connectionProperties).select("hotdog", "pizza", "sushi")
Streaming
Discretized Streams (DStreams)
Source – Receiver DStream or Direct DStream; Basic and Advanced Sources
Reliability – Receiver + Write Ahead Log (WAL), Checkpointing
https://databricks.com/wp-content/uploads/2015/01/blog-ha-52.png
Direct DStream
Only for reliable messaging sources that support reading from a position
Stronger fault tolerance, exactly-once*
No receiver/WAL – fewer resources, lower overhead
Checkpointing
Saving to reliable storage to recover from failure
1. Metadata checkpointing – StreamingContext.checkpoint()
2. Data checkpointing – dstream.checkpoint()
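A minimal PySpark sketch combining both (assumes a Kafka direct stream on the Spark 2.x kafka-0-8 API; broker, topic, and checkpoint paths are illustrative):

from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

def create_context():
    ssc = StreamingContext(sc, 10)               # 10s batches; sc = existing SparkContext
    ssc.checkpoint("hdfs:///checkpoints/app")    # metadata checkpointing
    stream = KafkaUtils.createDirectStream(
        ssc, ["events"], {"metadata.broker.list": "broker:9092"})
    stream.count().pprint()
    return ssc

# recover from the checkpoint if present, else build a fresh context
ssc = StreamingContext.getOrCreate("hdfs:///checkpoints/app", create_context)
ssc.start()
ssc.awaitTermination()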
ML Pipeline
Transformer
Estimator
Evaluator
Machine Learning
DataFrame-based
- leverages optimizations and supports transformations
ML Pipeline – a sequence of algorithms
- PipelineStages
[Diagram: DataFrame → Transformer → Transformer → Estimator; feature engineering, then modeling]
Feature transformer
- takes a DataFrame and its Column(s) and appends one or more new Column(s)
StopWordsRemover
Binarizer
SQLTransformer
VectorAssembler
Estimators
An algorithm
DataFrame -> Model
A Model is a Transformer
LinearRegression
KMeans
Evaluator
Metric to measure Model performance on held-out test data
MulticlassClassificationEvaluator
BinaryClassificationEvaluator
RegressionEvaluator
MLWriter/MLReader
Pipeline persistence
- includes transformers, estimators, Params
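Putting the pieces together, a minimal PySpark pipeline sketch (column names and the train/test DataFrames are illustrative):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

tokenizer = Tokenizer(inputCol="text", outputCol="words")
remover = StopWordsRemover(inputCol="words", outputCol="filtered")  # Transformer
tf = HashingTF(inputCol="filtered", outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")   # Estimator

pipeline = Pipeline(stages=[tokenizer, remover, tf, lr])  # PipelineStages
model = pipeline.fit(train)            # Estimator -> Model (a Transformer)
predictions = model.transform(test)

evaluator = BinaryClassificationEvaluator(labelCol="label")
print(evaluator.evaluate(predictions))

model.write().overwrite().save("/models/pipeline")  # MLWriter persistence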
Graph
Pregel
Graph Algorithms
Graph Queries

Graph – directed multigraph with user properties on edges and vertices
[Diagram: flight graph with vertices SEA, NYC, LAX]
PageRank
ConnectedComponents
ranks = tripGraph.pageRank(resetProbability=0.15, maxIter=5)
GraphFrames
DataFrame-based
Simplifies loading and wrangling graph data
Supports Graph Queries
Pattern matching, mixed with SQL syntax

motifs = g.find("(a)-[e]->(b); (b)-[e2]->(a); !(c)-[]->(a)").filter("a.id = 'MIA'")
Structured Streaming Model
Source, Sink, StreamingQuery

Structured Streaming – extends the same DataFrame API to include incremental execution over unbounded input
Reliability, correctness / exactly-once – checkpointing (2.1 JSON format)
Stream as Unbounded Input
https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
Watermark (2.1) - handling of late data
Streaming ETL, joining static data, partitioning, windowing
Sources: FileStreamSource, KafkaSource, MemoryStream (not for production), TextSocketSource, MQTT
Sinks: FileStreamSink (new formats in 2.1), ConsoleSink, ForeachSink (Scala only), MemorySink – as Temp View
staticDF = (
  spark
    .read
    .schema(jsonSchema)
    .json(inputPath)
)

# Take a list of files as a stream
streamingDF = (
  spark
    .readStream
    .schema(jsonSchema)
    .option("maxFilesPerTrigger", 1)
    .json(inputPath)
)

streamingCountsDF = (
  streamingDF
    .groupBy(
      streamingDF.word,
      window(streamingDF.time, "1 hour"))
    .count()
)
query = (
  streamingCountsDF
    .writeStream
    .format("memory")
    .queryName("word_counts")
    .outputMode("complete")
    .start()
)

spark.sql("select word, window, `count` from word_counts order by window")
How much data goes in affects how much work it takes
Size does matter!
CSV or JSON is "simple" but also tends to be big
JSON -> Parquet (compressed) – 7x faster
Format also matters
Recommended format – Parquet
Default data source/format
Parquet Columnar Format
• VectorizedReader
• Better dictionary decoding
• Column chunks co-located
• Metadata and headers for skipping
Recommend Parquet
Compression is a factor
gzip <100MB/s vs snappy 500MB/s
Tradeoffs: faster or smaller?
Spark 2.0+ defaults to snappy
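The codec is a one-line setting if the default doesn't fit (sketch; gzip shown only to illustrate the smaller-but-slower end of the tradeoff):

spark.conf.set("spark.sql.parquet.compression.codec", "gzip")  # or "snappy" (default)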
Sidenote: Table Partitioning
Store data into groups by partitioning columns
Encoded path structure matches Hive: table/event_date=2017-02-01
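A sketch of producing that layout on write (df and the output path are illustrative):

(df.write
   .partitionBy("event_date")
   .parquet("/warehouse/table"))  # yields table/event_date=2017-02-01/...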
Spark UI
Timeline view
https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html
Spark UI
DAG view
https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html
Executor tab
SQL tab
Understanding Queries
explain() is your friend
but it can be hard to understand at times

== Parsed Logical Plan ==
Aggregate [count(1) AS count#79L]
+- Sort [speed_y#49 ASC], true
+- Join Inner, (speed_x#48 = speed_y#49)
:- Project [speed#2 AS speed_x#48, dist#3]
: +- LogicalRDD [speed#2, dist#3]
+- Project [speed#18 AS speed_y#49, dist#19]
+- LogicalRDD [speed#18, dist#19]
== Physical Plan ==
*HashAggregate(keys=[], functions=[count(1)], output=[count#79L])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#83L])
      +- *Project
         +- *Sort [speed_y#49 ASC], true, 0
            +- Exchange rangepartitioning(speed_y#49 ASC, 200)
               +- *Project [speed_y#49]
                  +- *SortMergeJoin [speed_x#48], [speed_y#49], Inner
                     :- *Sort [speed_x#48 ASC], false, 0
                     :  +- Exchange hashpartitioning(speed_x#48, 200)
                     :     +- *Project [speed#2 AS speed_x#48]
                     :        +- *Filter isnotnull(speed#2)
                     :           +- Scan ExistingRDD[speed#2,dist#3]
                     +- *Sort [speed_y#49 ASC], false, 0
                        +- Exchange hashpartitioning(speed_y#49, 200)
                           ...
UDF
Write your own custom transforms
But... Catalyst can't see through it (yet?!)
Always prefer built-in transforms where possible
UDF vs Builtin Example
Remember Predicate Pushdown?

val isSeattle = udf { (s: String) => s == "Seattle" }
cities.where(isSeattle('name))

*Filter UDF(name#2)
+- *FileScan parquet [id#128L,name#2] Batched: true, Format: ParquetFormat, InputPaths: file:/Users/b/cities.parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint,name:string>

Note the empty PushedFilters – the UDF is opaque to Catalyst.
UDF vs Builtin Example

cities.where('name === "Seattle")

*Project [id#128L, name#2]
+- *Filter (isnotnull(name#2) && (name#2 = Seattle))
   +- *FileScan parquet [id#128L,name#2] Batched: true, Format: ParquetFormat, InputPaths: file:/Users/b/cities.parquet, PartitionFilters: [], PushedFilters: [IsNotNull(name), EqualTo(name,Seattle)], ReadSchema: struct<id:bigint,name:string>
UDF in Python
Avoid!
Why? Pickling, transfer, and extra memory to run the Python interpreter
- Hard to debug errors!

from pyspark.sql.types import IntegerType
sqlContext.udf.register("stringLengthInt", lambda x: len(x), IntegerType())
sqlContext.sql("SELECT stringLengthInt('test')").take(1)
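Where a built-in exists, prefer it; for the toy example above, the built-in length() keeps execution in the JVM:

sqlContext.sql("SELECT length('test')").take(1)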
Going for Performance
Stored in compressed Parquet
Partitioned table
Predicate Pushdown
Avoid UDF
Shuffling for Join
Can be very expensive
Optimizing for Join
Partition!
Join becomes a narrow transform if left and right are partitioned with the same scheme
Optimizing for Join
Broadcast Join (aka Map-Side Join in Hadoop)
Smaller table against large table – avoids shuffling the large table
Default: auto-broadcast under 10MB (spark.sql.autoBroadcastJoinThreshold)
BroadcastHashJoin

left.join(right, Seq("id"), "leftanti").explain

== Physical Plan ==
*BroadcastHashJoin [id#50], [id#60], LeftAnti, BuildRight
:- LocalTableScan [id#50, left#51]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
   +- LocalTableScan [id#60]
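When auto-broadcast doesn't kick in (e.g. the table is just over the threshold), the hint can be forced from code; a sketch, where small and large are illustrative DataFrames:

from pyspark.sql.functions import broadcast
large.join(broadcast(small), "id").explain()  # expect BroadcastHashJoin in the plan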
Repartition
To numPartitions or by Columns
Increases parallelism – will shuffle
coalesce() – combines partitions in place, avoiding a full shuffle
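For example (partition counts are illustrative):

df.repartition(200, "id")  # full shuffle into 200 partitions by id
df.coalesce(10)            # merge down to 10 partitions without a full shuffle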
Cache
cache() or persist()
Evicts least-recently-used (LRU)
- Make sure there is enough memory!
MEMORY_AND_DISK to avoid expensive recompute (but spilling to disk is slow)
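A one-line sketch:

from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)  # spill instead of recomputing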
Streaming
Use Structured Streaming (2.1+)
If not...
If reliable messaging (Kafka) use Direct
DStream
Metadata checkpointing
- Config
- Position in the streaming source (aka offset)
  - could get duplicates! (at-least-once)
- Pending batches

Data checkpointing
- Persist stateful transformations – data lost if not saved
- Cuts short lineage that could otherwise grow indefinitely
Direct DStream
Checkpoints also store offsets
Turn off auto-commit
- commit only when in a good state, for exactly-once
Checkpointing
Stream/ML/Graph/SQL
- more efficient for indefinite/iterative workloads
- recovery
Generally not versioning-safe
Use a reliable distributed file system (caution on "object stores")
[Architecture: WebLog analytics – hourly Spark SQL jobs on Hadoop/HDFS with Hive and the Hive Metastore, pulling external data sources and serving BI tools and a front end; Kafka feeding Spark Streaming and Spark ML for a near-real-time path (end-to-end roundtrip: 8-20 sec) back to the front end, with HDFS for offline analysis]
[Architecture: Spark SQL and Hive alongside a SQL appliance over RDBMS sources, serving BI tools]
[Architecture: message bus (Kafka) into Spark Streaming and Spark ML; storage/data lake with the Hive Metastore and external data sources, queried by Spark SQL for visualization and BI tools]
[Architecture: Flume and Kafka into Spark Streaming, landing in HDFS; Spark SQL, Hive, and Presto serving SQL, visualization, BI tools, and data science notebooks]
[Architecture: message bus into Spark Streaming to storage; Spark SQL for SQL access, with Data Factory moving data between environments]
https://www.linkedin.com/in/felix-cheung-b4067510
https://github.com/felixcheung