Apache Spark Release 1.6
Patrick Wendell
About Me: @pwendell
U.C. Berkeley PhD, left to co-found Databricks
Coordinate community roadmap
Frequent release manager for Spark
About Databricks
Founded by the Spark team; donated Spark to Apache in 2013 and leads development today.
Collaborative, cloud-hosted data platform powered by Spark
Free trial to check it out
https://databricks.com/
We’re hiring!
Apache Spark Engine
Spark Core
Spark Streaming | Spark SQL | MLlib | GraphX
Unified engine across diverse workloads & environments
Scale out, fault tolerant
Python, Java, Scala, and R APIs
Standard libraries
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Users Distributors & Apps
Spark’s 3 Month Release Cycle
For production jobs, use the latest release
To try out unreleased features or fixes, use nightly builds
people.apache.org/~pwendell/spark-nightly/
(release branch diagram: master → branch-1.6 → v1.6.0, v1.6.1)
Spark 1.6
Spark 1.6 Release
Will ship upstream through the Apache Foundation in December (likely)
Key themes
Out of the box performance
Previews of key new APIs
Follow along with me at http://bit.ly/1OBkjMM
Follow along: http://bit.ly/1lrvdLc
Memory Management in Spark: <= 1.5
• Two separate memory managers:
• Execution memory: computation in shuffles, joins, sorts, and aggregations
• Storage memory: caching and propagating internal data sources across the cluster
• Challenges with this:
• Manual intervention needed to avoid unnecessary spilling
• No good defaults for all workloads, meaning lost efficiency
• Goal: allow memory regions to shrink/grow dynamically
Unified Memory Management in Spark 1.6
• Memory can cross between the execution and storage regions
• When execution memory exceeds its own region, it can borrow as much of the storage space as is free, and vice versa
• Borrowed storage memory can be evicted at any time
• Significantly reduces configuration
• Can define a low water mark for storage (below which we won't evict)
• Reference: [SPARK-10000]
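As a concrete sketch, the unified model is controlled by a handful of configuration properties in the 1.6 release; the values shown here are the documented defaults:

```properties
# Fraction of JVM heap shared by execution and storage combined (Spark 1.6 default).
spark.memory.fraction          0.75
# Portion of that unified region protected from eviction (the storage low water mark).
spark.memory.storageFraction   0.5
# Switch to fall back to the pre-1.6 static split, if a workload was tuned for it.
spark.memory.useLegacyMode     false
```

Most workloads should need none of these; the point of SPARK-10000 is that the regions resize themselves.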
History of Spark APIs
RDD API (2011): distributed collection of JVM objects; functional operators (map, filter, etc.)
DataFrame API (2013): distributed collection of Row objects; expression-based operations and UDFs; logical plans and optimizer; fast/efficient
Dataset
An "Encoder" converts from a JVM object into a Dataset Row
Check out [SPARK-9999]
(diagram: JVM Object → encoder → Dataset Row)
Dataset API in Spark 1.6
Typed interface over DataFrames / Tungsten
case class Person(name: String, age: Long)
val dataframe = sqlContext.read.json("people.json")
val ds: Dataset[Person] = dataframe.as[Person]
ds.filter(p => p.name.startsWith("M"))
  .groupBy($"name")
  .avg("age")
Tungsten Execution
(stack diagram: SQL | Python | R | Streaming | Advanced Analytics, layered over DataFrame (& Dataset), layered over Tungsten Execution)
Other Notable Core Engine Features
SQL directly over files
Advanced JSON parsing
Better instrumentation for SQL operators
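The "SQL directly over files" feature ([SPARK-11197]) lets a query reference a file path in place of a table name. A minimal sketch, assuming a running SQLContext and a Parquet file at the (hypothetical) path shown:

```scala
// Query a file in place -- no temp-table registration needed (Spark 1.6).
// The path and its contents are illustrative assumptions.
val df = sqlContext.sql("SELECT * FROM parquet.`/data/people.parquet`")
df.show()
```

The same `format.`path`` syntax works for other built-in data sources such as json.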
Demos of What We Learned So Far
Advanced Layout of Cached Data
Storing partitioning and ordering schemes in the in-memory table scan allows for performance improvements: e.g., in joins, an extra partition step can be saved based on this information
Adding distributeBy and localSort to the DataFrame API
Similar to HiveQL's DISTRIBUTE BY
Allows the user to control the partitioning and ordering of a data set
Check out [SPARK-4849]
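A sketch of that DataFrame-level control. The names distributeBy/localSort were provisional at talk time; the released 1.6 API exposes the equivalents as repartition and sortWithinPartitions, assumed here along with a hypothetical df that has a key column:

```scala
// Co-locate rows by key (like HiveQL DISTRIBUTE BY), then order rows within
// each partition (like SORT BY) -- no global sort is required.
val laidOut = df.repartition($"key").sortWithinPartitions($"key")
```

Caching `laidOut` preserves this layout, which is what lets a later join skip its own partitioning step.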
[Streaming] New improved state management
Introducing a DStream transformation for stateful stream processing
Does not scan every key
Easier to implement common use cases:
timeout of idle data
returning items other than state
Supersedes updateStateByKey in functionality and performance.
trackStateByKey (note: this name may change)
[Streaming] trackStateByKey example
(name may change)
// Initial RDD input
val initialRDD = ssc.sparkContext.parallelize(...)
// ReceiverInputDStream
val lines = ssc.socketTextStream(...)
val words = lines.flatMap(...)
val wordDStream = words.map(x => (x, 1))
// stateDStream using trackStateByKey
val trackStateFunc = (...) { ... }
val stateDStream = wordDStream.trackStateByKey(
  StateSpec.function(trackStateFunc).initialState(initialRDD))
[Streaming] Display the failed output op in Streaming
Check out: [SPARK-10885], PR #8950
[MLlib]: Pipeline persistence
Persist ML Pipelines to:
Save models in the spark.ml API
Re-run workflows in a reproducible manner
Export models to non-Spark apps (e.g., a model server)
This is more complex than ML model persistence because:
Must persist Transformers and Estimators, not just Models.
We need a standard way to persist Params.
Pipelines and other meta-algorithms can contain other Transformers and Estimators, including as Params.
We should save feature metadata with Models
[MLlib]: Pipeline persistence
Reference: [SPARK-6725]
Adding model export/import to the spark.ml API.
Adding the internal Saveable/Loadable API and a Parquet-based format.
R-like statistics for GLMs
Provide R-like summary statistics for ordinary least squares via a normal equation solver
Check out [SPARK-9836]
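A sketch of how this surfaces in spark.ml 1.6, assuming a hypothetical DataFrame `training` with "label" and "features" columns:

```scala
// Fit OLS with the normal-equation solver, then read R-like summary
// statistics off the fitted model (sketch; requires a Spark runtime).
import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression().setSolver("normal")
val model = lr.fit(training)
println(model.summary.r2)   // R-squared, analogous to R's summary(lm(...))
```

The normal-equation path is what makes these closed-form statistics cheap to compute alongside the fit.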
Performance
SPARK-10000 Unified Memory Management - Shared memory for execution and caching instead of exclusive division of the regions.
SPARK-10917, SPARK-11149 In-memory Columnar Cache Performance - Significant (up to 14x) speedup when caching data that contains complex types in DataFrames or SQL.
SPARK-11389 SQL Execution Using Off-Heap Memory - Support for configuring query execution to use off-heap memory, avoiding GC overhead.
Performance (continued)
SPARK-4849 Advanced Layout of Cached Data - Storing partitioning and ordering schemes in the in-memory table scan, and adding distributeBy and localSort to the DataFrame API.
SPARK-9858 Adaptive query execution - Initial support for automatically selecting the number of reducers for joins and aggregations.
Spark SQL
SPARK-9999 Dataset API
SPARK-11197 SQL Queries on Files
SPARK-11745 Reading non-standard JSON files
SPARK-10412 Per-operator Metrics for SQL Execution
SPARK-11329 Star (*) expansion for StructTypes
SPARK-11111 Fast null-safe joins
SPARK-10978 Datasource API: Avoid Double Filter
Spark Streaming
API Updates
SPARK-2629 New improved state management
SPARK-11198 Kinesis record deaggregation
SPARK-10891 Kinesis message handler function
SPARK-6328 Python Streaming Listener API
UI Improvements
Made failures visible in the streaming tab: in the timelines, batch list, and batch details page.
Made output operations visible in the streaming tab as progress bars.
MLlib: New algorithms / models
SPARK-8518 Survival analysis - Log-linear model for survival analysis
SPARK-9834 Normal equation for least squares - Normal equation solver, providing R-like model summary statistics
SPARK-3147 Online hypothesis testing - A/B testing in the Spark Streaming framework
SPARK-9930 New feature transformers - ChiSqSelector, QuantileDiscretizer, SQL transformer
SPARK-6517 Bisecting K-Means clustering - Fast top-down clustering variant of K-Means
MLlib: API Improvements
ML Pipelines
SPARK-6725 Pipeline persistence - Save/load for ML Pipelines, with partial coverage of spark.ml algorithms
SPARK-5565 LDA in ML Pipelines - API for Latent Dirichlet Allocation in ML Pipelines
R API
SPARK-9836 R-like statistics for GLMs - (Partial) R-like stats for ordinary least squares via summary(model)
SPARK-9681 Feature interactions in R formula - Interaction operator ":" in R formula
Python API - Many improvements to the Python API to approach feature parity
MLlib: Miscellaneous Improvements
SPARK-7685, SPARK-9642 Instance weights for GLMs - Logistic and Linear Regression can take instance weights
SPARK-10384, SPARK-10385 Univariate and bivariate statistics in DataFrames - Variance, stddev, correlations, etc.
SPARK-10117 LIBSVM data source - LIBSVM as a SQL data source
For More Information
Apache Spark 1.6.0 Release Preview: http://apache-spark-developers-list.1001551.n3.nabble.com/ANNOUNCE-Spark-1-6-0-Release-Preview-td15314.html
Spark 1.6 Preview available in Databricks: https://databricks.com/blog/2015/11/20/announcing-spark-1-6-preview-in-databricks.html
Notebooks
Spark 1.6 Improvements Notebook: http://cdn2.hubspot.net/hubfs/438089/notebooks/Spark_1.6_Improvements.html?t=1448929686268
Spark 1.6 R Improvements Notebook: http://cdn2.hubspot.net/hubfs/438089/notebooks/Spark_1.6_R_Improvements.html?t=1448946977231
Join us at
Spark Summit East
February 16-18, 2016 | New York City
Thanks!