Apache Spark Release 1.6
Patrick Wendell
About Me: @pwendell
U.C. Berkeley PhD, left to co-found Databricks
Coordinate community roadmap
Frequent release manager for Spark
About Databricks
Founded by the Spark team; donated Spark to Apache in 2013 and leads development today.
Collaborative, cloud-hosted data platform powered by Spark
Free trial to check it out
https://databricks.com/
We’re hiring!
Apache Spark Engine
Spark Core
Spark Streaming | Spark SQL | MLlib | GraphX
Unified engine across diverse workloads & environments
Scale out, fault tolerant
Python, Java, Scala, and R APIs
Standard libraries
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Users Distributors & Apps
Spark’s 3 Month Release Cycle
For production jobs, use the latest release
To try out unreleased features or fixes, use nightly builds
people.apache.org/~pwendell/spark-nightly/
(release branch diagram: master → branch-1.6 → v1.6.0, v1.6.1)
Spark 1.6
Spark 1.6 Release
Will ship upstream through the Apache Foundation in December (likely)
Key themes
Out of the box performance
Previews of key new APIs
Follow along with me at http://bit.ly/1OBkjMM
Follow along: http://bit.ly/1lrvdLc
Memory Management in Spark: <= 1.5
• Two separate memory managers:
• Execution memory: computation in shuffles, joins, sorts, and aggregations
• Storage memory: caching and propagating internal data sources across the cluster
• Challenges with this:
• Manual intervention needed to avoid unnecessary spilling
• No good defaults for all workloads, meaning lost efficiency
• Goal: allow memory regions to shrink/grow dynamically
Unified Memory Management in Spark 1.6
• Memory can cross between the execution and storage regions
• When execution memory exceeds its own region, it can borrow as much of the storage space as is free, and vice versa
• Borrowed storage memory can be evicted at any time
• Significantly reduces configuration
• Can define a low water mark for storage (below which we won't evict)
• Reference: [SPARK-10000]
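As a concrete sketch, the unified model is controlled by a handful of configuration properties in the 1.6 release; the values shown here are the documented defaults:

```properties
# Fraction of JVM heap shared by execution and storage combined (Spark 1.6 default).
spark.memory.fraction          0.75
# Portion of that unified region protected from eviction (the storage low water mark).
spark.memory.storageFraction   0.5
# Switch to fall back to the pre-1.6 static split, if a workload was tuned for it.
spark.memory.useLegacyMode     false
```

Most workloads should need none of these; the point of SPARK-10000 is that the regions resize themselves.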
History of Spark APIs
RDD API (2011): distributed collection of JVM objects; functional operators (map, filter, etc.)
DataFrame API (2013): distributed collection of Row objects; expression-based operations and UDFs; logical plans and optimizer; fast/efficient
Dataset
An "Encoder" converts from a JVM object into a Dataset Row
Check out [SPARK-9999]
(diagram: JVM Object → encoder → Dataset Row)
Dataset API in Spark 1.6
Typed interface over DataFrames / Tungsten
case class Person(name: String, age: Long)
val dataframe = sqlContext.read.json("people.json")
val ds: Dataset[Person] = dataframe.as[Person]
ds.filter(p => p.name.startsWith("M"))
  .groupBy($"name")
  .avg("age")
Tungsten Execution
(stack diagram: SQL | Python | R | Streaming | Advanced Analytics, layered over DataFrame (& Dataset), layered over Tungsten Execution)
Other Notable Core Engine Features
SQL directly over files
Advanced JSON parsing
Better instrumentation for SQL operators
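The "SQL directly over files" feature ([SPARK-11197]) lets a query reference a file path in place of a table name. A minimal sketch, assuming a running SQLContext and a Parquet file at the (hypothetical) path shown:

```scala
// Query a file in place -- no temp-table registration needed (Spark 1.6).
// The path and its contents are illustrative assumptions.
val df = sqlContext.sql("SELECT * FROM parquet.`/data/people.parquet`")
df.show()
```

The same `format.`path`` syntax works for other built-in data sources such as json.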
Demos of What We Learned So Far
Advanced Layout of Cached Data
Storing partitioning and ordering schemes in the in-memory table scan allows for performance improvements: e.g., in joins, an extra partition step can be saved based on this information
Adding distributeBy and localSort to the DataFrame API
Similar to HiveQL's DISTRIBUTE BY
Allows the user to control the partitioning and ordering of a data set
Check out [SPARK-4849]
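A sketch of that DataFrame-level control. The names distributeBy/localSort were provisional at talk time; the released 1.6 API exposes the equivalents as repartition and sortWithinPartitions, assumed here along with a hypothetical df that has a key column:

```scala
// Co-locate rows by key (like HiveQL DISTRIBUTE BY), then order rows within
// each partition (like SORT BY) -- no global sort is required.
val laidOut = df.repartition($"key").sortWithinPartitions($"key")
```

Caching `laidOut` preserves this layout, which is what lets a later join skip its own partitioning step.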
[Streaming] New improved state management
Introducing a DStream transformation for stateful stream processing
Does not scan every key
Easier to implement common use cases:
timeout of idle data
returning items other than state
Supersedes updateStateByKey in functionality and performance.
trackStateByKey (note: this name may change)
[Streaming] trackStateByKey example
(name may change)
// Initial RDD input
val initialRDD = ssc.sparkContext.parallelize(...)
// ReceiverInputDStream
val lines = ssc.socketTextStream(...)
val words = lines.flatMap(...)
val wordDStream = words.map(x => (x, 1))
// stateDStream using trackStateByKey
val trackStateFunc = (...) { ... }
val stateDStream = wordDStream.trackStateByKey(
  StateSpec.function(trackStateFunc).initialState(initialRDD))
[Streaming] Display the failed output op in Streaming
Check out: [SPARK-10885], PR #8950
[MLlib]: Pipeline persistence
Persist ML Pipelines to:
Save models in the spark.ml API
Re-run workflows in a reproducible manner
Export models to non-Spark apps (e.g., a model server)
This is more complex than ML model persistence because:
Must persist Transformers and Estimators, not just Models.
We need a standard way to persist Params.
Pipelines and other meta-algorithms can contain other Transformers and Estimators, including as Params.
We should save feature metadata with Models
[MLlib]: Pipeline persistence
Reference: [SPARK-6725]
Adding model export/import to the spark.ml API.
Adding the internal Saveable/Loadable API and a Parquet-based format.
R-like statistics for GLMs
Provide R-like summary statistics for ordinary least squares via a normal equation solver
Check out [SPARK-9836]
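A sketch of how this surfaces in spark.ml 1.6, assuming a hypothetical DataFrame `training` with "label" and "features" columns:

```scala
// Fit OLS with the normal-equation solver, then read R-like summary
// statistics off the fitted model (sketch; requires a Spark runtime).
import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression().setSolver("normal")
val model = lr.fit(training)
println(model.summary.r2)   // R-squared, analogous to R's summary(lm(...))
```

The normal-equation path is what makes these closed-form statistics cheap to compute alongside the fit.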
Performance
SPARK-10000 Unified Memory Management - Shared memory for execution and caching instead of exclusive division of the regions.
SPARK-10917, SPARK-11149 In-memory Columnar Cache Performance - Significant (up to 14x) speedup when caching data that contains complex types in DataFrames or SQL.
SPARK-11389 SQL Execution Using Off-Heap Memory - Support for configuring query execution to use off-heap memory, avoiding GC overhead.
Performance (continued)
SPARK-4849 Advanced Layout of Cached Data - Storing partitioning and ordering schemes in the in-memory table scan, and adding distributeBy and localSort to the DataFrame API.
SPARK-9858 Adaptive query execution - Initial support for automatically selecting the number of reducers for joins and aggregations.
Spark SQL
SPARK-9999 Dataset API
SPARK-11197 SQL Queries on Files
SPARK-11745 Reading non-standard JSON files
SPARK-10412 Per-operator Metrics for SQL Execution
SPARK-11329 Star (*) expansion for StructTypes
SPARK-11111 Fast null-safe joins
SPARK-10978 Datasource API: Avoid Double Filter
Spark Streaming
API Updates
SPARK-2629 New improved state management
SPARK-11198 Kinesis record deaggregation
SPARK-10891 Kinesis message handler function
SPARK-6328 Python Streaming Listener API
UI Improvements
Made failures visible in the streaming tab: in the timelines, batch list, and batch details page.
Made output operations visible in the streaming tab as progress bars.
MLlib: New algorithms / models
SPARK-8518 Survival analysis - Log-linear model for survival analysis
SPARK-9834 Normal equation for least squares - Normal equation solver, providing R-like model summary statistics
SPARK-3147 Online hypothesis testing - A/B testing in the Spark Streaming framework
SPARK-9930 New feature transformers - ChiSqSelector, QuantileDiscretizer, SQL transformer
SPARK-6517 Bisecting K-Means clustering - Fast top-down clustering variant of K-Means
MLlib: API Improvements
ML Pipelines
SPARK-6725 Pipeline persistence - Save/load for ML Pipelines, with partial coverage of spark.ml algorithms
SPARK-5565 LDA in ML Pipelines - API for Latent Dirichlet Allocation in ML Pipelines
R API
SPARK-9836 R-like statistics for GLMs - (Partial) R-like stats for ordinary least squares via summary(model)
SPARK-9681 Feature interactions in R formula - Interaction operator ":" in R formula
Python API - Many improvements to the Python API to approach feature parity
MLlib: Miscellaneous Improvements
SPARK-7685, SPARK-9642 Instance weights for GLMs - Logistic and Linear Regression can take instance weights
SPARK-10384, SPARK-10385 Univariate and bivariate statistics in DataFrames - Variance, stddev, correlations, etc.
SPARK-10117 LIBSVM data source - LIBSVM as a SQL data source
For More Information
Apache Spark 1.6.0 Release Preview: http://apache-spark-developers-list.1001551.n3.nabble.com/ANNOUNCE-Spark-1-6-0-Release-Preview-td15314.html
Spark 1.6 Preview available in Databricks: https://databricks.com/blog/2015/11/20/announcing-spark-1-6-preview-in-databricks.html
Notebooks
Spark 1.6 Improvements Notebook: http://cdn2.hubspot.net/hubfs/438089/notebooks/Spark_1.6_Improvements.html?t=1448929686268
Spark 1.6 R Improvements Notebook: http://cdn2.hubspot.net/hubfs/438089/notebooks/Spark_1.6_R_Improvements.html?t=1448946977231
Join us at
Spark Summit East
February 16-18, 2016 | New York City
Thanks!