Sameer Agarwal
Spark Summit | San Francisco | June 6th 2018
What’s New in Apache Spark 2.3
#DevSAIS16
About Me
• Spark Committer and 2.3 Release Manager
• Software Engineer at Facebook (Big Compute)
• Previously at Databricks and UC Berkeley
• Research on BlinkDB (Approximate Queries in Spark)
Spark 2.3 Release by the numbers
• Released on 28th February 2018
• Development Span: July ’17 – Feb ’18
• 284 Contributors
• 1406 JIRAs
– SQL/Streaming (52%)
– Spark Core (12%)
– PySpark (9%)
– ML (8%)
Overview
• Continuous Processing
• Spark on Kubernetes
• PySpark Performance
• ML on Streaming + Image Reader
Major Features in Spark 2.3
• Continuous Processing
• Data Source API V2
• Stream-stream Join
• Spark on Kubernetes
• History Server V2
• UDF Enhancements
• Various SQL Features
• PySpark Performance
• Native ORC Support
• Stable Codegen
• Image Reader
• ML on Streaming
https://spark.apache.org/releases/spark-release-2-3-0.html
Structured Streaming
• Users: treat a stream as an infinite table; no need to reason about micro-batches
• Developers: the high-level API is decoupled from the execution engine
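The "infinite table" idea can be illustrated without Spark. The sketch below is plain Python, not the Structured Streaming API: the class `RunningCount` and the helper `batch_count` are hypothetical names chosen for illustration. It shows that a result table updated incrementally from newly arrived rows matches a batch query over every row seen so far, which is the guarantee that lets users ignore micro-batch boundaries.

```python
from collections import Counter


class RunningCount:
    """Keeps a result table (word -> count) up to date as rows are appended."""

    def __init__(self):
        self.result = Counter()

    def append(self, new_rows):
        # Only the newly arrived rows are processed; earlier rows are not rescanned.
        self.result.update(new_rows)
        return dict(self.result)


def batch_count(all_rows):
    """The same query, evaluated from scratch over the whole table."""
    return dict(Counter(all_rows))


# Three arrivals of rows appended to the conceptually unbounded input table.
arrivals = [["spark", "kafka"], ["spark"], ["k8s", "spark"]]

query = RunningCount()
seen = []
for rows in arrivals:
    seen.extend(rows)
    incremental = query.append(rows)
    # The incremental result always equals the batch result over the data so far.
    assert incremental == batch_count(seen)

print(incremental)
```

How the rows are grouped into arrivals does not change the final result table; only the input seen so far does.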
Micro-Batch Execution
• Latency > 100 ms
• Exactly-once semantics
Continuous Processing (SPARK-20928)
• An experimental execution mode
• Latency ~1 ms
• At-least-once semantics
Supported Operations
• Map-like Dataset operations
– Projections
– Selections
• All SQL functions
– Except current_timestamp(), current_date(), and aggregation functions
Supported Sources
• Kafka Source
• Rate Source
Supported Sinks
• Kafka Sink
• Memory Sink
• Console Sink
Blog: https://tinyurl.com/spark-cp
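A minimal PySpark sketch of a query restricted to the operations listed above: Rate Source in, one projection and one selection, Console Sink out. It assumes an existing SparkSession (2.3 or later) is passed in as `spark`. The function name `start_continuous_query`, the row rate, and the column alias `doubled` are illustrative choices, not part of the talk. The `continuous` argument of `trigger` is the checkpoint interval.

```python
def start_continuous_query(spark, checkpoint_interval="1 second"):
    """Start a continuous-mode streaming query on an existing SparkSession."""
    events = (
        spark.readStream
        .format("rate")                # Rate Source: emits (timestamp, value) rows
        .option("rowsPerSecond", 10)   # illustrative rate
        .load()
    )

    evens = (
        events
        .selectExpr("value", "value * 2 AS doubled")  # projection
        .where("value % 2 = 0")                       # selection
    )

    return (
        evens.writeStream
        .format("console")                        # Console Sink
        .trigger(continuous=checkpoint_interval)  # continuous mode, not micro-batch
        .start()
    )
```

Dropping the `trigger(continuous=...)` call would run the same query under the default micro-batch engine, which is the decoupling of API and execution engine that the earlier slides describe.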
ML on Streaming
• Model transformation/prediction on batch and streaming data with a unified API
• After fitting a model or Pipeline, you can deploy it in a streaming job
val streamOutput = transformer.transform(streamDF)
Image Support in Spark (SPARK-21866)
• A standard API in Spark for reading images into DataFrames
• Utilities for loading images from common formats
• Deep learning frameworks can rely on this
import org.apache.spark.ml.image.ImageSchema
val df = ImageSchema.readImages("/data/images")
PySpark
• Introduced in Spark 0.7 (~2013); became a first-class citizen in the DataFrame API in Spark 1.3 (~2015)
• Much slower than Scala/Java with UDFs due to serialization and the Python interpreter
• Note: Most PyData tooling (e.g., pandas, NumPy) is written in C/C++
PySpark Performance
Pandas UDFs perform much better than row-at-a-time UDFs across the board, with speedups ranging from 3x to over 100x.
Pandas/Vectorized UDFs
Scalar UDFs
• Used with functions such as select and withColumn
• The Python function should take a pandas.Series as input and return a pandas.Series of the same length
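A small sketch of a scalar Pandas UDF, assuming pandas is installed and, for the Spark part, PySpark 2.3 or later with Arrow available. The function names `plus_one` and `add_plus_one_column` and the column names are hypothetical. The pandas function is kept separate so it can be tried on a plain pandas.Series before it is handed to Spark.

```python
import pandas as pd


def plus_one(v: pd.Series) -> pd.Series:
    """Vectorized logic: receives a whole batch of values, returns the same length."""
    return v + 1


def add_plus_one_column(df, column="x"):
    """Wrap plus_one as a scalar Pandas UDF and apply it with withColumn.

    `df` is assumed to be a Spark DataFrame with a numeric column named `column`.
    PySpark is imported here so the pandas function above stays usable without Spark.
    """
    from pyspark.sql.functions import PandasUDFType, col, pandas_udf

    plus_one_udf = pandas_udf(
        plus_one, returnType="long", functionType=PandasUDFType.SCALAR
    )
    return df.withColumn("plus_one", plus_one_udf(col(column)))


# The function can be checked locally on an ordinary pandas.Series.
print(plus_one(pd.Series([1, 2, 3])).tolist())
```

Spark calls the function once per batch of rows rather than once per row, which is where the speedup over row-at-a-time UDFs comes from.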
Pandas/Vectorized UDFs
Grouped Map UDFs
• Split-apply-combine
• A Python function that defines the computation for each group
• Input and output are both pandas.DataFrame
Blog: https://tinyurl.com/pyspark-udf
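A sketch of the split-apply-combine pattern as a grouped map UDF, again assuming pandas plus PySpark 2.3 or later. The names `subtract_mean` and `apply_per_group`, and the schema `id long, v double`, are assumptions made for the example. `groupby(...).apply(...)` is the entry point in the 2.3 API; later Spark releases offer other ways to express the same thing.

```python
import pandas as pd


def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    """Per-group computation: center column `v` on the group's mean."""
    return pdf.assign(v=pdf.v - pdf.v.mean())


def apply_per_group(df):
    """Apply subtract_mean to each `id` group of a Spark DataFrame.

    `df` is assumed to have columns `id` (long) and `v` (double).
    """
    from pyspark.sql.functions import PandasUDFType, pandas_udf

    grouped_udf = pandas_udf(
        subtract_mean,
        returnType="id long, v double",          # schema of the returned DataFrame
        functionType=PandasUDFType.GROUPED_MAP,
    )
    return df.groupby("id").apply(grouped_udf)


# Local check on one group, using plain pandas.
group = pd.DataFrame({"id": [1, 1, 1], "v": [1.0, 2.0, 3.0]})
print(subtract_mean(group))
```

Each group is materialized as one pandas.DataFrame, so a single group has to fit in the memory of one executor.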
Spark on Kubernetes (SPARK-18278)
Engine: Spark Core
Libraries: Spark SQL + DataFrames, Structured Streaming, MLlib, GraphX
Cluster managers: Standalone, YARN, Mesos
Spark on Kubernetes (SPARK-18278)
• The driver runs in a Kubernetes pod created by the submission client and creates pods that run the executors in response to requests from the Spark scheduler
• Makes direct use of Kubernetes clusters for multi-tenancy and sharing through Namespaces and Quotas, as well as administrative features such as Pluggable Authorization and Logging
Spark on Kubernetes (SPARK-18278)
Apache Spark 2.3
• Supports K8S 1.6+
• Cluster Mode
• Static Resource Allocation
• Java/Scala Applications
• Container-local and remote dependencies that are downloadable
Roadmap (Apache Spark 2.4+)
• Client Mode
• Dynamic Resource Allocation + External Shuffle Service
• Python/R Applications
• Client-local dependencies + Resource Staging Server (RSS)
Blog: https://tinyurl.com/spark-k8s
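To make the 2.3 feature set concrete, the sketch below assembles a `spark-submit` argument list for a cluster-mode Scala/Java application with a fixed number of executors and a container-local (`local://`) jar. The helper `build_spark_submit_args` is hypothetical, and the API server address, image name, and jar path are placeholders. The flags and the `spark.kubernetes.container.image` property follow the Spark 2.3 "Running on Kubernetes" documentation as best recalled, so check them against the docs for the version in use.

```python
def build_spark_submit_args(api_server, image, main_class, app_jar,
                            executors=2, name="spark-app"):
    """Return the argv for a cluster-mode submission to Kubernetes."""
    return [
        "bin/spark-submit",
        "--master", "k8s://" + api_server,     # Kubernetes API server URL
        "--deploy-mode", "cluster",            # the only deploy mode on K8s in 2.3
        "--name", name,
        "--class", main_class,                 # Java/Scala entry point
        "--conf", "spark.executor.instances={}".format(executors),  # static allocation
        "--conf", "spark.kubernetes.container.image={}".format(image),
        app_jar,                               # local:// means already inside the image
    ]


# Placeholder values for illustration only.
args = build_spark_submit_args(
    api_server="https://k8s.example.invalid:6443",
    image="example/spark:2.3.0",
    main_class="org.apache.spark.examples.SparkPi",
    app_jar="local:///opt/spark/examples/jars/spark-examples.jar",
    executors=3,
)
print(" ".join(args))
```

The list form can be passed directly to `subprocess.run` from a machine that has a Spark distribution and access to the cluster.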
Recap
• Continuous Processing
• Spark on Kubernetes
• PySpark Performance
• ML on Streaming + Image Reader
Questions?
