What’s New in Spark 2?
Eyal Ben Ivri
In just two words…
Not Much
In just OVER two words…
Not Much,
but let’s talk about it.
Let’s start from the beginning
• What is Apache Spark?
• An open source cluster computing framework.
• Originally developed at the University of California, Berkeley's AMPLab.
• Designed as a Big Data computational framework.
Spark Components
Data Sources
Spark Core (Batch Processing, RDD, SparkContext)
SparkSQL (DataFrame)
Spark Streaming
Spark MLlib (ML Pipelines)
Spark GraphX
Spark Packages
Spark Components (v2)
Data Sources
Spark Core (Batch Processing, RDD, SparkContext)
Spark Streaming
Spark MLlib (ML Pipelines)
Spark GraphX
Spark Packages
Spark SQL (SparkSession, Dataset)
Timeline
UC Berkeley’s AMPLab (2009)
Open Sourced (2010)
Apache Foundation (2013)
Top-Level Apache Project (Feb 2014)
Version 1.0 (May 2014)
World record in large-scale sorting (Nov 2014)
Version 1.6 (Jan 2016)
Version 2.0.0 (Jul 2016)
Version 2.0.1 (Oct 2016, Current)
Version History (major changes)
1.0 – SparkSQL (formerly the Shark project)
1.1 – Streaming support for Python
1.2 – Core engine improvements. GraphX graduates.
1.3 – DataFrame. Python engine improvements.
1.4 – SparkR
1.5 – Bugs and Performance
1.6 – Dataset (experimental)
Spark 2.0.x
• Major notables:
• Scala 2.11
• SparkSession
• Performance
• API Stability
• SQL:2003 support
• Structured Streaming
• R UDFs and MLlib algorithm implementations
API
• Spark doesn't like API changes
• The good news:
• To migrate, you’ll have to make few or no changes to your code.
• The (not so) bad news:
• To benefit from all the performance improvements, some old code might need more refactoring.
Programming API
• Unifying DataFrame and Dataset:
• Dataset[Row] = DataFrame
• SparkSession replaces SQLContext/HiveContext
• Both kept for backwards compatibility.
• Simpler, more performant accumulator
API
• A new, improved Aggregator API for
typed aggregation in Datasets
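The points on this slide can be sketched in a few lines of Scala. This is only an illustration under assumptions: the `Passage` record, the `SumFee` aggregator and the sample values are hypothetical and do not come from the talk or its demo repository. It shows a `SparkSession` as the single entry point, a `DataFrame` being assigned to a `Dataset[Row]` without conversion, and a typed `Aggregator` applied to a Dataset.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Encoder, Encoders, Row, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical record type, used only for illustration.
case class Passage(booth: String, fee: Double)

// Typed aggregation: sums the fee field over Passage records.
object SumFee extends Aggregator[Passage, Double, Double] {
  def zero: Double = 0.0
  def reduce(acc: Double, p: Passage): Double = acc + p.fee
  def merge(a: Double, b: Double): Double = a + b
  def finish(acc: Double): Double = acc
  def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

object ProgrammingApiSketch {
  def totalFee(spark: SparkSession): Double = {
    import spark.implicits._
    val ds: Dataset[Passage] = Seq(Passage("A", 2.5), Passage("B", 4.0)).toDS()

    // DataFrame is a type alias for Dataset[Row], so this assignment compiles as-is.
    val df: DataFrame = ds.toDF()
    val sameThing: Dataset[Row] = df

    // Apply the typed aggregator as a column over the Dataset.
    ds.select(SumFee.toColumn).first()
  }

  def main(args: Array[String]): Unit = {
    // SparkSession is the single entry point; no separate SQLContext is created.
    val spark = SparkSession.builder()
      .appName("programming-api-sketch")
      .master("local[*]")
      .getOrCreate()
    println(totalFee(spark))
    spark.stop()
  }
}
```

For workloads that previously relied on HiveContext, calling `enableHiveSupport()` on the builder is the usual replacement.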
SQL Language
• Improved SQL functionality (SQL:2003 support)
• Can now run all 99 TPC-DS queries
• The parser supports ANSI SQL as well as HiveQL
• Subquery support
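As a small illustration of the subquery support mentioned above, the sketch below runs an uncorrelated scalar subquery through `spark.sql`. The `passages` view, its columns and the sample rows are assumptions made up for this example.

```scala
import org.apache.spark.sql.SparkSession

object SubquerySketch {
  // Returns the booths whose fee is above the average fee, using a scalar subquery.
  def aboveAverage(spark: SparkSession): Seq[String] = {
    import spark.implicits._

    // Hypothetical sample data, registered as a temporary view.
    Seq(("A", 2.0), ("B", 4.0), ("C", 9.0))
      .toDF("booth", "fee")
      .createOrReplaceTempView("passages")

    spark.sql(
      """SELECT booth FROM passages
        |WHERE fee > (SELECT avg(fee) FROM passages)
        |ORDER BY booth""".stripMargin
    ).as[String].collect().toSeq
  }
}
```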
SparkSQL new features
• Native CSV data source (Based on
Databricks’ spark-csv package)
• Better off-heap memory
management
• Bucketing support (Hive
implementation)
• Performance performance
performance
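With the CSV data source built in, reading a file no longer needs the external spark-csv package. A minimal sketch, assuming a file with a header row; the file contents and column names used in `main` are hypothetical:

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object CsvSketch {
  // Reads a CSV file with a header row and lets Spark infer the column types.
  def readCsv(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-sketch")
      .master("local[*]")
      .getOrCreate()

    // Write a tiny, made-up CSV file to a temporary location and read it back.
    val file = Files.createTempFile("passages", ".csv")
    Files.write(file, "booth,fee\nA,2.5\nB,4.0\n".getBytes("UTF-8"))

    readCsv(spark, file.toString).printSchema()
    spark.stop()
  }
}
```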
Demo
Spark API
SparkSQL
Judges Ruling
Dataset was supposed to be the future like 6 months ago
SQL 2003 is so 2003
The API lives on!
SQL 2003 is cool
Structured Streaming (Alpha)
Structured Streaming is a scalable and
fault-tolerant stream processing engine
built on the Spark SQL engine.
You can express your streaming
computation the same way you would
express a batch computation on static
data.
Structured Streaming (cont.)
You can use the Dataset / DataFrame API in Scala, Java or Python to
express streaming aggregations, event-time windows, stream-to-
batch joins, etc.
The computation is executed on the same optimized Spark SQL
engine.
Exactly-once fault-tolerance guarantees through checkpointing and
Write Ahead Logs.
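To make the "same API as batch" point concrete, here is a hedged sketch of a streaming query. The JSON file source, the schema, the `booth` column and the console sink are all assumptions chosen for illustration; the talk's own demo may be wired differently.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.window
import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

object StreamingSketch {
  // Hypothetical schema for JSON records dropped into an input directory.
  val schema: StructType = new StructType()
    .add("timestamp", TimestampType)
    .add("booth", StringType)

  // Builds the streaming aggregation; nothing runs until a sink is started.
  def windowedCounts(spark: SparkSession, inputDir: String): DataFrame = {
    import spark.implicits._
    val cars = spark.readStream.schema(schema).json(inputDir)
    cars.groupBy(window($"timestamp", "5 minutes", "5 minutes"), $"booth").count()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("streaming-sketch")
      .master("local[*]")
      .getOrCreate()

    // Print the full, continuously updated result table to the console.
    val query = windowedCounts(spark, args(0))
      .writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```

The aggregation itself is ordinary DataFrame code; only `readStream` and `writeStream` mark the query as streaming.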
Windowing streams
How many vehicles entered each toll booth every 5 minutes?
val windowedCounts = cars.groupBy(
  window($"timestamp", "5 minutes", "5 minutes"),
  $"car"
).count()
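The same windowed count can be tried on static data, which is a convenient way to check the logic before attaching a stream. In this sketch the rows are invented, and the grouping column is named `booth` (an assumption, chosen to match the question on the slide) rather than `car`.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.window

object WindowSketch {
  // Counts rows per booth in tumbling 5-minute windows over a static DataFrame.
  def windowedCounts(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Hypothetical sample data: three vehicles passing the same booth.
    val cars = Seq(
      (Timestamp.valueOf("2016-10-01 10:01:00"), "booth-1"),
      (Timestamp.valueOf("2016-10-01 10:03:00"), "booth-1"),
      (Timestamp.valueOf("2016-10-01 10:07:00"), "booth-1")
    ).toDF("timestamp", "booth")

    // Window length equals slide length, so the windows do not overlap.
    cars.groupBy(
      window($"timestamp", "5 minutes", "5 minutes"),
      $"booth"
    ).count()
  }
}
```

With this data the first two rows fall into the 10:00 to 10:05 window and the third into the 10:05 to 10:10 window.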
Demo Structured Streaming
Judges Ruling
Still a micro-batching process
SparkSQL is the future
Tungsten Project - Phase 2
Performance improvements
Heap Memory management
Connectors optimizations
Let’s Pop the hood
Project Tungsten
Bring Spark performance closer to the bare metal, through:
• Native memory Management
• Runtime code generation
Started @ Version 1.4
The cornerstone that enabled the Catalyst engine
Project Tungsten - Phase 2
Whole stage code generation
a. A technique that blends state-of-the-art ideas from modern compilers and MPP databases.
b. Gives a performance boost of up to 9x.
c. Emits optimized bytecode at runtime that collapses the entire query into a single function.
d. Eliminates virtual function calls and leverages CPU registers for intermediate data.
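The idea behind points c and d can be illustrated in plain Scala, without Spark. This is a conceptual sketch, not Spark's generated code: the `Operator`, `Scan` and `Filter` names are hypothetical. It contrasts an iterator-style pipeline, where every row crosses virtual calls between operators, with a hand-fused loop of the kind whole-stage code generation aims to produce.

```scala
// Iterator-style execution: each operator pulls rows from its child via virtual calls.
trait Operator {
  def hasNext: Boolean
  def next(): Long
}

final class Scan(data: Array[Long]) extends Operator {
  private var i = 0
  def hasNext: Boolean = i < data.length
  def next(): Long = { val v = data(i); i += 1; v }
}

final class Filter(child: Operator, keep: Long => Boolean) extends Operator {
  private var buffered = 0L
  private var ready = false

  def hasNext: Boolean = {
    // Pull from the child until a row passes the predicate or input runs out.
    while (!ready && child.hasNext) {
      val v = child.next()
      if (keep(v)) { buffered = v; ready = true }
    }
    ready
  }

  def next(): Long = {
    if (!hasNext) throw new NoSuchElementException("no more rows")
    ready = false
    buffered
  }
}

object CodegenSketch {
  // Sum of even values, expressed as a chain of operators.
  def sumVolcano(data: Array[Long]): Long = {
    val op: Operator = new Filter(new Scan(data), _ % 2 == 0)
    var sum = 0L
    while (op.hasNext) sum += op.next()
    sum
  }

  // The same query collapsed into one loop: no operator objects, no virtual calls,
  // and the running sum can stay in a CPU register.
  def sumFused(data: Array[Long]): Long = {
    var sum = 0L
    var i = 0
    while (i < data.length) {
      val v = data(i)
      if (v % 2 == 0) sum += v
      i += 1
    }
    sum
  }
}
```

Both functions return the same result; the fused version simply does less work per row.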
Project Tungsten - Phase 2
Optimized input / output
a. Caching for DataFrames is based on Parquet
b. Faster Parquet reader
c. Google Guava is OUT
d. Smarter HadoopFS connector
To benefit, you have to be running on DataFrame / Dataset
Overall Judges Ruling
I want to complain but I don’t know
what about!!
Internal performance improvements
aside, this feels more like Spark 1.7
I like Flink...
All is good
SparkSQL is for sure the future of
Spark
The competition has done well
for Spark
Thank you (Questions?)
Long live Spark (and Flink)
Eyal Ben Ivri
https://github.com/eyalbenivri/spark2demo
