What’s New in the Spark Community
Patrick Wendell | @pwendell
About Me
Co-Founder of Databricks
Founding committer of Apache Spark at U.C. Berkeley
Today, manage Spark effort @ Databricks
About Databricks
Team donated Spark to ASF in 2013; primary maintainers of Spark today
Hosted analytics stack based on Apache Spark
Managed clusters, notebooks, collaboration, and third-party apps
Today’s Talk
Quick overview of Apache Spark
Technical roadmap directions
Community and ecosystem trends
What is your familiarity with Spark?
1.  Not very familiar with Spark – only very high level.
2.  Understand the components/uses well, but I’ve never written code.
3.  I’ve written Spark code for a POC or production use case.
“Spark is the Taylor Swift
of big data software.”
- Derrick Harris, Fortune
Apache Spark Engine
Spark Core
Libraries: Streaming, SQL and DataFrames, MLlib, GraphX, …
Unified engine across diverse workloads & environments
Scale out, fault tolerant
Python, Java, Scala, and R APIs
Standard libraries
This Talk
“What’s new” in Spark? And what’s coming?
Two parts: Technical roadmap and community developments
“The future is already here — it's just not very evenly distributed.”
- William Gibson
Technical Directions
Spark Technical Directions
Higher-level APIs
Make developers more productive
Performance of key execution primitives
Shuffle, sorting, hashing, and state management
Pluggability and extensibility
Make it easy for other projects to integrate with Spark
Spark Technical Directions
Higher-level APIs
Make developers more productive
Performance of key execution primitives
Shuffle, sorting, hashing, and state management
Pluggability and extensibility
Make it easy for other projects to integrate with Spark
Higher-Level APIs
Making Spark accessible to data scientists, engineers, statisticians…
Computing an Average: MapReduce vs Spark
MapReduce (Java):

private IntWritable one = new IntWritable(1);
private IntWritable output = new IntWritable();

protected void map(LongWritable key, Text value, Context context) {
  String[] fields = value.toString().split("\t");
  output.set(Integer.parseInt(fields[1]));
  context.write(one, output);
}

IntWritable one = new IntWritable(1);
DoubleWritable average = new DoubleWritable();

protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context) {
  int sum = 0;
  int count = 0;
  for (IntWritable value : values) {
    sum += value.get();
    count++;
  }
  average.set(sum / (double) count);
  context.write(key, average);
}

Spark (Python):

data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [x[1], 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()
Computing an Average with Spark
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [x[1], 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()
Computing an Average with DataFrames
	
  
sqlCtx.table("people")
  .groupBy("name")
  .agg("name", avg("age"))
  .collect()
Spark DataFrame API
Explicit data model and schema
Selecting columns and filtering
Aggregation (count, sum, average, etc)
User defined functions
Joining different data sources
Statistical functions and easy plotting
Python, Scala, Java, and R
sqlCtx.table("people")
  .groupBy("name")
  .agg("name", avg("age"))
  .collect()
Ask more of your framework!
MapReduce: fault tolerance, data distribution
Spark adds: set operators, operator DAG, caching
Spark + DataFrames adds: schema management, relational semantics, logical plan optimization, storage push down and optimization, analytic operations
…
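To make these DataFrame operations concrete, here is a minimal PySpark sketch; the people and cities tables, their columns, and the age_bucket helper are illustrative assumptions rather than content from the slides.

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

people = sqlCtx.table("people")                        # explicit schema from the catalog
adults = people.filter(people.age > 21)                # filter rows
names = adults.select("name", "age")                   # select columns

age_bucket = F.udf(lambda a: a // 10, IntegerType())   # user-defined function
decades = people.select(people.name, age_bucket(people.age).alias("decade"))

cities = sqlCtx.table("cities")                        # join a second data source
stats = (people.join(cities, people.city == cities.city)
               .groupBy(cities.country)
               .agg(F.count("name").alias("n"), F.avg("age").alias("avg_age")))
stats.collect()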
Other high-level APIs
ML Pipelines
SparkR
[ML Pipeline diagram: tokenizer → hashingTF → lr produce lr.model, transforming datasets ds0 → ds1 → ds2 → ds3]
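A minimal PySpark sketch of that pipeline (tokenizer, hashingTF, logistic regression); the training and test DataFrames with "text" and "label" columns are assumed for illustration.

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

tokenizer = Tokenizer(inputCol="text", outputCol="words")      # ds0 -> ds1
hashingTF = HashingTF(inputCol="words", outputCol="features")  # ds1 -> ds2
lr = LogisticRegression(maxIter=10, regParam=0.01)

pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(training)        # fits lr.model as the last pipeline stage
predictions = model.transform(test)   # ds3 with an added "prediction" column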
> faithful <- read.df("faithful.json", "json")
> head(filter(faithful, faithful$waiting < 50))
##   eruptions waiting
## 1     1.750      47
## 2     1.750      47
## 3     1.867      48
Spark Technical Directions
Higher-level APIs
Make developers more productive
Performance of key execution primitives
Shuffle, sorting, hashing, and state management
Pluggability and extensibility
Make it easy for other projects to integrate with Spark
Performance Initiatives
Project Tungsten – improving runtime efficiency of key internals
Everything else – IO optimizations, dynamic plan re-writing
Project Tungsten: The CPU Squeeze
           2010              2015
Storage    50+ MB/s (HDD)    500+ MB/s (SSD)    10x
Network    1 Gbps            10 Gbps            10x
CPU        ~3 GHz            ~3 GHz             (no change)
Project Tungsten
Code generation for CPU efficiency
Code generation is on by default and uses Janino [SPARK-7956]
Beefed-up built-in UDF library (~100 UDFs added, with code generation); a usage sketch follows the list below:
AddMonths, ArrayContains, Ascii, Base64, Bin, BinaryMathExpression, CheckOverflow, CombineSets, Contains, CountSet, Crc32,
DateAdd, DateDiff, DateFormatClass, DateSub, DayOfMonth, DayOfYear, Decode, Encode, EndsWith, Explode, Factorial, FindInSet,
FormatNumber, FromUTCTimestamp, FromUnixTime, GetArrayItem, GetJsonObject, GetMapValue, Hex, InSet, InitCap, IsNaN, IsNotNull,
IsNull, LastDay, Length, Levenshtein, Like, Lower, MakeDecimal, Md5, Month, MonthsBetween, NaNvl, NextDay, Not, PromotePrecision,
Quarter, RLike, Round, Second, Sha1, Sha2, ShiftLeft, ShiftRight, ShiftRightUnsigned, SortArray, SoundEx, StartsWith, StringInstr,
StringRepeat, StringReverse, StringSpace, StringSplit, StringTrim, StringTrimLeft, StringTrimRight, TimeAdd, TimeSub, ToDate,
ToUTCTimestamp, TruncDate, UnBase64, UnaryMathExpression, Unhex, UnixTimestamp
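Many of these expressions surface as built-in functions on DataFrames; a minimal PySpark sketch (the df DataFrame and its name, birthday, and signup_ts columns are assumed for illustration):

from pyspark.sql import functions as F

df.select(
    F.initcap(df.name),                      # InitCap
    F.levenshtein(df.name, F.lit("spark")),  # Levenshtein
    F.date_add(df.birthday, 30),             # DateAdd
    F.from_unixtime(df.signup_ts),           # FromUnixTime
    F.sha1(df.name)                          # Sha1
).show()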
  
Project Tungsten
Binary processing for memory management (all data types):
External sorting with managed memory
External hashing with managed memory
Managed Memory HashMap in Tungsten
[Diagram: a compact array of (hash code, pointer) entries referencing key/value pairs laid out in managed memory pages]
Where are we going?
[Diagram: language frontends (Python, Java/Scala, R, SQL, …) feed the DataFrame API and its Logical Plan, which the Tungsten backend can target to JVM, LLVM, GPU, NVRAM, …]
[Diagram: SQL, Python, R, Streaming, and advanced analytics libraries all sit on the DataFrame API, which runs on Tungsten execution]
Spark Technical Directions
Higher-level APIs
Make developers more productive
Performance of key execution primitives
Shuffle, sorting, hashing, and state management
Pluggability and extensibility
Make it easy for other projects to integrate with Spark
Pluggability: Rich IO Support
df = sqlContext.read \
    .format("json") \
    .option("samplingRatio", "0.1") \
    .load("/home/michael/data.json")

df.write \
    .format("parquet") \
    .mode("append") \
    .partitionBy("year") \
    .saveAsTable("fasterData")
Unified interface to reading/writing data in a variety of formats
Large Number of IO Integrations
Spark SQL’s Data Source API can read and write DataFrames using a variety of formats.
Built-in and external sources include JSON, JDBC, and many more…
Find more sources at http://spark-packages.org/
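For example, the built-in JDBC source plugs into the same interface; a minimal PySpark sketch (the connection URL and table name are illustrative):

jdbc_df = sqlContext.read \
    .format("jdbc") \
    .options(url="jdbc:postgresql:dbserver", dbtable="schema.tablename") \
    .load()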
Deployment Integrations
Technical Directions
Early on, the focus was:
Can Spark be an engine that is faster and easier to use than Hadoop
MapReduce?
Today the question is:
Can Spark & its ecosystem make big data as easy as little data?
Community/User Growth
Who is the “Spark Community”?
thousands of users
… hundreds of developers
… dozens of distributors
Getting a better vantage point
Databricks survey - feedback from more than 1,400 users
Community trends: Library & package ecosystem
Strata NY 2014: Widespread use of core RDD API
Today: Most use built-in and community libraries
51% of users use 3 or more libraries
Spark Packages
Strata NY 2014: Didn’t exist
Today: > 100 community packages
> ./bin/spark-shell --packages databricks/spark-avro:0.2
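Once a package such as spark-avro is on the classpath, it is used through the same Data Source API shown earlier; a minimal sketch (the file paths are illustrative):

df = sqlContext.read.format("com.databricks.spark.avro").load("/path/to/episodes.avro")
df.write.format("com.databricks.spark.avro").save("/path/to/output")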
Spark Packages
API extensions: Clojure API, Spark Kernel, Zeppelin Notebook, IndexedRDD
Deployment utilities: Google Compute, Microsoft Azure, Spark Jobserver
Data sources: Redshift, Avro, CSV, Elasticsearch, MongoDB
Increasing storage options
Strata NY 2014: IO primarily through Hadoop InputFormat API
January 2015: Spark adds native storage API
Today: Well over 20 natively integrated storage bindings
Cassandra, ElasticSearch, MongoDB, Avro, Parquet, ORC, HBase,
Redshift, SAP, CSV, Cloudant, Oracle, JDBC, SequoiaDB, Couchbase…
Deployment environments
Strata NY 2014: Traction in the Hadoop community
Today: Growth beyond Hadoop… increasingly public cloud
51% of respondents run Spark in public cloud
Wrapping it up
Spark has grown and developed quickly in the last year!
Looking forward, expect:
-  Engineering effort on higher-level APIs and performance
-  A broader surrounding ecosystem
-  The unexpected
Where to learn more about Spark?
SparkHub community portal
Spark Summit conference - https://spark-summit.org/
Massive online course (edX)
Databricks Spark training
Books
Questions?