Beyond SQL:
Speeding up Spark with DataFrames
Michael Armbrust - @michaelarmbrust
March 2015 – Spark Summit East
[Charts: # of Unique Contributors and # of Commits Per Month for Spark SQL; annotation: Graduated from Alpha in 1.3.]
About Me and SQL

Spark SQL
‱  Part of the core distribution since Spark 1.0 (April 2014)
‱  Runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments
‱  Connect existing BI tools to Spark through JDBC
‱  Bindings in Python, Scala, and Java

SELECT COUNT(*)
FROM hiveTable
WHERE hive_udf(data)

@michaelarmbrust
‱  Lead developer of Spark SQL @databricks
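As a quick illustration (not on the slides), a HiveQL query like the SELECT COUNT(*) above can be run from PySpark; this minimal sketch assumes the Spark 1.3-era HiveContext and that hiveTable and hive_udf already exist in the Hive metastore:

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="spark-sql-example")
sqlCtx = HiveContext(sc)   # HiveContext gives access to Hive tables and UDFs

# hiveTable and hive_udf are assumed to already be registered in the metastore
result = sqlCtx.sql("SELECT COUNT(*) FROM hiveTable WHERE hive_udf(data)")
print(result.collect())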
  
The not-so-secret truth...

Spark SQL is not about SQL.
Execution Engine Performance

[Bar chart: TPC-DS performance, query runtimes for Shark and Spark SQL across queries 3, 7, 19, 27, 34, 42, 43, 46, 52, 53, 55, 59, 63, 68, 73, 79, 89, 98.]
The not-so-secret truth...

Spark SQL is about more than SQL.
Spark SQL: The whole story
Creating and Running Spark Programs Faster:
‱  Write less code
‱  Read less data
‱  Let the optimizer do the hard work
DataFrame
noun – [dey-tuh-freym]

1.  A distributed collection of rows organized into named columns.
2.  An abstraction for selecting, filtering, aggregating and plotting structured data (cf. R, Pandas).
3.  Archaic: Previously SchemaRDD (cf. Spark < 1.3).
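To make the definition concrete, here is a minimal sketch (not from the slides) of building and querying a DataFrame in PySpark, assuming an existing SparkContext sc and SQLContext sqlCtx:

from pyspark.sql import Row

# a tiny DataFrame built from an in-memory RDD of Rows (made-up data)
rows = sc.parallelize([Row(name="Alice", age=34), Row(name="Bob", age=27)])
people = sqlCtx.createDataFrame(rows)

# select / filter, much like R or Pandas
people.filter(people.age > 30).select("name").show()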
	
  
Write Less Code: Input & Output
Spark SQL’s Data Source API can read and write DataFrames using a variety of formats.

[Diagram: built-in and external data sources, including { JSON }, JDBC, and more.]
Write Less Code: High-Level Operations
Common operations can be expressed concisely as calls to the DataFrame API, as sketched after this list:
‱  Selecting required columns
‱  Joining different data sources
‱  Aggregation (count, sum, average, etc.)
‱  Filtering
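A hedged sketch of those four operations with the 1.3-era Python DataFrame API; the users and events tables and their columns are hypothetical:

from pyspark.sql.functions import avg

users  = sqlCtx.table("users").select("user_id", "age", "zip")    # selecting required columns
events = sqlCtx.table("events")
joined = events.join(users, events.user_id == users.user_id)      # joining different data sources
adults = joined.filter(users.age >= 18)                           # filtering
adults.groupBy("zip").agg(avg("age")).collect()                   # aggregation (average age per zip)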
Write Less Code: Compute an Average
Hadoop MapReduce (Java):

private IntWritable one = new IntWritable(1);
private IntWritable output = new IntWritable();

protected void map(LongWritable key,
                   Text value,
                   Context context) {
    String[] fields = value.toString().split("\t");
    output.set(Integer.parseInt(fields[1]));
    context.write(one, output);
}

IntWritable one = new IntWritable(1);
DoubleWritable average = new DoubleWritable();

protected void reduce(IntWritable key,
                      Iterable<IntWritable> values,
                      Context context) {
    int sum = 0;
    int count = 0;
    for (IntWritable value : values) {
        sum += value.get();
        count++;
    }
    average.set(sum / (double) count);
    context.write(key, average);
}

Spark RDDs (Python):

data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()
Write Less Code: Compute an Average

Using RDDs:

data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()

Using DataFrames:

sqlCtx.table("people") \
    .groupBy("name") \
    .agg("name", avg("age")) \
    .collect()
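One note on the DataFrame version: avg comes from the functions module, so a runnable variant of the snippet (a sketch, slightly simplified from the slide) would start with the import:

from pyspark.sql.functions import avg

# average age per name, equivalent to the DataFrame snippet above
sqlCtx.table("people").groupBy("name").agg(avg("age")).collect()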
  	
  
	
  
Full API Docs
‱  Python
‱  Scala
‱  Java
Not Just Less Code: Faster Implementations
[Bar chart: time to aggregate 10 million int pairs (secs) for RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, and DataFrame SQL.]
Demo: Data Sources API
Using Spark SQL to read, write, and transform data in a variety of formats.
http://people.apache.org/~marmbrus/talks/dataframe.demo.pdf
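Not part of the demo PDF, but as a rough sketch of what the Data Source API looks like in the Spark 1.3 Python API (paths and the JDBC connection details are placeholders):

# read JSON, write Parquet through the Data Source API
df = sqlCtx.load("/path/to/events", "json")
df.save("/path/to/events_parquet", "parquet")

# read a table over JDBC (URL and table name are hypothetical)
jdbc_df = sqlCtx.load(source="jdbc",
                      url="jdbc:postgresql://dbserver/analytics",
                      dbtable="public.events")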
Read Less Data
The fastest way to process big data is to never read it.
Spark SQL can help you read less data automatically:
‱  Converting to more efficient formats
‱  Using columnar formats (e.g., Parquet)
‱  Using partitioning (e.g., /year=2014/month=02/
)Âč
‱  Skipping data using statistics (e.g., min, max)ÂČ
‱  Pushing predicates into storage systems (e.g., JDBC)

Âč Only supported for Parquet and Hive; more support coming in Spark 1.4.
ÂČ Turned off by default in Spark 1.3.
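As a small sketch of the first two points (not from the slides): converting a hypothetical JSON dataset to Parquet and reading it back with a filter:

raw = sqlCtx.load("/data/events", "json")           # row-oriented input
raw.save("/data/events_parquet", "parquet")         # convert to a columnar format
events = sqlCtx.load("/data/events_parquet", "parquet")

# With Parquet, filters like this can be pushed toward the scan
# (statistics-based skipping is off by default in 1.3, per footnote 2 above).
events.where(events.city == "New York").select("timestamp").collect()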
	
  
Plan Optimization & Execution

[Diagram of the optimization/execution pipeline: a SQL AST or DataFrame becomes an Unresolved Logical Plan; Analysis (using the Catalog) produces a Logical Plan; Logical Optimization produces an Optimized Logical Plan; Physical Planning produces candidate Physical Plans; a Cost Model selects the Physical Plan; Code Generation turns it into RDDs.]

DataFrames and SQL share the same optimization/execution pipeline.

Optimization happens as late as possible, therefore Spark SQL can optimize even across functions.
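To see this pipeline at work you can ask any DataFrame for its plans; a small sketch, again assuming the people table and sqlCtx from earlier (DataFrame.explain is part of the 1.3 API):

from pyspark.sql.functions import avg

df = sqlCtx.table("people").groupBy("name").agg(avg("age"))
df.explain(True)   # prints the parsed, analyzed, and optimized logical plans plus the physical plan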
def add_demographics(events):
    u = sqlCtx.table("users")                          # Load Hive table
    return (events
            .join(u, events.user_id == u.user_id)      # Join on user_id
            .withColumn("city", zipToCity(u.zip)))     # Run udf to add city column

events = add_demographics(sqlCtx.load("/data/events", "json"))
training_data = events.where(events.city == "New York") \
                      .select(events.timestamp).collect()
[Diagram: Logical Plan: a filter over a join of the events file and the users table; the join is expensive, so ideally we only join the relevant users. Physical Plan: a join of scan (events) with a filter over scan (users).]
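zipToCity is not defined on the slide; it is presumably a Python UDF over the users' zip column. A hypothetical definition using pyspark.sql.functions.udf (the mapping itself is made up for illustration):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# toy zip-to-city mapping, for illustration only
zipToCity = udf(lambda zip_code: "New York" if str(zip_code).startswith("100") else "Other",
                StringType())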
def add_demographics(events):
    u = sqlCtx.table("users")                          # Load partitioned Hive table
    return (events
            .join(u, events.user_id == u.user_id)      # Join on user_id
            .withColumn("city", zipToCity(u.zip)))     # Run udf to add city column

events = add_demographics(sqlCtx.load("/data/events", "parquet"))
training_data = events.where(events.city == "New York") \
                      .select(events.timestamp).collect()

[Diagram: Logical Plan: a filter over a join of the events file and the users table. Physical Plan: a join of scan (events) with a filter over scan (users). Physical Plan with Predicate Pushdown and Column Pruning: a join of optimized scan (events) with optimized scan (users).]
Machine Learning Pipelines
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

df = sqlCtx.load("/path/to/data")
model = pipeline.fit(df)

[Diagram: the pipeline stages tokenizer, hashingTF, and lr transform DataFrames ds0 → ds1 → ds2 → ds3; fitting produces a Pipeline Model in which lr is replaced by lr.model.]
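Once fitted, the same pipeline can score new data; a hedged sketch assuming a test dataset with the same text column (column names follow the spark.ml defaults):

test = sqlCtx.load("/path/to/test/data")
predictions = model.transform(test)          # runs tokenizer, hashingTF, then lr.model
predictions.select("text", "prediction").collect()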
  
Create and Run Spark Programs Faster:
‱  Write less code
‱  Read less data
‱  Let the optimizer do the hard work
Questions?