Spark DataFrames:
Simple and Fast Analytics
on Structured Data
Michael Armbrust
Spark Summit Amsterdam 2015 - October 28th
Graduated
from Alpha
in 1.3
• Spark SQL
• Part of the core distribution since Spark 1.0 (April 2014)
About Me and Spark SQL
2
[Charts: # of commits per month and # of contributors]
3
SELECT COUNT(*)
FROM hiveTable
WHERE hive_udf(data)
• Spark SQL
• Part of the core distribution since Spark 1.0 (April 2014)
• Runs SQL / HiveQL queries, optionally alongside or
replacing existing Hive deployments
About Me and Spark SQL
Improved
multi-version
support in 1.4
4
• Spark SQL
• Part of the core distribution since Spark 1.0 (April 2014)
• Runs SQL / HiveQL queries, optionally alongside or
replacing existing Hive deployments
• Connect existing BI tools to Spark through JDBC
About Me and Spark SQL
• Spark SQL
• Part of the core distribution since Spark 1.0 (April 2014)
• Runs SQL / HiveQL queries, optionally alongside or
replacing existing Hive deployments
• Connect existing BI tools to Spark through JDBC
• Bindings in Python, Scala, Java, and R
5
About Me and Spark SQL
• Spark SQL
• Part of the core distribution since Spark 1.0 (April 2014)
• Runs SQL / HiveQL queries, optionally alongside or
replacing existing Hive deployments
• Connect existing BI tools to Spark through JDBC
• Bindings in Python, Scala, Java, and R
• @michaelarmbrust
• Creator of Spark SQL @databricks
6
About Me and Spark SQL
The not-so-secret truth...
7
Spark SQL is about more than SQL.
8
Create and run
Spark programs faster with Spark SQL:
• Write less code
• Read less data
• Let the optimizer do the hard work
DataFrame
noun – [dey-tuh-freym]
9
1. A distributed collection of rows organized into
named columns.
2. An abstraction for selecting, filtering, aggregating
and plotting structured data (cf. R, Pandas).
3. Archaic: Previously SchemaRDD (cf. Spark < 1.3).
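For concreteness, here is a minimal PySpark sketch of the definition above, building a DataFrame from a small in-memory dataset and inspecting it; the column names and values are invented for the example, and sqlContext is assumed to exist as in the rest of the deck.

# Hypothetical data: rows organized into named columns.
rows = [("Alice", 34), ("Bob", 29)]
df = sqlContext.createDataFrame(rows, ["name", "age"])
df.printSchema()                  # name: string, age: long
df.filter(df.age > 30).show()     # select/filter like R or Pandas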
Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of formats:
df = sqlContext.read \
  .format("json") \
  .option("samplingRatio", "0.1") \
  .load("/home/michael/data.json")

df.write \
  .format("parquet") \
  .mode("append") \
  .partitionBy("year") \
  .saveAsTable("fasterData")
10
Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of formats:
df = sqlContext.read \
  .format("json") \
  .option("samplingRatio", "0.1") \
  .load("/home/michael/data.json")

df.write \
  .format("parquet") \
  .mode("append") \
  .partitionBy("year") \
  .saveAsTable("fasterData")
read and write  
functions create
new builders for
doing I/O
11
Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of formats:
Builder methods
specify:
• Format
• Partitioning
• Handling of
existing data
df = sqlContext.read \
  .format("json") \
  .option("samplingRatio", "0.1") \
  .load("/home/michael/data.json")

df.write \
  .format("parquet") \
  .mode("append") \
  .partitionBy("year") \
  .saveAsTable("fasterData")
12
Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of formats:
load(…), save(…) or
saveAsTable(…)  
finish the I/O
specification
df = sqlContext.read \
  .format("json") \
  .option("samplingRatio", "0.1") \
  .load("/home/michael/data.json")

df.write \
  .format("parquet") \
  .mode("append") \
  .partitionBy("year") \
  .saveAsTable("fasterData")
13
Read Less Data: Efficient Formats
• Compact binary encoding with intelligent compression
(delta, RLE, etc)
• Each column stored separately with an index that allows
skipping of unread columns
• Support for partitioning (/data/year=2015)
• Data skipping using statistics (column min/max, etc)
14
Parquet (shown above) is an efficient columnar storage format; ORC is another supported columnar format.
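A minimal sketch of how these features pay off in practice, assuming a hypothetical DataFrame df with a year column and made-up paths: writing partitioned Parquet lays the data out as /data/.../year=2015/, and filtering on the partition column lets Spark skip entire partitions.

# Hypothetical paths and columns, for illustration only.
df.write \
  .format("parquet") \
  .partitionBy("year") \
  .save("/data/events_by_year")            # creates /data/events_by_year/year=2015/...

events_2015 = sqlContext.read \
  .format("parquet") \
  .load("/data/events_by_year") \
  .where("year = 2015")                    # partition pruning: only year=2015 files are read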
Write Less Code: Data Source API
Spark SQL's Data Source API can read and write DataFrames
using a variety of formats.
15
Built-in sources: JSON, JDBC, Parquet, ORC, plain text*, …
External sources: find many more at http://spark-packages.org/
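Every source goes through the same read/write builders shown earlier; only the format name and its options change. As a hedged sketch, reading a table over the built-in JDBC source might look like the following (the connection URL and table name are placeholders, and a JDBC driver jar must be on the classpath).

# Hypothetical JDBC connection details.
users = sqlContext.read \
  .format("jdbc") \
  .option("url", "jdbc:postgresql://dbhost:5432/mydb") \
  .option("dbtable", "public.users") \
  .load()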
ETL Using Custom Data Sources
sqlContext.read
  .format("com.databricks.spark.jira")
  .option("url", "https://issues.apache.org/jira/rest/api/latest/search")
  .option("user", "marmbrus")
  .option("password", "*******")
  .option("query", """
    |project = SPARK AND
    |component = SQL AND
    |(status = Open OR status = "In Progress" OR status = Reopened)""".stripMargin)
  .load()
  .repartition(1)
  .write
  .format("parquet")
  .saveAsTable("sparkSqlJira")
16
Load data from JIRA (Spark's bug tracker) using a custom data source.
Write the converted data out to a Parquet table stored in the Hive metastore.
Write Less Code: High-Level Operations
Solve common problems concisely using DataFrame functions (see the sketch below):
• Selecting columns and filtering
• Joining different data sources
• Aggregation (count, sum, average, etc)
• Plotting results with Pandas
17
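A short PySpark sketch combining the operations listed above, using hypothetical users and events tables and columns:

# Hypothetical tables and columns, for illustration only.
users  = sqlContext.table("users")
events = sqlContext.table("events")

per_city = events.join(users, "user_id") \
  .filter(users.age > 21) \
  .groupBy(users.city) \
  .count()

pdf = per_city.toPandas()                    # bring the small result to the driver
pdf.plot(x="city", y="count", kind="bar")    # plot it with Pandas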
Write Less Code: Compute an Average
private IntWritable one = new IntWritable(1);
private IntWritable output = new IntWritable();

protected void map(
    LongWritable key,
    Text value,
    Context context) {
  String[] fields = value.toString().split("\t");
  output.set(Integer.parseInt(fields[1]));
  context.write(one, output);
}

IntWritable one = new IntWritable(1);
DoubleWritable average = new DoubleWritable();

protected void reduce(
    IntWritable key,
    Iterable<IntWritable> values,
    Context context) {
  int sum = 0;
  int count = 0;
  for (IntWritable value : values) {
    sum += value.get();
    count++;
  }
  average.set(sum / (double) count);
  context.write(key, average);
}

data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
  .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
  .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
  .collect()
18
Write Less Code: Compute an Average
Using RDDs
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
  .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
  .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
  .collect()
Using DataFrames
sqlCtx.table("people") \
  .groupBy("name") \
  .agg(avg("age")) \
  .map(lambda …) \
  .collect()
Full API Docs
• Python
• Scala
• Java
• R
19
Using SQL
SELECT name, avg(age)
FROM people
GROUP BY name
Not Just Less Code, Faster Too!
20
[Chart: time to aggregate 10 million int pairs (secs), comparing RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, DataFrame R, and DataFrame SQL]
Plan Optimization & Execution
21
[Diagram: a SQL AST or DataFrame becomes an Unresolved Logical Plan; Analysis (using the Catalog) yields a Logical Plan; Logical Optimization yields an Optimized Logical Plan; Physical Planning yields candidate Physical Plans; a Cost Model selects the Physical Plan; Code Generation turns it into RDDs]
DataFrames and SQL share the same optimization/execution pipeline
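Because DataFrames feed into this same pipeline, you can ask Spark to print what each stage produced; a brief sketch, reusing the events data from the following slides:

# explain(True) prints the parsed, analyzed, and optimized logical plans plus the physical plan.
events = sqlContext.read.json("/data/events")
events.groupBy("user_id").count().explain(True)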
Seamlessly Integrated
Intermix DataFrame operations with
custom Python, Java, R, or Scala code
from pyspark.sql.functions import udf

zipToCity = udf(lambda zipCode: <custom logic here>)

def add_demographics(events):
  u = sqlCtx.table("users")
  return events \
    .join(u, events.user_id == u.user_id) \
    .withColumn("city", zipToCity(events.zip))
Augments any
DataFrame
that contains
user_id
22
Optimize Full Pipelines
Optimization happens as late as possible, so
Spark SQL can optimize even across functions.
23
events = add_demographics(sqlCtx.load("/data/events", "json"))

training_data = events \
  .where(events.city == "Amsterdam") \
  .select(events.timestamp) \
  .collect()
24
def add_demographics(events):
  u = sqlCtx.table("users")                      # Load Hive table
  return (events
    .join(u, events.user_id == u.user_id)        # Join on user_id
    .withColumn("city", zipToCity(events.zip)))  # Run udf to add city column

events = add_demographics(sqlCtx.load("/data/events", "json"))
training_data = events.where(events.city == "Amsterdam").select(events.timestamp).collect()
[Logical Plan: filter by city over a join of the events file and the users table. Joining the full users table is expensive; ideally we only join the relevant users.]
[Physical Plan: join of scan(events) with filter by city over scan(users)]
24
25
def add_demographics(events):
  u = sqlCtx.table("users")                      # Load partitioned Hive table
  return (events
    .join(u, events.user_id == u.user_id)        # Join on user_id
    .withColumn("city", zipToCity(events.zip)))  # Run udf to add city column
Optimized Physical Plan
with Predicate Pushdown
and Column Pruning
[Optimized Physical Plan: join of optimized scan(events) with optimized scan(users)]
events = add_demographics(sqlCtx.load("/data/events", "parquet"))
training_data = events.where(events.city == "Amsterdam").select(events.timestamp).collect()
[Logical Plan: filter by city over a join of the events file and the users table]
[Physical Plan: join of scan(events) with filter by city over scan(users)]
25
Machine Learning Pipelines
26
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

df = sqlCtx.load("/path/to/data")
model = pipeline.fit(df)
[Diagram: the fitted Pipeline Model replaces lr with lr.model and transforms ds0 → ds1 → ds2 → ds3 via tokenizer, hashingTF, and lr.model]
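Once fit, the resulting PipelineModel is applied with transform(); a brief sketch, where the test data path is a placeholder and is assumed to have the same text column as the training data:

# Hypothetical held-out data with the same "text" column used above.
test_df = sqlCtx.load("/path/to/test_data")
predictions = model.transform(test_df)            # runs tokenizer, hashingTF, then the fitted LR model
predictions.select("text", "prediction").show()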
• 100+ native functions with
optimized codegen
implementations
– String manipulation – concat,  
format_string,  lower,  lpad
– Date/Time – current_timestamp,
date_format,  date_add,  …
– Math – sqrt,  randn,   …
– Other –
monotonicallyIncreasingId,  
sparkPartitionId,   …
27
Rich Function Library
from pyspark.sql.functions import *
yesterday = date_sub(current_date(), 1)
df2 = df.filter(df.created_at > yesterday)

import org.apache.spark.sql.functions._
val yesterday = date_sub(current_date(), 1)
val df2 = df.filter(df("created_at") > yesterday)
Added in Spark 1.5
Optimized Execution with
Project Tungsten
Compact encoding, cache-aware algorithms,
runtime code generation
28
The overheads of JVM objects
“abcd”
29
• Native: 4 bytes with UTF-8 encoding
• Java: 48 bytes
java.lang.String object internals:
OFFSET SIZE TYPE DESCRIPTION VALUE
0 4 (object header) ...
4 4 (object header) ...
8 4 (object header) ...
12 4 char[] String.value []
16 4 int String.hash 0
20 4 int String.hash32 0
Instance size: 24 bytes (reported by Instrumentation API)
12 byte object header
8 byte hashcode
20 bytes data + overhead
Tungsten's Compact Encoding
30
(123, "data", "bricks") is encoded as: 0x0 (null bitmap) | 123 | 32L (offset to data) | 48L (offset to data) | 4 "data" | 6 "bricks" (field lengths stored with the variable-length values)
"abcd" with Tungsten encoding: ~5-6 bytes
Runtime Bytecode Generation
31
df.where(df("year")  > 2015)
GreaterThan(year#234,  Literal(2015))
bool filter(Object baseObject) {
  int offset = baseOffset + bitSetWidthInBytes + 3*8L;
  int value = Platform.getInt(baseObject, offset);
  return value > 2015;
}
[Diagram: DataFrame code / SQL → Catalyst expressions → low-level bytecode; Platform.getInt(baseObject, offset) is a JVM intrinsic JIT-ed to pointer arithmetic]
• Type-safe: operate on domain
objects with compiled lambda
functions
• Fast: Code-generated
encoders for fast serialization
• Interoperable: Easily convert
DataFrame ↔ Dataset
without boilerplate
32
Coming soon: Datasets
val df = ctx.read.json("people.json")
//  Convert  data  to  domain  objects.
case class Person(name:  String,  age:  Int)
val ds: Dataset[Person]  = df.as[Person]
ds.filter(_.age  > 30)
//  Compute  histogram  of  age  by  name.
val hist =  ds.groupBy(_.name).mapGroups {
case (name,  people:  Iter[Person])  =>
val buckets =  new Array[Int](10)            
people.map(_.age).foreach {  a  =>
buckets(a  / 10)  += 1
}                  
(name,  buckets)
}
Preview in Spark 1.6
33
Create and run your
Spark programs faster with Spark SQL:
• Write less code
• Read less data
• Let the optimizer do the hard work
Questions?
Committer Office Hours
Weds, 4:00-5:00 pm Michael
Thurs, 10:30-11:30 am Reynold
Thurs, 2:00-3:00 pm Andrew
