Spark DataFrames:
Simple and Fast Analytics
on Structured Data
Michael Armbrust
Spark Summit Amsterdam 2015 - October 28th
Graduated
from Alpha
in 1.3
• Spark SQL
• Part of the core distribution since Spark 1.0 (April 2014)
About Me and Spark SQL
2
[Charts: # of commits per month and # of contributors]
3
SELECT COUNT(*)
FROM hiveTable
WHERE hive_udf(data)
• Spark SQL
• Part of the core distribution since Spark 1.0 (April 2014)
• Runs SQL / HiveQL queries, optionally alongside or
replacing existing Hive deployments
About Me and Spark SQL
Improved
multi-version
support in 1.4
4
• Spark SQL
• Part of the core distribution since Spark 1.0 (April 2014)
• Runs SQL / HiveQL queries, optionally alongside or
replacing existing Hive deployments
• Connect existing BI tools to Spark through JDBC
About Me and Spark SQL
• Spark SQL
• Part of the core distribution since Spark 1.0 (April 2014)
• Runs SQL / HiveQL queries, optionally alongside or
replacing existing Hive deployments
• Connect existing BI tools to Spark through JDBC
• Bindings in Python, Scala, Java, and R
5
About Me and Spark SQL
• Spark SQL
• Part of the core distribution since Spark 1.0 (April 2014)
• Runs SQL / HiveQL queries, optionally alongside or
replacing existing Hive deployments
• Connect existing BI tools to Spark through JDBC
• Bindings in Python, Scala, Java, and R
• @michaelarmbrust
• Creator of Spark SQL @databricks
6
About Me and Spark SQL
The not-so-secret truth...
7
Spark SQL is about more than SQL.
8
Create and run
Spark programs faster with Spark SQL:
• Write less code
• Read less data
• Let the optimizer do the hard work
DataFrame
noun – [dey-tuh-freym]
9
1. A distributed collection of rows organized into
named columns.
2. An abstraction for selecting, filtering, aggregating
and plotting structured data (cf. R, Pandas).
3. Archaic: Previously SchemaRDD (cf. Spark < 1.3).
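For concreteness, here is a minimal PySpark sketch of the definition above, building a DataFrame from a small in-memory dataset and inspecting it; the column names and values are invented for the example, and sqlContext is assumed to exist as in the rest of the deck.

# Hypothetical data: rows organized into named columns.
rows = [("Alice", 34), ("Bob", 29)]
df = sqlContext.createDataFrame(rows, ["name", "age"])
df.printSchema()                  # name: string, age: long
df.filter(df.age > 30).show()     # select/filter like R or Pandas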
Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of formats:
df = sqlContext.read \
  .format("json") \
  .option("samplingRatio", "0.1") \
  .load("/home/michael/data.json")

df.write \
  .format("parquet") \
  .mode("append") \
  .partitionBy("year") \
  .saveAsTable("fasterData")
10
Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of formats:
df = sqlContext.read \
  .format("json") \
  .option("samplingRatio", "0.1") \
  .load("/home/michael/data.json")

df.write \
  .format("parquet") \
  .mode("append") \
  .partitionBy("year") \
  .saveAsTable("fasterData")
read and write  
functions create
new builders for
doing I/O
11
Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of formats:
Builder methods
specify:
• Format
• Partitioning
• Handling of
existing data
df = sqlContext.read \
  .format("json") \
  .option("samplingRatio", "0.1") \
  .load("/home/michael/data.json")

df.write \
  .format("parquet") \
  .mode("append") \
  .partitionBy("year") \
  .saveAsTable("fasterData")
12
Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of formats:
load(…), save(…) or
saveAsTable(…)  
finish the I/O
specification
df = sqlContext.read \
  .format("json") \
  .option("samplingRatio", "0.1") \
  .load("/home/michael/data.json")

df.write \
  .format("parquet") \
  .mode("append") \
  .partitionBy("year") \
  .saveAsTable("fasterData")
13
Read Less Data: Efficient Formats
• Compact binary encoding with intelligent compression
(delta, RLE, etc)
• Each column stored separately with an index that allows
skipping of unread columns
• Support for partitioning (/data/year=2015)
• Data skipping using statistics (column min/max, etc)
14
Parquet (shown above) is an efficient columnar storage format; ORC is another supported columnar format.
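A minimal sketch of how these features pay off in practice, assuming a hypothetical DataFrame df with a year column and made-up paths: writing partitioned Parquet lays the data out as /data/.../year=2015/, and filtering on the partition column lets Spark skip entire partitions.

# Hypothetical paths and columns, for illustration only.
df.write \
  .format("parquet") \
  .partitionBy("year") \
  .save("/data/events_by_year")            # creates /data/events_by_year/year=2015/...

events_2015 = sqlContext.read \
  .format("parquet") \
  .load("/data/events_by_year") \
  .where("year = 2015")                    # partition pruning: only year=2015 files are read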
Write Less Code: Data Source API
Spark SQL's Data Source API can read and write DataFrames
using a variety of formats.
15
Built-in sources: JSON, JDBC, Parquet, ORC, plain text*, …
External sources: find many more at http://spark-packages.org/
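Every source goes through the same read/write builders shown earlier; only the format name and its options change. As a hedged sketch, reading a table over the built-in JDBC source might look like the following (the connection URL and table name are placeholders, and a JDBC driver jar must be on the classpath).

# Hypothetical JDBC connection details.
users = sqlContext.read \
  .format("jdbc") \
  .option("url", "jdbc:postgresql://dbhost:5432/mydb") \
  .option("dbtable", "public.users") \
  .load()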
ETL Using Custom Data Sources
sqlContext.read
  .format("com.databricks.spark.jira")
  .option("url", "https://issues.apache.org/jira/rest/api/latest/search")
  .option("user", "marmbrus")
  .option("password", "*******")
  .option("query", """
    |project = SPARK AND
    |component = SQL AND
    |(status = Open OR status = "In Progress" OR status = Reopened)""".stripMargin)
  .load()
  .repartition(1)
  .write
  .format("parquet")
  .saveAsTable("sparkSqlJira")
16
Load data from JIRA (Spark's bug tracker) using a custom data source.
Write the converted data out to a Parquet table stored in the Hive metastore.
Write Less Code: High-Level Operations
Solve common problems concisely using DataFrame functions (see the sketch below):
• Selecting columns and filtering
• Joining different data sources
• Aggregation (count, sum, average, etc)
• Plotting results with Pandas
17
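A short PySpark sketch combining the operations listed above, using hypothetical users and events tables and columns:

# Hypothetical tables and columns, for illustration only.
users  = sqlContext.table("users")
events = sqlContext.table("events")

per_city = events.join(users, "user_id") \
  .filter(users.age > 21) \
  .groupBy(users.city) \
  .count()

pdf = per_city.toPandas()                    # bring the small result to the driver
pdf.plot(x="city", y="count", kind="bar")    # plot it with Pandas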
Write Less Code: Compute an Average
private IntWritable one = new IntWritable(1);
private IntWritable output = new IntWritable();

protected void map(
    LongWritable key,
    Text value,
    Context context) {
  String[] fields = value.toString().split("\t");
  output.set(Integer.parseInt(fields[1]));
  context.write(one, output);
}

IntWritable one = new IntWritable(1);
DoubleWritable average = new DoubleWritable();

protected void reduce(
    IntWritable key,
    Iterable<IntWritable> values,
    Context context) {
  int sum = 0;
  int count = 0;
  for (IntWritable value : values) {
    sum += value.get();
    count++;
  }
  average.set(sum / (double) count);
  context.write(key, average);
}

data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
  .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
  .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
  .collect()
18
Write Less Code: Compute an Average
Using RDDs
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
  .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
  .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
  .collect()
Using DataFrames
sqlCtx.table("people") \
  .groupBy("name") \
  .agg(avg("age")) \
  .map(lambda …) \
  .collect()
Full API Docs
• Python
• Scala
• Java
• R
19
Using SQL
SELECT name, avg(age)
FROM people
GROUP BY name
Not Just Less Code, Faster Too!
20
[Chart: time to aggregate 10 million int pairs (secs), comparing RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, DataFrame R, and DataFrame SQL]
Plan Optimization & Execution
21
[Diagram: a SQL AST or DataFrame becomes an Unresolved Logical Plan; Analysis (using the Catalog) yields a Logical Plan; Logical Optimization yields an Optimized Logical Plan; Physical Planning yields candidate Physical Plans; a Cost Model selects the Physical Plan; Code Generation turns it into RDDs]
DataFrames and SQL share the same optimization/execution pipeline
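Because DataFrames feed into this same pipeline, you can ask Spark to print what each stage produced; a brief sketch, reusing the events data from the following slides:

# explain(True) prints the parsed, analyzed, and optimized logical plans plus the physical plan.
events = sqlContext.read.json("/data/events")
events.groupBy("user_id").count().explain(True)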
Seamlessly Integrated
Intermix DataFrame operations with
custom Python, Java, R, or Scala code
from pyspark.sql.functions import udf

zipToCity = udf(lambda zipCode: <custom logic here>)

def add_demographics(events):
  u = sqlCtx.table("users")
  return events \
    .join(u, events.user_id == u.user_id) \
    .withColumn("city", zipToCity(events.zip))
Augments any
DataFrame
that contains
user_id
22
Optimize Full Pipelines
Optimization happens as late as possible, so
Spark SQL can optimize even across functions.
23
events = add_demographics(sqlCtx.load("/data/events", "json"))

training_data = events \
  .where(events.city == "Amsterdam") \
  .select(events.timestamp) \
  .collect()
24
def add_demographics(events):
  u = sqlCtx.table("users")                      # Load Hive table
  return (events
    .join(u, events.user_id == u.user_id)        # Join on user_id
    .withColumn("city", zipToCity(events.zip)))  # Run udf to add city column

events = add_demographics(sqlCtx.load("/data/events", "json"))
training_data = events.where(events.city == "Amsterdam").select(events.timestamp).collect()
[Logical Plan: filter by city over a join of the events file and the users table. Joining the full users table is expensive; ideally we only join the relevant users.]
[Physical Plan: join of scan(events) with filter by city over scan(users)]
24
25
def add_demographics(events):
  u = sqlCtx.table("users")                      # Load partitioned Hive table
  return (events
    .join(u, events.user_id == u.user_id)        # Join on user_id
    .withColumn("city", zipToCity(events.zip)))  # Run udf to add city column
Optimized Physical Plan
with Predicate Pushdown
and Column Pruning
[Optimized Physical Plan: join of optimized scan(events) with optimized scan(users)]
events = add_demographics(sqlCtx.load("/data/events", "parquet"))
training_data = events.where(events.city == "Amsterdam").select(events.timestamp).collect()
[Logical Plan: filter by city over a join of the events file and the users table]
[Physical Plan: join of scan(events) with filter by city over scan(users)]
25
Machine Learning Pipelines
26
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

df = sqlCtx.load("/path/to/data")
model = pipeline.fit(df)
[Diagram: the fitted Pipeline Model replaces lr with lr.model and transforms ds0 → ds1 → ds2 → ds3 via tokenizer, hashingTF, and lr.model]
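Once fit, the resulting PipelineModel is applied with transform(); a brief sketch, where the test data path is a placeholder and is assumed to have the same text column as the training data:

# Hypothetical held-out data with the same "text" column used above.
test_df = sqlCtx.load("/path/to/test_data")
predictions = model.transform(test_df)            # runs tokenizer, hashingTF, then the fitted LR model
predictions.select("text", "prediction").show()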
• 100+ native functions with
optimized codegen
implementations
– String manipulation – concat,  
format_string,  lower,  lpad
– Date/Time – current_timestamp,
date_format,  date_add,  …
– Math – sqrt,  randn,   …
– Other –
monotonicallyIncreasingId,  
sparkPartitionId,   …
27
Rich Function Library
from pyspark.sql.functions import *
yesterday = date_sub(current_date(), 1)
df2 = df.filter(df.created_at > yesterday)

import org.apache.spark.sql.functions._
val yesterday = date_sub(current_date(), 1)
val df2 = df.filter(df("created_at") > yesterday)
Added in Spark 1.5
Optimized Execution with
Project Tungsten
Compact encoding, cache-aware algorithms,
runtime code generation
28
The overheads of JVM objects
“abcd”
29
• Native: 4 bytes with UTF-8 encoding
• Java: 48 bytes
java.lang.String object internals:
OFFSET SIZE TYPE DESCRIPTION VALUE
0 4 (object header) ...
4 4 (object header) ...
8 4 (object header) ...
12 4 char[] String.value []
16 4 int String.hash 0
20 4 int String.hash32 0
Instance size: 24 bytes (reported by Instrumentation API)
12 byte object header
8 byte hashcode
20 bytes data + overhead
Tungsten's Compact Encoding
30
(123, "data", "bricks") is encoded as: 0x0 (null bitmap) | 123 | 32L (offset to data) | 48L (offset to data) | 4 "data" | 6 "bricks" (field lengths stored with the variable-length values)
"abcd" with Tungsten encoding: ~5-6 bytes
Runtime Bytecode Generation
31
df.where(df("year")  > 2015)
GreaterThan(year#234,  Literal(2015))
bool filter(Object baseObject) {
  int offset = baseOffset + bitSetWidthInBytes + 3*8L;
  int value = Platform.getInt(baseObject, offset);
  return value > 2015;
}
[Diagram: DataFrame code / SQL → Catalyst expressions → low-level bytecode; Platform.getInt(baseObject, offset) is a JVM intrinsic JIT-ed to pointer arithmetic]
• Type-safe: operate on domain
objects with compiled lambda
functions
• Fast: Code-generated
encoders for fast serialization
• Interoperable: Easily convert
DataFrame ↔ Dataset
without boilerplate
32
Coming soon: Datasets
val df = ctx.read.json("people.json")
//  Convert  data  to  domain  objects.
case class Person(name:  String,  age:  Int)
val ds: Dataset[Person]  = df.as[Person]
ds.filter(_.age  > 30)
//  Compute  histogram  of  age  by  name.
val hist =  ds.groupBy(_.name).mapGroups {
case (name,  people:  Iter[Person])  =>
val buckets =  new Array[Int](10)            
people.map(_.age).foreach {  a  =>
buckets(a  / 10)  += 1
}                  
(name,  buckets)
}
Preview in Spark 1.6
33
Create and run your
Spark programs faster with Spark SQL:
• Write less code
• Read less data
• Let the optimizer do the hard work
Questions?
Committer Office Hours
Weds, 4:00-5:00 pm Michael
Thurs, 10:30-11:30 am Reynold
Thurs, 2:00-3:00 pm Andrew
