What’s New in the Spark Community
Patrick Wendell | @pwendell
About Me
Co-Founder of Databricks
Founding committer of Apache Spark at U.C. Berkeley
Today, manage Spark effort @ Databricks
About Databricks
Team donated Spark to ASF in 2013; primary maintainers of Spark today
Hosted analytics stack based on Apache Spark
Managed clusters, notebooks, collaboration, and third-party apps
Today’s Talk
Quick overview of Apache Spark
Technical roadmap directions
Community and ecosystem trends
What is your familiarity with Spark?
1.  Not very familiar with Spark – only very high level.
2.  Understand the components/uses well, but I’ve never written code.
3.  I’ve written Spark code for a POC or production use case.
“Spark is the Taylor Swift
of big data software.”
- Derrick Harris, Fortune
Apache Spark Engine
Spark Core
Libraries: Streaming, SQL and DataFrames, MLlib, GraphX, …
Unified engine across diverse workloads & environments
Scale out, fault tolerant
Python, Java, Scala, and R APIs
Standard libraries
This Talk
“What’s new” in Spark? And what’s coming?
Two parts: Technical roadmap and community developments
“The future is already here — it's just not very evenly distributed.”
- William Gibson
Technical Directions
Spark Technical Directions
Higher-level APIs
Make developers more productive
Performance of key execution primitives
Shuffle, sorting, hashing, and state management
Pluggability and extensibility
Make it easy for other projects to integrate with Spark
Spark Technical Directions
Higher-level APIs
Make developers more productive
Performance of key execution primitives
Shuffle, sorting, hashing, and state management
Pluggability and extensibility
Make it easy for other projects to integrate with Spark
Higher-Level APIs
Making Spark accessible to data scientists, engineers, statisticians…
Computing an Average: MapReduce vs Spark
MapReduce (Java):

private IntWritable one = new IntWritable(1);
private IntWritable output = new IntWritable();

protected void map(LongWritable key, Text value, Context context) {
  String[] fields = value.toString().split("\t");
  output.set(Integer.parseInt(fields[1]));
  context.write(one, output);
}

IntWritable one = new IntWritable(1);
DoubleWritable average = new DoubleWritable();

protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context) {
  int sum = 0;
  int count = 0;
  for (IntWritable value : values) {
    sum += value.get();
    count++;
  }
  average.set(sum / (double) count);
  context.write(key, average);
}

Spark (Python):

data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [x[1], 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()
Computing an Average with Spark
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [x[1], 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()
Computing an Average with DataFrames
	
  
sqlCtx.table("people")
  .groupBy("name")
  .agg("name", avg("age"))
  .collect()
Spark DataFrame API
Explicit data model and schema
Selecting columns and filtering
Aggregation (count, sum, average, etc)
User defined functions
Joining different data sources
Statistical functions and easy plotting
Python, Scala, Java, and R
sqlCtx.table("people")
  .groupBy("name")
  .agg("name", avg("age"))
  .collect()
Ask more of your framework!
MapReduce: fault tolerance, data distribution
Spark adds: set operators, operator DAG, caching
Spark + DataFrames adds: schema management, relational semantics, logical plan optimization, storage push down and optimization, analytic operations
…
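To make these DataFrame operations concrete, here is a minimal PySpark sketch; the people and cities tables, their columns, and the age_bucket helper are illustrative assumptions rather than content from the slides.

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

people = sqlCtx.table("people")                        # explicit schema from the catalog
adults = people.filter(people.age > 21)                # filter rows
names = adults.select("name", "age")                   # select columns

age_bucket = F.udf(lambda a: a // 10, IntegerType())   # user-defined function
decades = people.select(people.name, age_bucket(people.age).alias("decade"))

cities = sqlCtx.table("cities")                        # join a second data source
stats = (people.join(cities, people.city == cities.city)
               .groupBy(cities.country)
               .agg(F.count("name").alias("n"), F.avg("age").alias("avg_age")))
stats.collect()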
Other high-level APIs
ML Pipelines
SparkR
[ML Pipeline diagram: tokenizer → hashingTF → lr produce lr.model, transforming datasets ds0 → ds1 → ds2 → ds3]
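A minimal PySpark sketch of that pipeline (tokenizer, hashingTF, logistic regression); the training and test DataFrames with "text" and "label" columns are assumed for illustration.

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

tokenizer = Tokenizer(inputCol="text", outputCol="words")      # ds0 -> ds1
hashingTF = HashingTF(inputCol="words", outputCol="features")  # ds1 -> ds2
lr = LogisticRegression(maxIter=10, regParam=0.01)

pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(training)        # fits lr.model as the last pipeline stage
predictions = model.transform(test)   # ds3 with an added "prediction" column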
> faithful <- read.df("faithful.json", "json")
> head(filter(faithful, faithful$waiting < 50))
##   eruptions waiting
## 1     1.750      47
## 2     1.750      47
## 3     1.867      48
Spark Technical Directions
Higher-level APIs
Make developers more productive
Performance of key execution primitives
Shuffle, sorting, hashing, and state management
Pluggability and extensibility
Make it easy for other projects to integrate with Spark
Performance Initiatives
Project Tungsten – improving runtime efficiency of key internals
Everything else – IO optimizations, dynamic plan re-writing
Project Tungsten: The CPU Squeeze
           2010              2015
Storage    50+ MB/s (HDD)    500+ MB/s (SSD)    10x
Network    1 Gbps            10 Gbps            10x
CPU        ~3 GHz            ~3 GHz             (no change)
Project Tungsten
Code generation for CPU efficiency
Code generation is on by default and uses Janino [SPARK-7956]
Beefed-up built-in UDF library (~100 UDFs added, with code generation); a usage sketch follows the list below:
AddMonths, ArrayContains, Ascii, Base64, Bin, BinaryMathExpression, CheckOverflow, CombineSets, Contains, CountSet, Crc32,
DateAdd, DateDiff, DateFormatClass, DateSub, DayOfMonth, DayOfYear, Decode, Encode, EndsWith, Explode, Factorial, FindInSet,
FormatNumber, FromUTCTimestamp, FromUnixTime, GetArrayItem, GetJsonObject, GetMapValue, Hex, InSet, InitCap, IsNaN, IsNotNull,
IsNull, LastDay, Length, Levenshtein, Like, Lower, MakeDecimal, Md5, Month, MonthsBetween, NaNvl, NextDay, Not, PromotePrecision,
Quarter, RLike, Round, Second, Sha1, Sha2, ShiftLeft, ShiftRight, ShiftRightUnsigned, SortArray, SoundEx, StartsWith, StringInstr,
StringRepeat, StringReverse, StringSpace, StringSplit, StringTrim, StringTrimLeft, StringTrimRight, TimeAdd, TimeSub, ToDate,
ToUTCTimestamp, TruncDate, UnBase64, UnaryMathExpression, Unhex, UnixTimestamp
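Many of these expressions surface as built-in functions on DataFrames; a minimal PySpark sketch (the df DataFrame and its name, birthday, and signup_ts columns are assumed for illustration):

from pyspark.sql import functions as F

df.select(
    F.initcap(df.name),                      # InitCap
    F.levenshtein(df.name, F.lit("spark")),  # Levenshtein
    F.date_add(df.birthday, 30),             # DateAdd
    F.from_unixtime(df.signup_ts),           # FromUnixTime
    F.sha1(df.name)                          # Sha1
).show()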
  
Project Tungsten
Binary processing for memory management (all data types):
External sorting with managed memory
External hashing with managed memory
Managed Memory HashMap in Tungsten
[Diagram: a compact array of (hash code, pointer) entries referencing key/value pairs laid out in managed memory pages]
Where are we going?
[Diagram: language frontends (Python, Java/Scala, R, SQL, …) feed the DataFrame API and its Logical Plan, which the Tungsten backend can target to JVM, LLVM, GPU, NVRAM, …]
[Diagram: SQL, Python, R, Streaming, and advanced analytics libraries all sit on the DataFrame API, which runs on Tungsten execution]
Spark Technical Directions
Higher-level APIs
Make developers more productive
Performance of key execution primitives
Shuffle, sorting, hashing, and state management
Pluggability and extensibility
Make it easy for other projects to integrate with Spark
Pluggability: Rich IO Support
df = sqlContext.read \
    .format("json") \
    .option("samplingRatio", "0.1") \
    .load("/home/michael/data.json")

df.write \
    .format("parquet") \
    .mode("append") \
    .partitionBy("year") \
    .saveAsTable("fasterData")
Unified interface to reading/writing data in a variety of formats
Large Number of IO Integrations
Spark SQL’s Data Source API can read and write DataFrames using a variety of formats.
Built-in and external sources include JSON, JDBC, and many more…
Find more sources at http://spark-packages.org/
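For example, the built-in JDBC source plugs into the same interface; a minimal PySpark sketch (the connection URL and table name are illustrative):

jdbc_df = sqlContext.read \
    .format("jdbc") \
    .options(url="jdbc:postgresql:dbserver", dbtable="schema.tablename") \
    .load()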
Deployment Integrations
Technical Directions
Early on, the focus was:
Can Spark be an engine that is faster and easier to use than Hadoop
MapReduce?
Today the question is:
Can Spark & its ecosystem make big data as easy as little data?
Community/User Growth
Who is the “Spark Community”?
thousands of users
… hundreds of developers
… dozens of distributors
Getting a better vantage point
Databricks survey - feedback from more than 1,400 users
Community trends: Library & package ecosystem
Strata NY 2014: Widespread use of core RDD API
Today: Most use built-in and community libraries
51% of users use 3 or more libraries
Spark Packages
Strata NY 2014: Didn’t exist
Today: > 100 community packages
> ./bin/spark-shell --packages databricks/spark-avro:0.2
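Once a package such as spark-avro is on the classpath, it is used through the same Data Source API shown earlier; a minimal sketch (the file paths are illustrative):

df = sqlContext.read.format("com.databricks.spark.avro").load("/path/to/episodes.avro")
df.write.format("com.databricks.spark.avro").save("/path/to/output")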
Spark Packages
API extensions: Clojure API, Spark Kernel, Zeppelin Notebook, IndexedRDD
Deployment utilities: Google Compute, Microsoft Azure, Spark Jobserver
Data sources: Redshift, Avro, CSV, Elasticsearch, MongoDB
Increasing storage options
Strata NY 2014: IO primarily through Hadoop InputFormat API
January 2015: Spark adds native storage API
Today: Well over 20 natively integrated storage bindings
Cassandra, ElasticSearch, MongoDB, Avro, Parquet, ORC, HBase,
Redshift, SAP, CSV, Cloudant, Oracle, JDBC, SequoiaDB, Couchbase…
Deployment environments
Strata NY 2014: Traction in the Hadoop community
Today: Growth beyond Hadoop… increasingly public cloud
51% of respondents run Spark in public cloud
Wrapping it up
Spark has grown and developed quickly in the last year!
Looking forward, expect:
-  Engineering effort on higher-level APIs and performance
-  A broader surrounding ecosystem
-  The unexpected
Where to learn more about Spark?
SparkHub community portal
Spark Summit conference - https://spark-summit.org/
Massive online course (edX)
Databricks Spark training
Books
Questions?