DATA-CENTRIC
METAPROGRAMMING
Vlad Ureche
PhD in the Scala Team @ EPFL. Soon to graduate ;)
● Working on program transformations focusing on data representation
● Author of miniboxing, which improves generics performance by up to 20x
● Contributed to the Scala compiler and to the scaladoc tool.
@VladUreche
vlad.ureche@gmail.com
scala-miniboxing.org
Research ahead*
* This may not make it into a product,
but you can play with it nevertheless.
STOP
Please ask if things
are not clear!
Motivation
Transformation
Applications
Challenges
Conclusion
Spark
Motivation
Comparison graph from http://fr.slideshare.net/databricks/spark-summit-eu-2015-spark-dataframes-simple-and-fast-analysis-of-
structured-data and used with permission.
Motivation
Comparison graph from http://fr.slideshare.net/databricks/spark-summit-eu-2015-spark-dataframes-simple-and-fast-analysis-of-
structured-data and used with permission.
Performance gap between
RDDs and DataFrames
Motivation
RDD
● strongly typed
● slower
DataFrame
● dynamically typed
● faster
Dataset
● strongly typed
● faster, but only mid-way
Why just mid-way?
What can we do to speed them up?
Object Composition
class Vector[T] { … }
The Vector collection
in the Scala library
class Employee(...)
ID NAME SALARY
Corresponds to
a table row
Vector[Employee]
ID NAME SALARY
ID NAME SALARY
Traversal requires
dereferencing a pointer
for each employee.
A Better Representation
Vector[Employee]
ID NAME SALARY
ID NAME SALARY
EmployeeVector
ID ID ...
NAME NAME ...
SALARY SALARY ...
● more efficient heap usage
● faster iteration
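The struct-of-arrays layout sketched above can be written down in plain Scala. This is an illustrative sketch (in the actual system the EmployeeVector representation is produced by the transformation described later); the field and method names are hypothetical:

```scala
// Hypothetical sketch of the column-oriented layout: one array per
// field instead of one heap object per employee.
case class Employee(id: Int, name: String, salary: Float)

class EmployeeVector(
  val ids: Array[Int],
  val names: Array[String],
  val salaries: Array[Float]
) {
  def length: Int = ids.length
  // Materializes an Employee only on demand; iterating a single
  // column (e.g. salaries) touches no Employee objects at all.
  def apply(i: Int): Employee = Employee(ids(i), names(i), salaries(i))
  def salarySum: Float = salaries.sum
}

val v = new EmployeeVector(
  Array(1, 2), Array("Ann", "Bob"), Array(100f, 200f))
```

Summing the salary column walks one primitive array, with no pointer dereference per employee.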
The Problem
● Vector[T] is unaware of Employee
  – Which makes Vector[Employee] suboptimal
● Not limited to Vector, other classes also affected
  – Spark pain point: Functions/closures
  – We'd like a "structured" representation throughout
Challenge: No means of
communicating this
to the compiler
Choice: Safe or Fast
This is where my
work comes in...
Data-Centric Metaprogramming
● A compiler plug-in that allows
tuning the data representation
● Website: scala-ildl.org
Motivation
Transformation
Applications
Challenges
Conclusion
Spark
Transformation
Definition (by the programmer):
● can't be automated
● based on experience
● based on speculation
● one-time effort
Application (by the compiler, automated):
● repetitive and complex
● affects code readability
● is verbose
● is error-prone
Data-Centric Metaprogramming
object VectorOfEmployeeOpt extends Transformation {
  // What to transform? What to transform to?
  type Target = Vector[Employee]
  type Result = EmployeeVector

  // How to transform?
  def toResult(t: Target): Result = ...
  def toTarget(t: Result): Target = ...

  // How to run methods on the updated representation?
  def bypass_length: Int = ...
  def bypass_apply(i: Int): Employee = ...
  def bypass_update(i: Int, v: Employee) = ...
  def bypass_toString: String = ...
  ...
}
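The elided toResult/toTarget bodies could look like the following plain-Scala sketch. It is written outside the plugin, so it does not extend the actual Transformation trait; Employee and EmployeeVector are the illustrative definitions used throughout:

```scala
case class Employee(id: Int, name: String, salary: Float)
class EmployeeVector(val ids: Array[Int], val names: Array[String],
                     val salaries: Array[Float])

// Sketch of the two conversion directions a Transformation object
// must provide; the plugin invokes these at scope boundaries.
def toResult(t: Vector[Employee]): EmployeeVector =
  new EmployeeVector(
    t.map(_.id).toArray, t.map(_.name).toArray, t.map(_.salary).toArray)

def toTarget(r: EmployeeVector): Vector[Employee] =
  r.ids.indices.map(i =>
    Employee(r.ids(i), r.names(i), r.salaries(i))).toVector

// The conversions should round-trip without losing information.
val round = toTarget(toResult(Vector(Employee(1, "Ann", 100f))))
```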
http://infoscience.epfl.ch/record/207050?ln=en
Motivation
Transformation
Applications
Challenges
Conclusion
Spark
Open World
Best Representation?
Composition
Scenario
class Employee(...)
ID NAME SALARY
class Vector[T] { … }
Vector[Employee]
ID NAME SALARY
ID NAME SALARY
EmployeeVector
ID ID ...
NAME NAME ...
SALARY SALARY ...
class NewEmployee(...)
extends Employee(...)
ID NAME SALARY DEPT
Oooops...
Open World Assumption
● Globally anything can happen
● Locally you have full control:
  – Make class Employee final or
  – Limit the transformation to code that uses Employee
How? Using Scopes!
Scopes
transform(VectorOfEmployeeOpt) {
  def indexSalary(employees: Vector[Employee],
                  by: Float): Vector[Employee] =
    for (employee ← employees)
      yield employee.copy(
        salary = (1 + by) * employee.salary
      )
}
Now the method operates
on the EmployeeVector
representation.
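Conceptually, inside the scope the method body is rewritten to operate on the columnar data. A hand-written sketch of the kind of code that results (illustrative, not the plugin's actual output; EmployeeVector as before):

```scala
class EmployeeVector(val ids: Array[Int], val names: Array[String],
                     val salaries: Array[Float])

// Sketch of what indexSalary becomes after the transformation:
// no Employee objects are allocated, only the salary column changes.
def indexSalary(employees: EmployeeVector, by: Float): EmployeeVector =
  new EmployeeVector(
    employees.ids,
    employees.names,
    employees.salaries.map(s => (1 + by) * s))

val raised = indexSalary(
  new EmployeeVector(Array(1), Array("Ann"), Array(100f)), 0.1f)
```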
Scopes
● Can wrap statements, methods, even entire classes
  – Inlined immediately after the parser
  – Definitions are visible outside the "scope"
● Mark locally closed parts of the code
  – Incoming/outgoing values go through conversions
  – You can reject unexpected values
Motivation
Transformation
Applications
Challenges
Conclusion
Spark
Open World
Best Representation?
Composition
Best Representation?
It depends.
Vector[Employee]
ID NAME SALARY
ID NAME SALARY
EmployeeVector
ID ID ...
NAME NAME ...
SALARY SALARY ...
Tungsten repr.
<compressed binary blob>
EmployeeJSON
{
  id: 123,
  name: "John Doe",
  salary: 100
}
Scopes allow mixing data representations
transform(VectorOfEmployeeOpt) {
  def indexSalary(employees: Vector[Employee],
                  by: Float): Vector[Employee] =
    for (employee ← employees)
      yield employee.copy(
        salary = (1 + by) * employee.salary
      )
}
Operating on the
EmployeeVector
representation.
Scopes
transform(VectorOfEmployeeCompact) {
  def indexSalary(employees: Vector[Employee],
                  by: Float): Vector[Employee] =
    for (employee ← employees)
      yield employee.copy(
        salary = (1 + by) * employee.salary
      )
}
Operating on the
compact binary
representation.
Scopes
transform(VectorOfEmployeeJSON) {
  def indexSalary(employees: Vector[Employee],
                  by: Float): Vector[Employee] =
    for (employee ← employees)
      yield employee.copy(
        salary = (1 + by) * employee.salary
      )
}
Operating on the
JSON-based
representation.
Motivation
Transformation
Applications
Challenges
Conclusion
Spark
Open World
Best Representation?
Composition
Composition
● Code can be
  – Left untransformed (using the original representation)
  – Transformed using different representations
Calling:
● Original code calling original code:
easy one, do nothing.
● Transformed code calling code under the same transformation:
hard one, do not introduce any conversions,
even across separate compilation.
● Original code calling transformed code (or the other way around):
automatically introduce conversions between
values in the two representations,
e.g. EmployeeVector → Vector[Employee] or back.
● Transformed code calling code under a different transformation:
hard one, automatically introduce double
conversions (and warn the programmer),
e.g. EmployeeVector → Vector[Employee] → CompactEmpVector.
Composition also covers overriding:
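What the inserted conversions do at such a boundary can be pictured with a plain-Scala sketch (hypothetical names; in the real system the compiler emits the equivalent bridging code automatically):

```scala
case class Employee(id: Int, salary: Float)
class EmployeeVector(val ids: Array[Int], val salaries: Array[Float])

def toResult(t: Vector[Employee]): EmployeeVector =
  new EmployeeVector(t.map(_.id).toArray, t.map(_.salary).toArray)
def toTarget(r: EmployeeVector): Vector[Employee] =
  r.ids.indices.map(i => Employee(r.ids(i), r.salaries(i))).toVector

// Transformed code: works on EmployeeVector.
def totalSalary(es: EmployeeVector): Float = es.salaries.sum

// Original (untransformed) caller holds a Vector[Employee]; the
// compiler bridges the call by inserting toResult at the boundary.
val employees = Vector(Employee(1, 100f), Employee(2, 200f))
val total = totalSalary(toResult(employees))
```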
Scopes
trait Printer[T] {
  def print(elements: Vector[T]): Unit
}

class EmployeePrinter extends Printer[Employee] {
  def print(employee: Vector[Employee]) = ...
}
Method print in the class
implements method print in the trait.

transform(VectorOfEmployeeOpt) {
  class EmployeePrinter extends Printer[Employee] {
    def print(employee: Vector[Employee]) = ...
  }
}
The signature of method print changes
according to the transformation → it no
longer implements the trait.
Taken care of by the compiler for you!
Motivation
Transformation
Applications
Challenges
Conclusion
Spark
Open World
Best Representation?
Composition
Column-oriented Storage
Vector[Employee]
ID NAME SALARY
ID NAME SALARY
EmployeeVector
ID ID ...
NAME NAME ...
SALARY SALARY ...
Iteration is 5x faster.
Retrofitting value class status
(3,5) is a reference to a heap object: [Header | 3 | 5]
Tuples in Scala are specialized but
are still objects (not value classes)
= not as optimized as they could be.
Encoding (3,5) as a single long, (3L << 32) + 5:
14x faster, lower heap requirements.
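Spelled out, the encoding packs both Int fields of the tuple into one Long, so no heap object is needed at all. A small self-contained sketch:

```scala
// Pack a pair of Ints into a single Long: the high 32 bits hold the
// first element, the low 32 bits the second. No allocation involved.
def pack(a: Int, b: Int): Long =
  (a.toLong << 32) | (b.toLong & 0xFFFFFFFFL)

// Unpack the two fields with a shift and a truncation.
def fst(packed: Long): Int = (packed >> 32).toInt
def snd(packed: Long): Int = packed.toInt

val p = pack(3, 5)
```

The mask on the second element keeps negative values from clobbering the high half, and the arithmetic shift in `fst` restores the sign of the first element.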
Deforestation
List(1,2,3).map(_ + 1).map(_ * 2).sum
builds the intermediate lists List(2,3,4) and List(4,6,8)
before computing the result, 18.

transform(ListDeforestation) {
  List(1,2,3).map(_ + 1).map(_ * 2).sum
}
accumulates the functions instead of building lists,
then computes 18 in a single pass:
6x faster.
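The accumulate-then-compute idea can be sketched by composing the mapped functions and applying the composition once per element during the final fold (FusedList is an illustrative name, not the plugin's API):

```scala
// A fused pipeline: instead of building List(2,3,4) and List(4,6,8),
// map only composes functions; sum traverses the source once.
final case class FusedList[A](source: List[Int], f: Int => A) {
  def map[B](g: A => B): FusedList[B] = FusedList(source, f.andThen(g))
  def sum(implicit num: Numeric[A]): A =
    source.foldLeft(num.zero)((acc, x) => num.plus(acc, f(x)))
}

val result = FusedList(List(1, 2, 3), (x: Int) => x)
  .map(_ + 1).map(_ * 2).sum   // single traversal, no intermediate lists
```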
Motivation
Transformation
Applications
Challenges
Conclusion
Spark
Open World
Best Representation?
Composition
Spark
● Optimizations
  – DataFrames do deforestation
  – DataFrames do predicate push-down
  – DataFrames do code generation
  – RDDs do none of these: this is what makes them slower
  – Datasets do them as well
● Code is specialized for the data representation
● Functions are specialized for the data representation
User Functions
serialized data → encoded data → decode → user function f: X → Y → encode → encoded data
Decoding allocates an object, and so does encoding.
Modified user function
(automatically derived by the compiler):
runs directly on the encoded data,
skipping the decode and encode steps.
Nowhere near as simple as it looks.
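A toy version of the derivation: the naive path decodes into an object, applies f, then re-encodes, while the derived function computes the same result straight on the encoding. The packed-Long encoding and all names here are illustrative, not Spark's actual encoders:

```scala
// Encoded representation: a pair of Ints packed into one Long.
def encode(p: (Int, Int)): Long =
  (p._1.toLong << 32) | (p._2.toLong & 0xFFFFFFFFL)
def decode(e: Long): (Int, Int) = ((e >> 32).toInt, e.toInt)

// User function on objects.
val f: ((Int, Int)) => (Int, Int) = { case (a, b) => (a + 1, b * 2) }

// Naive path: decode, apply, encode; two allocations per element.
def naive(e: Long): Long = encode(f(decode(e)))

// Derived function: same result, straight on the encoding, no tuples.
def derived(e: Long): Long =
  (((e >> 32) + 1) << 32) | ((e.toInt * 2).toLong & 0xFFFFFFFFL)
```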
Challenge: Transformation not possible
● Example: Calling an outside (untransformed) method
● Solution: Issue compiler warnings
  – Explain why it's not possible: due to the method call
  – Suggest how to fix it: enclose the method in a scope
● Reuse the machinery in miniboxing
scala-miniboxing.org
Challenge: Internal API changes
● Spark internals rely on Iterator[T]
  – Requires materializing values
  – Needs to be replaced throughout the code base
  – By rather complex buffers
● Solution: Extensive refactoring/rewrite
Challenge: Automation
● Existing code should run out of the box
● Solution:
  – Adapt data-centric metaprogramming to Spark
  – Trade generality for simplicity
  – Do the right thing for most of the cases
Where are we now?
Where are we now?
Prototype Hack
● Modified version of Spark core
  – RDD data representation is configurable
● It's very limited:
  – Custom data repr. only in map, filter and flatMap
  – Otherwise we revert to costly objects
  – Large parts of the automation still need to be done
Prototype Hack
sc.parallelize(/* 1 million */ records).
  map(x => ...).
  filter(x => ...).
  collect()
Not yet 2x faster,
but 1.45x faster.
Motivation
Transformation
Applications
Challenges
Conclusion
Spark
Open World
Best Representation?
Composition
Conclusion
● Object-oriented composition → inefficient representation
● Solution: data-centric metaprogramming
  – Opaque data → Structured data
  – Is it possible? Yes.
  – Is it easy? Not really.
  – Is it worth it? You tell me!
Thank you!
Check out scala-ildl.org.
Deforestation and Language Semantics
● Notice that we changed language semantics:
  – Before: collections were eager
  – After: collections are lazy
  – This can lead to effects reordering
● Such transformations are only acceptable with
programmer consent
  – JIT compilers/staged DSLs can't change semantics
  – Metaprogramming (macros) can, but it should be
documented/opt-in
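The effects reordering is easy to observe with the standard library's lazy views, which fuse the two maps the same way deforestation does:

```scala
import scala.collection.mutable.ListBuffer

val log = ListBuffer[String]()

// Eager: the first map runs for ALL elements before the second starts.
List(1, 2).map { x => log += s"a$x"; x }.map { x => log += s"b$x"; x }
val eagerOrder = log.toList            // a1, a2, b1, b2

log.clear()

// Lazy (fused): both functions run per element, interleaved.
List(1, 2).view.map { x => log += s"a$x"; x }
          .map { x => log += s"b$x"; x }.toList
val lazyOrder = log.toList             // a1, b1, a2, b2
```

Any side effects in the mapped functions are observed in a different order, which is why the reordering needs the programmer's consent.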
Code Generation
● Also known as
  – Deep Embedding
  – Multi-Stage Programming
● Awesome speedups, but restricted to small DSLs
● SparkSQL uses code gen to improve performance
  – By 2-4x over Spark
Low-level Optimizers
● Java JIT Compiler
  – Access to the low-level code
  – Can assume a (local) closed world
  – Can speculate based on profiles
● Best optimizations break semantics
  – You can't do this in the JIT compiler!
  – Only the programmer can decide to break semantics
Scala Macros
● Many optimizations can be done with macros
  – :) Lots of power
  – :( Lots of responsibility:
    ● Scala compiler invariants
    ● Object-oriented model
    ● Modularity
● Can we restrict macros so they're safer?
  – Data-centric metaprogramming

Getting Ready to Use Redis with Apache Spark with Dvir Volk
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...

Recently uploaded (20)

PDF
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
PPTX
05. PRACTICAL GUIDE TO MICROSOFT EXCEL.pptx
PPTX
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
PPTX
Business Ppt On Nestle.pptx huunnnhhgfvu
PPTX
IB Computer Science - Internal Assessment.pptx
PPTX
Data_Analytics_and_PowerBI_Presentation.pptx
PPTX
mbdjdhjjodule 5-1 rhfhhfjtjjhafbrhfnfbbfnb
PDF
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
PDF
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
PPTX
Computer network topology notes for revision
PPT
Quality review (1)_presentation of this 21
PDF
.pdf is not working space design for the following data for the following dat...
PDF
Mega Projects Data Mega Projects Data
PDF
Launch Your Data Science Career in Kochi – 2025
PPT
Miokarditis (Inflamasi pada Otot Jantung)
PDF
Galatica Smart Energy Infrastructure Startup Pitch Deck
PPTX
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
PPTX
Introduction-to-Cloud-ComputingFinal.pptx
PPTX
iec ppt-1 pptx icmr ppt on rehabilitation.pptx
PPTX
Acceptance and paychological effects of mandatory extra coach I classes.pptx
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
05. PRACTICAL GUIDE TO MICROSOFT EXCEL.pptx
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
Business Ppt On Nestle.pptx huunnnhhgfvu
IB Computer Science - Internal Assessment.pptx
Data_Analytics_and_PowerBI_Presentation.pptx
mbdjdhjjodule 5-1 rhfhhfjtjjhafbrhfnfbbfnb
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
Computer network topology notes for revision
Quality review (1)_presentation of this 21
.pdf is not working space design for the following data for the following dat...
Mega Projects Data Mega Projects Data
Launch Your Data Science Career in Kochi – 2025
Miokarditis (Inflamasi pada Otot Jantung)
Galatica Smart Energy Infrastructure Startup Pitch Deck
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
Introduction-to-Cloud-ComputingFinal.pptx
iec ppt-1 pptx icmr ppt on rehabilitation.pptx
Acceptance and paychological effects of mandatory extra coach I classes.pptx

Data-Centric Metaprogramming by Vlad Ureche