DATA-CENTRIC
METAPROGRAMMING
Vlad Ureche
PhD in the Scala Team @ EPFL. Soon to graduate ;)
● Working on program transformations focusing on data representation
● Author of miniboxing, which improves generics performance by up to 20x
● Contributed to the Scala compiler and to the scaladoc tool.
@VladUreche
vlad.ureche@gmail.com
scala-miniboxing.org
Research ahead*
* This may not make it into a product,
but you can play with it nevertheless.
STOP
Please ask if things
are not clear!
Motivation
Transformation
Applications
Challenges
Conclusion
Spark
Motivation
Comparison graph from http://fr.slideshare.net/databricks/spark-summit-eu-2015-spark-dataframes-simple-and-fast-analysis-of-
structured-data and used with permission.
Motivation
Comparison graph from http://fr.slideshare.net/databricks/spark-summit-eu-2015-spark-dataframes-simple-and-fast-analysis-of-
structured-data and used with permission.
Performance gap between
RDDs and DataFrames
Motivation
RDD
● strongly typed
● slower
DataFrame
● dynamically typed
● faster
Dataset
● strongly typed
● faster, but only mid-way
Why just mid-way?
What can we do to speed them up?
Object Composition
class Vector[T] { … }
The Vector collection
in the Scala library
class Employee(...)
ID NAME SALARY
Corresponds to
a table row
Vector[Employee]
ID NAME SALARY
ID NAME SALARY
Traversal requires
dereferencing a pointer
for each employee.
A Better Representation
Vector[Employee]
ID NAME SALARY
ID NAME SALARY
EmployeeVector
ID ID ...
NAME NAME ...
SALARY SALARY ...
● more efficient heap usage
● faster iteration
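The struct-of-arrays layout sketched above can be written down in plain Scala. This is an illustrative sketch (in the actual system the EmployeeVector representation is produced by the transformation described later); the field and method names are hypothetical:

```scala
// Hypothetical sketch of the column-oriented layout: one array per
// field instead of one heap object per employee.
case class Employee(id: Int, name: String, salary: Float)

class EmployeeVector(
  val ids: Array[Int],
  val names: Array[String],
  val salaries: Array[Float]
) {
  def length: Int = ids.length
  // Materializes an Employee only on demand; iterating a single
  // column (e.g. salaries) touches no Employee objects at all.
  def apply(i: Int): Employee = Employee(ids(i), names(i), salaries(i))
  def salarySum: Float = salaries.sum
}

val v = new EmployeeVector(
  Array(1, 2), Array("Ann", "Bob"), Array(100f, 200f))
```

Summing the salary column walks one primitive array, with no pointer dereference per employee.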
The Problem
● Vector[T] is unaware of Employee
  – Which makes Vector[Employee] suboptimal
● Not limited to Vector, other classes also affected
  – Spark pain point: Functions/closures
  – We'd like a "structured" representation throughout
Challenge: No means of
communicating this
to the compiler
Choice: Safe or Fast
This is where my
work comes in...
Data-Centric Metaprogramming
● A compiler plug-in that allows
tuning the data representation
● Website: scala-ildl.org
Motivation
Transformation
Applications
Challenges
Conclusion
Spark
Transformation
Definition (by the programmer):
● can't be automated
● based on experience
● based on speculation
● one-time effort
Application (by the compiler, automated):
● repetitive and complex
● affects code readability
● is verbose
● is error-prone
Data-Centric Metaprogramming
object VectorOfEmployeeOpt extends Transformation {
  // What to transform? What to transform to?
  type Target = Vector[Employee]
  type Result = EmployeeVector

  // How to transform?
  def toResult(t: Target): Result = ...
  def toTarget(t: Result): Target = ...

  // How to run methods on the updated representation?
  def bypass_length: Int = ...
  def bypass_apply(i: Int): Employee = ...
  def bypass_update(i: Int, v: Employee) = ...
  def bypass_toString: String = ...
  ...
}
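The elided toResult/toTarget bodies could look like the following plain-Scala sketch. It is written outside the plugin, so it does not extend the actual Transformation trait; Employee and EmployeeVector are the illustrative definitions used throughout:

```scala
case class Employee(id: Int, name: String, salary: Float)
class EmployeeVector(val ids: Array[Int], val names: Array[String],
                     val salaries: Array[Float])

// Sketch of the two conversion directions a Transformation object
// must provide; the plugin invokes these at scope boundaries.
def toResult(t: Vector[Employee]): EmployeeVector =
  new EmployeeVector(
    t.map(_.id).toArray, t.map(_.name).toArray, t.map(_.salary).toArray)

def toTarget(r: EmployeeVector): Vector[Employee] =
  r.ids.indices.map(i =>
    Employee(r.ids(i), r.names(i), r.salaries(i))).toVector

// The conversions should round-trip without losing information.
val round = toTarget(toResult(Vector(Employee(1, "Ann", 100f))))
```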
http://infoscience.epfl.ch/record/207050?ln=en
Motivation
Transformation
Applications
Challenges
Conclusion
Spark
Open World
Best Representation?
Composition
Scenario
class Employee(...)
ID NAME SALARY
class Vector[T] { … }
Vector[Employee]
ID NAME SALARY
ID NAME SALARY
EmployeeVector
ID ID ...
NAME NAME ...
SALARY SALARY ...
class NewEmployee(...)
extends Employee(...)
ID NAME SALARY DEPT
Oooops...
Open World Assumption
● Globally anything can happen
● Locally you have full control:
  – Make class Employee final or
  – Limit the transformation to code that uses Employee
How? Using Scopes!
Scopes
transform(VectorOfEmployeeOpt) {
  def indexSalary(employees: Vector[Employee],
                  by: Float): Vector[Employee] =
    for (employee ← employees)
      yield employee.copy(
        salary = (1 + by) * employee.salary
      )
}
Now the method operates
on the EmployeeVector
representation.
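Conceptually, inside the scope the method body is rewritten to operate on the columnar data. A hand-written sketch of the kind of code that results (illustrative, not the plugin's actual output; EmployeeVector as before):

```scala
class EmployeeVector(val ids: Array[Int], val names: Array[String],
                     val salaries: Array[Float])

// Sketch of what indexSalary becomes after the transformation:
// no Employee objects are allocated, only the salary column changes.
def indexSalary(employees: EmployeeVector, by: Float): EmployeeVector =
  new EmployeeVector(
    employees.ids,
    employees.names,
    employees.salaries.map(s => (1 + by) * s))

val raised = indexSalary(
  new EmployeeVector(Array(1), Array("Ann"), Array(100f)), 0.1f)
```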
Scopes
● Can wrap statements, methods, even entire classes
  – Inlined immediately after the parser
  – Definitions are visible outside the "scope"
● Mark locally closed parts of the code
  – Incoming/outgoing values go through conversions
  – You can reject unexpected values
Motivation
Transformation
Applications
Challenges
Conclusion
Spark
Open World
Best Representation?
Composition
Best Representation?
It depends.
Vector[Employee]
ID NAME SALARY
ID NAME SALARY
EmployeeVector
ID ID ...
NAME NAME ...
SALARY SALARY ...
Tungsten repr.
<compressed binary blob>
EmployeeJSON
{
  id: 123,
  name: "John Doe",
  salary: 100
}
Scopes allow mixing data representations
transform(VectorOfEmployeeOpt) {
  def indexSalary(employees: Vector[Employee],
                  by: Float): Vector[Employee] =
    for (employee ← employees)
      yield employee.copy(
        salary = (1 + by) * employee.salary
      )
}
Operating on the
EmployeeVector
representation.
Scopes
transform(VectorOfEmployeeCompact) {
  def indexSalary(employees: Vector[Employee],
                  by: Float): Vector[Employee] =
    for (employee ← employees)
      yield employee.copy(
        salary = (1 + by) * employee.salary
      )
}
Operating on the
compact binary
representation.
Scopes
transform(VectorOfEmployeeJSON) {
  def indexSalary(employees: Vector[Employee],
                  by: Float): Vector[Employee] =
    for (employee ← employees)
      yield employee.copy(
        salary = (1 + by) * employee.salary
      )
}
Operating on the
JSON-based
representation.
Motivation
Transformation
Applications
Challenges
Conclusion
Spark
Open World
Best Representation?
Composition
Composition
● Code can be
  – Left untransformed (using the original representation)
  – Transformed using different representations
Calling:
● Original code calling original code:
easy one, do nothing.
● Transformed code calling code under the same transformation:
hard one, do not introduce any conversions,
even across separate compilation.
● Original code calling transformed code (or the other way around):
automatically introduce conversions between
values in the two representations,
e.g. EmployeeVector → Vector[Employee] or back.
● Transformed code calling code under a different transformation:
hard one, automatically introduce double
conversions (and warn the programmer),
e.g. EmployeeVector → Vector[Employee] → CompactEmpVector.
Composition also covers overriding:
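What the inserted conversions do at such a boundary can be pictured with a plain-Scala sketch (hypothetical names; in the real system the compiler emits the equivalent bridging code automatically):

```scala
case class Employee(id: Int, salary: Float)
class EmployeeVector(val ids: Array[Int], val salaries: Array[Float])

def toResult(t: Vector[Employee]): EmployeeVector =
  new EmployeeVector(t.map(_.id).toArray, t.map(_.salary).toArray)
def toTarget(r: EmployeeVector): Vector[Employee] =
  r.ids.indices.map(i => Employee(r.ids(i), r.salaries(i))).toVector

// Transformed code: works on EmployeeVector.
def totalSalary(es: EmployeeVector): Float = es.salaries.sum

// Original (untransformed) caller holds a Vector[Employee]; the
// compiler bridges the call by inserting toResult at the boundary.
val employees = Vector(Employee(1, 100f), Employee(2, 200f))
val total = totalSalary(toResult(employees))
```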
Scopes
trait Printer[T] {
  def print(elements: Vector[T]): Unit
}

class EmployeePrinter extends Printer[Employee] {
  def print(employee: Vector[Employee]) = ...
}
Method print in the class
implements method print in the trait.

transform(VectorOfEmployeeOpt) {
  class EmployeePrinter extends Printer[Employee] {
    def print(employee: Vector[Employee]) = ...
  }
}
The signature of method print changes
according to the transformation → it no
longer implements the trait.
Taken care of by the compiler for you!
Motivation
Transformation
Applications
Challenges
Conclusion
Spark
Open World
Best Representation?
Composition
Column-oriented Storage
Vector[Employee]
ID NAME SALARY
ID NAME SALARY
EmployeeVector
ID ID ...
NAME NAME ...
SALARY SALARY ...
Iteration is 5x faster.
Retrofitting value class status
(3,5) is a reference to a heap object: [Header | 3 | 5]
Tuples in Scala are specialized but
are still objects (not value classes)
= not as optimized as they could be.
Encoding (3,5) as a single long, (3L << 32) + 5:
14x faster, lower heap requirements.
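Spelled out, the encoding packs both Int fields of the tuple into one Long, so no heap object is needed at all. A small self-contained sketch:

```scala
// Pack a pair of Ints into a single Long: the high 32 bits hold the
// first element, the low 32 bits the second. No allocation involved.
def pack(a: Int, b: Int): Long =
  (a.toLong << 32) | (b.toLong & 0xFFFFFFFFL)

// Unpack the two fields with a shift and a truncation.
def fst(packed: Long): Int = (packed >> 32).toInt
def snd(packed: Long): Int = packed.toInt

val p = pack(3, 5)
```

The mask on the second element keeps negative values from clobbering the high half, and the arithmetic shift in `fst` restores the sign of the first element.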
Deforestation
List(1,2,3).map(_ + 1).map(_ * 2).sum
builds the intermediate lists List(2,3,4) and List(4,6,8)
before computing the result, 18.

transform(ListDeforestation) {
  List(1,2,3).map(_ + 1).map(_ * 2).sum
}
accumulates the functions instead of building lists,
then computes 18 in a single pass:
6x faster.
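The accumulate-then-compute idea can be sketched by composing the mapped functions and applying the composition once per element during the final fold (FusedList is an illustrative name, not the plugin's API):

```scala
// A fused pipeline: instead of building List(2,3,4) and List(4,6,8),
// map only composes functions; sum traverses the source once.
final case class FusedList[A](source: List[Int], f: Int => A) {
  def map[B](g: A => B): FusedList[B] = FusedList(source, f.andThen(g))
  def sum(implicit num: Numeric[A]): A =
    source.foldLeft(num.zero)((acc, x) => num.plus(acc, f(x)))
}

val result = FusedList(List(1, 2, 3), (x: Int) => x)
  .map(_ + 1).map(_ * 2).sum   // single traversal, no intermediate lists
```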
Motivation
Transformation
Applications
Challenges
Conclusion
Spark
Open World
Best Representation?
Composition
Spark
● Optimizations
  – DataFrames do deforestation
  – DataFrames do predicate push-down
  – DataFrames do code generation
  – RDDs do none of these: this is what makes them slower
  – Datasets do them as well
● Code is specialized for the data representation
● Functions are specialized for the data representation
User Functions
serialized data → encoded data → decode → user function f: X → Y → encode → encoded data
Decoding allocates an object, and so does encoding.
Modified user function
(automatically derived by the compiler):
runs directly on the encoded data,
skipping the decode and encode steps.
Nowhere near as simple as it looks.
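A toy version of the derivation: the naive path decodes into an object, applies f, then re-encodes, while the derived function computes the same result straight on the encoding. The packed-Long encoding and all names here are illustrative, not Spark's actual encoders:

```scala
// Encoded representation: a pair of Ints packed into one Long.
def encode(p: (Int, Int)): Long =
  (p._1.toLong << 32) | (p._2.toLong & 0xFFFFFFFFL)
def decode(e: Long): (Int, Int) = ((e >> 32).toInt, e.toInt)

// User function on objects.
val f: ((Int, Int)) => (Int, Int) = { case (a, b) => (a + 1, b * 2) }

// Naive path: decode, apply, encode; two allocations per element.
def naive(e: Long): Long = encode(f(decode(e)))

// Derived function: same result, straight on the encoding, no tuples.
def derived(e: Long): Long =
  (((e >> 32) + 1) << 32) | ((e.toInt * 2).toLong & 0xFFFFFFFFL)
```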
Challenge: Transformation not possible
● Example: Calling an outside (untransformed) method
● Solution: Issue compiler warnings
  – Explain why it's not possible: due to the method call
  – Suggest how to fix it: enclose the method in a scope
● Reuse the machinery in miniboxing
scala-miniboxing.org
Challenge: Internal API changes
● Spark internals rely on Iterator[T]
  – Requires materializing values
  – Needs to be replaced throughout the code base
  – By rather complex buffers
● Solution: Extensive refactoring/rewrite
Challenge: Automation
● Existing code should run out of the box
● Solution:
  – Adapt data-centric metaprogramming to Spark
  – Trade generality for simplicity
  – Do the right thing for most of the cases
Where are we now?
Where are we now?
Prototype Hack
● Modified version of Spark core
  – RDD data representation is configurable
● It's very limited:
  – Custom data repr. only in map, filter and flatMap
  – Otherwise we revert to costly objects
  – Large parts of the automation still need to be done
Prototype Hack
sc.parallelize(/* 1 million */ records).
  map(x => ...).
  filter(x => ...).
  collect()
Not yet 2x faster,
but 1.45x faster.
Motivation
Transformation
Applications
Challenges
Conclusion
Spark
Open World
Best Representation?
Composition
Conclusion
● Object-oriented composition → inefficient representation
● Solution: data-centric metaprogramming
  – Opaque data → Structured data
  – Is it possible? Yes.
  – Is it easy? Not really.
  – Is it worth it? You tell me!
Thank you!
Check out scala-ildl.org.
Deforestation and Language Semantics
● Notice that we changed language semantics:
  – Before: collections were eager
  – After: collections are lazy
  – This can lead to effects reordering
● Such transformations are only acceptable with
programmer consent
  – JIT compilers/staged DSLs can't change semantics
  – Metaprogramming (macros) can, but it should be
documented/opt-in
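The effects reordering is easy to observe with the standard library's lazy views, which fuse the two maps the same way deforestation does:

```scala
import scala.collection.mutable.ListBuffer

val log = ListBuffer[String]()

// Eager: the first map runs for ALL elements before the second starts.
List(1, 2).map { x => log += s"a$x"; x }.map { x => log += s"b$x"; x }
val eagerOrder = log.toList            // a1, a2, b1, b2

log.clear()

// Lazy (fused): both functions run per element, interleaved.
List(1, 2).view.map { x => log += s"a$x"; x }
          .map { x => log += s"b$x"; x }.toList
val lazyOrder = log.toList             // a1, b1, a2, b2
```

Any side effects in the mapped functions are observed in a different order, which is why the reordering needs the programmer's consent.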
Code Generation
● Also known as
  – Deep Embedding
  – Multi-Stage Programming
● Awesome speedups, but restricted to small DSLs
● SparkSQL uses code gen to improve performance
  – By 2-4x over Spark
Low-level Optimizers
● Java JIT Compiler
  – Access to the low-level code
  – Can assume a (local) closed world
  – Can speculate based on profiles
● Best optimizations break semantics
  – You can't do this in the JIT compiler!
  – Only the programmer can decide to break semantics
Scala Macros
● Many optimizations can be done with macros
  – :) Lots of power
  – :( Lots of responsibility:
    ● Scala compiler invariants
    ● Object-oriented model
    ● Modularity
● Can we restrict macros so they're safer?
  – Data-centric metaprogramming

Getting Ready to Use Redis with Apache Spark with Dvir Volk
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...

Recently uploaded (20)

PDF
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
PPTX
05. PRACTICAL GUIDE TO MICROSOFT EXCEL.pptx
PPTX
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
PPTX
Business Ppt On Nestle.pptx huunnnhhgfvu
PPTX
IB Computer Science - Internal Assessment.pptx
PPTX
Data_Analytics_and_PowerBI_Presentation.pptx
PPTX
mbdjdhjjodule 5-1 rhfhhfjtjjhafbrhfnfbbfnb
PDF
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
PDF
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
PPTX
Computer network topology notes for revision
PPT
Quality review (1)_presentation of this 21
PDF
.pdf is not working space design for the following data for the following dat...
PDF
Mega Projects Data Mega Projects Data
PDF
Launch Your Data Science Career in Kochi – 2025
PPT
Miokarditis (Inflamasi pada Otot Jantung)
PDF
Galatica Smart Energy Infrastructure Startup Pitch Deck
PPTX
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
PPTX
Introduction-to-Cloud-ComputingFinal.pptx
PPTX
iec ppt-1 pptx icmr ppt on rehabilitation.pptx
PPTX
Acceptance and paychological effects of mandatory extra coach I classes.pptx
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
05. PRACTICAL GUIDE TO MICROSOFT EXCEL.pptx
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
Business Ppt On Nestle.pptx huunnnhhgfvu
IB Computer Science - Internal Assessment.pptx
Data_Analytics_and_PowerBI_Presentation.pptx
mbdjdhjjodule 5-1 rhfhhfjtjjhafbrhfnfbbfnb
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
Computer network topology notes for revision
Quality review (1)_presentation of this 21
.pdf is not working space design for the following data for the following dat...
Mega Projects Data Mega Projects Data
Launch Your Data Science Career in Kochi – 2025
Miokarditis (Inflamasi pada Otot Jantung)
Galatica Smart Energy Infrastructure Startup Pitch Deck
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
Introduction-to-Cloud-ComputingFinal.pptx
iec ppt-1 pptx icmr ppt on rehabilitation.pptx
Acceptance and paychological effects of mandatory extra coach I classes.pptx

Data-Centric Metaprogramming by Vlad Ureche