Adi Polak - Light up the Spark in Catalyst by avoiding UDFs - Codemotion Berlin 2018

Adi Polak
Spark UDFs are EviL,
Catalyst to the rEsCue!

Who am I
• Adi Polak
• Sr. Cloud Relation developer
• Former security researcher
• Majored in Machine Learning
• Tel Avivian
• BGU alumna
• Co-founder of FLIP
• Spark & Scala enthusiast
• Foodie
@adipolak

Real-time analytics on Big Data

• Apache Spark with Scala
• Spark 2.3
• Catalyst optimization
• Spark custom UDFs
..OK

Fundamentals of Catalyst Optimizer
Tree:
SUB
  Attribute(x)
  SUB
    some_func(1)
    some_func(2)
After applying a rule:
SUB
  Attribute(x)
  some_func(-1)
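The tree-plus-rule idea on this slide can be sketched in a few lines of plain Scala. This is a toy model, not Catalyst's real classes: `Expr`, `Literal`, `Attribute`, `Sub`, `transformUp` and `constantFolding` are hypothetical names chosen for illustration, and `Literal` stands in for the slide's `some_func` nodes. The rule is a partial function that matches a subtraction of two constants and replaces it with the computed constant.

```scala
object MiniCatalyst {
  // A toy expression tree (hypothetical types, not Spark's own classes).
  sealed trait Expr
  case class Literal(value: Int) extends Expr
  case class Attribute(name: String) extends Expr
  case class Sub(left: Expr, right: Expr) extends Expr

  // Rewrite children first, then try the rule on the rebuilt node.
  def transformUp(e: Expr)(rule: PartialFunction[Expr, Expr]): Expr = {
    val withNewChildren = e match {
      case Sub(l, r) => Sub(transformUp(l)(rule), transformUp(r)(rule))
      case leaf      => leaf
    }
    rule.applyOrElse(withNewChildren, (unchanged: Expr) => unchanged)
  }

  // Constant folding: a subtraction of two literals becomes one literal.
  val constantFolding: PartialFunction[Expr, Expr] = {
    case Sub(Literal(a), Literal(b)) => Literal(a - b)
  }

  def main(args: Array[String]): Unit = {
    val tree = Sub(Attribute("x"), Sub(Literal(1), Literal(2)))
    println(transformUp(tree)(constantFolding)) // Sub(Attribute(x),Literal(-1))
  }
}
```

The attribute `x` is left alone because its value is only known at run time; only the constant subtree is folded.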

Spark SQL Execution Plan
Logical optimization –> Optimization rules
• Constant folding
• Predicate pushdown
• Projection pruning
• …
Physical Planning –> Planning strategies
Catalyst
Frontend Backend

What is Spark Custom UDF
"Use the higher-level standard Column-based functions with
Dataset operators whenever possible before reverting to
using your own custom UDF functions since UDFs are a
blackbox for Spark and so it does not even try to optimize them."

What do we lose when using a custom UDF?
• Constant folding
• Predicate pushdown

Use queryExecution & explain(true)
Catalyst
Frontend Backend

Use queryExecution & explain(true) API
My UDF
Register

Use queryExecution & explain(true) API
My UDFs
Register
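The code on these slides did not survive extraction, so here is a minimal sketch of the workflow they name: register a UDF, then inspect the plans with `explain(true)` and `queryExecution`. The UDF name `my_upper`, the column `name` and the sample rows are assumptions made for illustration, not the speaker's code. It assumes Spark SQL is on the classpath and runs in local mode.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, upper}

object ExplainUdfSketch {
  // Hypothetical UDF, registered by name so SQL expressions can call it.
  def registerMyUpper(spark: SparkSession): UserDefinedFunction =
    spark.udf.register("my_upper", (s: String) => if (s == null) null else s.toUpperCase)

  // Filter through the registered UDF.
  def filterWithUdf(df: DataFrame): DataFrame =
    df.filter("my_upper(name) = 'ADI'")

  // The same filter through the built-in upper function.
  def filterWithBuiltin(df: DataFrame): DataFrame =
    df.filter(upper(col("name")) === "ADI")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("explain-udf-sketch").getOrCreate()
    import spark.implicits._
    registerMyUpper(spark)
    val people = Seq("adi", "bob").toDF("name")

    // Prints the parsed, analyzed and optimized logical plans and the physical plan.
    filterWithUdf(people).explain(true)
    filterWithBuiltin(people).explain(true)

    // The same plans are also available programmatically.
    println(filterWithUdf(people).queryExecution.optimizedPlan)
    spark.stop()
  }
}
```

Comparing the two printed plans side by side is the point: the UDF call shows up as an opaque node, while the built-in version is an expression Catalyst can reason about.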

What can be done instead?
sql functions DataFrame API:
Aggregate functions
Collection functions
Date time functions
Math functions
Non-aggregate functions
Sorting functions
String functions
Window functions
sql functions Column API
Expression operations..

How can I find what functions are available?
array_contains, minute, round, rand, spark_partition_id, isin …
version
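As a rough illustration, a few of the functions listed above can be combined in a single `select`. The functions come from `org.apache.spark.sql.functions`, and `isin` is a method on `Column`. The input columns (`id`, `ts`, `score`, `tags`) and the object name are assumptions made up for this sketch.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array_contains, col, minute, round}

object BuiltinSampler {
  // Expects hypothetical columns: id (int), ts (string), score (double), tags (array of strings).
  def describe(events: DataFrame): DataFrame =
    events.select(
      col("id"),
      minute(col("ts").cast("timestamp")).as("minute"),          // date time function
      round(col("score"), 1).as("score_rounded"),                // math function
      array_contains(col("tags"), "spark").as("has_spark_tag"),  // collection function
      col("id").isin(1, 3).as("is_selected")                     // Column API
    )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("builtin-sampler").getOrCreate()
    import spark.implicits._
    val events = Seq(
      (1, "2018-11-20 10:15:00", 2.34, Seq("spark", "scala")),
      (2, "2018-11-20 10:42:30", 7.0, Seq("java"))
    ).toDF("id", "ts", "score", "tags")
    describe(events).show()
    spark.stop()
  }
}
```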

Can you show a complex example? Sure
Meh…

Using column functions ...
GREAT SUCCESS!
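The speaker's example is not recoverable from the slides, so the following is a stand-in sketch of the same move: branching logic written once as a UDF and once with column functions (`when`, `otherwise`, `concat`, `upper`). The labelling rule, the column names `name` and `score`, and the object name are all hypothetical. The two versions are meant to agree on rows with a non-null `score`.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, concat, lit, udf, upper, when}

object ScoreLabel {
  // UDF version: Catalyst sees only an opaque function call.
  val labelUdf = udf((name: String, score: Double) =>
    if (name == null) "unknown"
    else name.toUpperCase + (if (score >= 5.0) "_HIGH" else "_LOW"))

  // Column-function version: the same branching as ordinary expressions.
  val labelColumn: Column =
    when(col("name").isNull, lit("unknown"))
      .otherwise(concat(
        upper(col("name")),
        when(col("score") >= 5.0, lit("_HIGH")).otherwise(lit("_LOW"))))

  def withUdf(df: DataFrame): DataFrame =
    df.withColumn("label", labelUdf(col("name"), col("score")))

  def withColumns(df: DataFrame): DataFrame =
    df.withColumn("label", labelColumn)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("score-label").getOrCreate()
    import spark.implicits._
    val df = Seq((Option("adi"), 7.5), (Option("bob"), 1.0), (Option.empty[String], 3.0))
      .toDF("name", "score")
    // Compare the optimized plans of the two variants.
    withUdf(df).explain(true)
    withColumns(df).explain(true)
    spark.stop()
  }
}
```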

Takeaways
• Use UDFs as a last resort
• Always check yourself with dataFrame.explain(true)

Reference
• www.kaggle.com
• http://bit.ly/adiuserguide
• http://bit.ly/whatisdatabricks
• http://bit.ly/databrickstutorial
• http://bit.ly/clitools

THANK YOU
@adipolak
Adi.polak@Microsoft.com
