Adi Polak
Spark UDFs are EviL,
Catalyst to the rEsCue!
Who am I
• Adi Polak
• Sr. Cloud Relation developer
• Former security researcher
• Majored in Machine Learning
• Tel Avivian
• BGU alumni
• Co-founder of FLIP
• Spark & Scala enthusiast
• Foodie
@adipolak
Adi Polak - Light up the Spark in Catalyst by avoiding UDFs - Codemotion Berlin 2018
Real-time analytics on Big Data
• Apache Spark with Scala
• Spark 2.3
• Catalyst optimization
• Spark custom UDFs
..OK
CATALYST
Fundamentals of Catalyst Optimizer
Tree: SUB(Attribute(x), SUB(some_func(1), some_func(2)))
Rules: rewrite the tree to SUB(Attribute(x), some_func(-1))
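A minimal sketch of the tree-plus-rules idea in plain Scala, with no Spark dependency. The names Literal, Sub, ToyCatalyst and foldConstants are hypothetical and chosen for illustration; Literal stands in for the slide's some_func(n) leaves, and real Catalyst trees and rules are considerably richer than this.

```scala
// Toy expression tree, loosely modelled on the diagram above.
sealed trait Expr
final case class Attribute(name: String) extends Expr
final case class Literal(value: Int) extends Expr // stand-in for some_func(n)
final case class Sub(left: Expr, right: Expr) extends Expr

object ToyCatalyst {
  // A rule is a partial function from a tree node to its replacement.
  type Rule = PartialFunction[Expr, Expr]

  // Fold a subtraction of two constants into a single constant.
  val foldConstants: Rule = {
    case Sub(Literal(a), Literal(b)) => Literal(a - b)
  }

  // Apply a rule bottom-up: rewrite the children first, then the node itself.
  def transformUp(expr: Expr, rule: Rule): Expr = {
    val withNewChildren = expr match {
      case Sub(l, r) => Sub(transformUp(l, rule), transformUp(r, rule))
      case leaf      => leaf
    }
    rule.applyOrElse(withNewChildren, identity[Expr])
  }

  def main(args: Array[String]): Unit = {
    val before = Sub(Attribute("x"), Sub(Literal(1), Literal(2)))
    println(transformUp(before, foldConstants)) // Sub(Attribute(x),Literal(-1))
  }
}
```

The rule only matches a subtraction whose two operands are both constants, so the outer SUB over Attribute(x) is left untouched.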
Spark SQL Execution Plan
Logical optimization –> Optimization rules
• Constant folding
• Predicate pushdown
• Projection pruning
• …
Physical Planning –> Planning strategies
Catalyst
Frontend Backend
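One way to watch an optimization rule such as constant folding at work is to compare the analyzed and optimized logical plans of a small query. The sketch below assumes a local SparkSession; the object name ConstantFoldingDemo and the column alias three are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

object ConstantFoldingDemo {
  // Adds a column whose value is the constant expression 1 + 2.
  def withConstant(df: DataFrame): DataFrame =
    df.select(col("id"), (lit(1) + lit(2)).as("three"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("constant-folding")
      .getOrCreate()

    val qe = withConstant(spark.range(3).toDF("id")).queryExecution
    println(qe.analyzed)      // logical plan before the optimization rules run
    println(qe.optimizedPlan) // logical plan after the optimization rules run
    spark.stop()
  }
}
```

In the analyzed plan the projection should still contain the expression 1 + 2, while the optimized plan is expected to carry the already computed constant.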
What is Spark Custom UDF
"Use the higher-level standard Column-based functions with
Dataset operators whenever possible before reverting to
using your own custom UDF functions since UDFs are a
blackbox for Spark and so it does not even try to optimize them."
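To make the quote concrete, here is a small sketch that computes the same column twice: once with a custom UDF and once with the built-in upper column function. The names UdfVsBuiltin, upperUdf, name and name_upper are hypothetical; the sketch assumes a local SparkSession.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf, upper}

object UdfVsBuiltin {
  // Hypothetical custom UDF: upper-cases a string, passing nulls through.
  val upperUdf = udf((s: String) => if (s == null) null else s.toUpperCase)

  // Same result, two routes: an opaque Scala closure vs. a Catalyst expression.
  def withUdf(df: DataFrame): DataFrame =
    df.withColumn("name_upper", upperUdf(col("name")))

  def withBuiltin(df: DataFrame): DataFrame =
    df.withColumn("name_upper", upper(col("name")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("udf-vs-builtin")
      .getOrCreate()
    import spark.implicits._

    val people = Seq("adi", "spark").toDF("name")
    withUdf(people).explain(true)     // the plan refers to the UDF as an opaque call
    withBuiltin(people).explain(true) // the plan shows an upper(...) expression
    spark.stop()
  }
}
```

Both versions return the same rows for this input; the difference is in what the optimizer can see when it inspects the plan.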
What do we lose when using a custom UDF?
• Constant folding
• Predicate pushdown
What can we do?
Use queryExecution & explain(true)
Use queryExecution & explain(true) API
My UDF
Register
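The slide's original code is not preserved in this text, so the following is only a sketch of what registering a UDF and inspecting its plans can look like. The UDF name plus_one, the view name numbers and the object name ExplainUdfDemo are hypothetical; the sketch assumes a local SparkSession.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ExplainUdfDemo {
  // Registers a hypothetical UDF and uses it in both the projection and the filter.
  def query(spark: SparkSession): DataFrame = {
    import spark.implicits._
    spark.udf.register("plus_one", (i: Int) => i + 1)
    Seq(1, 2, 3).toDF("n").createOrReplaceTempView("numbers")
    spark.sql("SELECT plus_one(n) AS m FROM numbers WHERE plus_one(n) > 2")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("explain-udf")
      .getOrCreate()

    val df = query(spark)

    // Prints the parsed, analyzed and optimized logical plans and the physical plan.
    df.explain(true)

    // The same stages are available programmatically through queryExecution.
    val qe = df.queryExecution
    println(qe.analyzed)
    println(qe.optimizedPlan)
    println(qe.executedPlan)
    spark.stop()
  }
}
```

Reading the optimized and physical plans side by side is a quick way to check whether the UDF call survived unchanged through every stage.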
Lost Push Down filter
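A sketch of how the lost pushdown can be observed, assuming a local SparkSession and a small Parquet file written to a temporary directory. The names PushdownDemo, greaterThanFive and the column n are hypothetical. The same predicate is applied once as a column expression and once hidden inside a UDF, and the physical plans are compared.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object PushdownDemo {
  // Hypothetical UDF that hides the comparison inside a Scala closure.
  val greaterThanFive = udf((n: Int) => n > 5)

  // Write a small Parquet file and read it back, so the scan has a
  // data source that filters could be pushed into.
  def parquetNumbers(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val path = Files.createTempDirectory("pushdown-demo").resolve("numbers").toString
    (1 to 10).toDF("n").write.parquet(path)
    spark.read.parquet(path)
  }

  def physicalPlan(df: DataFrame): String =
    df.queryExecution.executedPlan.toString

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("pushdown-demo")
      .getOrCreate()

    val numbers = parquetNumbers(spark)

    // Column expression: Catalyst can translate it into a data source filter.
    println(physicalPlan(numbers.filter(col("n") > 5)))

    // UDF: the predicate is opaque, so it stays in a Filter above the scan.
    println(physicalPlan(numbers.filter(greaterThanFive(col("n")))))
    spark.stop()
  }
}
```

In the printed plans, look at the PushedFilters entry of the Parquet scan: with the column expression it should list the comparison, while with the UDF it is expected to be empty.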
What can be done instead?
sql functions DataFrame API:
• Aggregate functions
• Collection functions
• Date time functions
• Math functions
• Non-aggregate functions
• Sorting functions
• String functions
• Window functions
sql functions Column API:
• Expression operations..
How can I find what functions are available?
array_contains, minute, round, rand, spark_partition_id, isin …
version
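As an illustration, several of the functions named above can be combined in a single select without any UDF. This is a sketch under assumptions: the events DataFrame and its columns country, tags, ts and score are hypothetical, as is the object name BuiltinFunctionsDemo. rand is left out to keep the output deterministic.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array_contains, col, minute, round, spark_partition_id}

object BuiltinFunctionsDemo {
  // Built-in functions and Column methods only, so every step stays visible to Catalyst.
  def enrich(events: DataFrame): DataFrame =
    events
      .filter(col("country").isin("IL", "DE")) // Column API
      .select(
        col("country"),
        array_contains(col("tags"), "spark").as("has_spark_tag"), // collection function
        minute(col("ts")).as("minute"),                           // date time function
        round(col("score"), 1).as("score_rounded"),               // math function
        spark_partition_id().as("partition")                      // non-aggregate function
      )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("builtin-functions")
      .getOrCreate()
    import spark.implicits._

    val events = Seq(
      ("IL", Seq("spark", "scala"), Timestamp.valueOf("2018-11-20 10:42:00"), 3.14159),
      ("US", Seq("spark"), Timestamp.valueOf("2018-11-20 11:00:00"), 1.0)
    ).toDF("country", "tags", "ts", "score")

    enrich(events).show()
    spark.stop()
  }
}
```

The full list of available functions for a given Spark version is in the API documentation of org.apache.spark.sql.functions and org.apache.spark.sql.Column.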
Can you show a complex example? Sure
Meh…
Using column functions ...
GREAT SUCCESS!
Takeaways
• Use UDFs as a last resort
• Always check yourself with dataFrame.explain(true)
THANK YOU
@adipolak
Adi.polak@Microsoft.com