COST EFFICIENT ALTERNATIVE TO DATABRICKS
Georg Heiler
Exploring Alternatives for Cost-Effective and Flexible Data Pipelines → bit.ly/efficient-spark
Data expert
Academia & Industry (telco)
Specialties: data architecture, multimodal and complex data challenges
Thought leader
Meetup organizer & speaker
• Rising importance of understanding and shaping supply chains (COVID, Ukraine war)
• No fine-grained clean data accessible
• Abundant un- and semi-structured data → sophisticated cleaning & parsing required
• Extract and classify links based on semantic context
Results at a glance
• 43% cost reduction
• Software engineering practices
• Future-proof flexibility
• Single pane of glass for pipelines
History
• Mainframe
• Data warehouse
• Big Data (Hadoop)
• SQL on large data (Hive, Spark)
• Cloud DWH (Snowflake, BigQuery)
PaaS offering
PaaS Solution Comparison
Databricks (DBR)
• Easy to use
• Can be expensive
• Lock-in features (permissions, catalog)
• Proprietary Photon engine
AWS Elastic MapReduce (EMR)
• Price efficient
• Many tuning knobs available (& required)
• OSS Spark, managed (scaled)
Challenges
• Runaway expenses (usage-based pricing)
• Missing software engineering best practices (notebooks)
• Developer productivity reduced
• Vendor lock-in
Vision
• 0-cost switch
• Software engineering practices
• Cost & lock-in reduction
[Diagram: orchestrator (Dagster) on top of three interchangeable runtimes: local, remote DBR, remote EMR]
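The "0-cost switch" between runtimes can be pictured as a small interface that the pipeline code targets, so that only the runtime client changes. The following is a minimal, stdlib-only sketch under that assumption. `Runtime`, `LocalRuntime`, `RemoteRuntime`, `count_rows`, and `run_step` are hypothetical names chosen for illustration; they are not part of Dagster, Databricks, or EMR, and the remote runtime only records what it would submit.

```python
from dataclasses import dataclass, field
from typing import Callable, Protocol


class Runtime(Protocol):
    """Hypothetical interface the pipeline code targets."""

    name: str

    def submit(self, step: Callable[[dict], dict], params: dict) -> dict: ...


@dataclass
class LocalRuntime:
    name: str = "local"

    def submit(self, step, params):
        # Run the step in-process, e.g. for tests or small data.
        return step(params)


@dataclass
class RemoteRuntime:
    """Stand-in for a remote DBR or EMR client; it only records the request."""

    name: str
    submitted: list = field(default_factory=list)

    def submit(self, step, params):
        # A real client would package the code and call the platform API here.
        self.submitted.append((step.__name__, params))
        return {"status": "submitted", "runtime": self.name}


def count_rows(params: dict) -> dict:
    # Example step: the business logic does not know where it runs.
    return {"status": "done", "rows": len(params["rows"])}


def run_step(runtime: Runtime, params: dict) -> dict:
    # Identical call for every runtime; only the client object differs.
    return runtime.submit(count_rows, params)


RUNTIMES = {
    "local": LocalRuntime(),
    "dbr": RemoteRuntime(name="dbr"),
    "emr": RemoteRuntime(name="emr"),
}
```

Switching the platform then amounts to picking another key, for example `run_step(RUNTIMES["emr"], params)` instead of `run_step(RUNTIMES["dbr"], params)`.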
Spark at a glance
Dagster introduction
✗ No distributed monolith of CRON strings
→ Asset-aware, event-based orchestration
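To illustrate what "asset-aware" means in contrast to independent CRON schedules: each asset declares its upstream assets, and both the run order and the set of assets to refresh after a change are derived from that graph. The sketch below uses only Python's standard-library `graphlib`; it is not Dagster's API, and the asset names are made up, loosely following the link extraction use case mentioned earlier.

```python
from graphlib import TopologicalSorter

# Hypothetical assets: each key lists the upstream assets it depends on.
ASSET_DEPS = {
    "raw_pages": set(),
    "cleaned_pages": {"raw_pages"},
    "extracted_links": {"cleaned_pages"},
    "classified_links": {"extracted_links"},
}


def materialization_order(deps: dict[str, set[str]]) -> list[str]:
    """Return an order in which every asset runs after its upstream assets."""
    return list(TopologicalSorter(deps).static_order())


def downstream_of(changed: str, deps: dict[str, set[str]]) -> set[str]:
    """Assets that need a refresh once `changed` has been updated."""
    stale: set[str] = set()
    frontier = {changed}
    while frontier:
        # Assets that directly depend on anything in the current frontier.
        nxt = {a for a, ups in deps.items() if ups & frontier} - stale
        stale |= nxt
        frontier = nxt
    return stale
```

With CRON strings, the same dependency would only be implied by start times; here an update of `cleaned_pages` directly yields the two downstream assets that have to be recomputed.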
Observed challenges
• Remote execution
• Parameter injection
• Logging
• Opaque SaaS tools
• Single pane of glass
• Dependency bootstrap
• Missing testability in notebooks
• Large-scale compute & orchestrator-native development
[Diagram: orchestrator (Dagster) with three runtimes: local, remote DBR, remote EMR]
Dagster-pipes
Dagster-pipes - Architecture
Dagster-pipes - Sample
• External code (with metadata)
• Internal asset shim orchestrating the execution of the external script
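The division of labour between external code and the asset shim can be imitated with the standard library alone. This is a hedged sketch of the general pattern (parameter injection in, logs and metadata out), not the dagster-pipes API: the environment variable `PIPELINE_PARAMS`, the JSON message format, and `run_external` are assumptions made for this illustration.

```python
import json
import os
import subprocess
import sys

# External code: runs in its own process and knows nothing about the orchestrator
# except one environment variable (parameters in) and stdout (messages out).
EXTERNAL_SCRIPT = r"""
import json, os

params = json.loads(os.environ["PIPELINE_PARAMS"])
rows = list(range(params["n_rows"]))
print(json.dumps({"type": "log", "message": f"processing partition {params['partition']}"}))
print(json.dumps({"type": "metadata", "row_count": len(rows)}))
"""


def run_external(params: dict) -> dict:
    """Asset-side shim: inject parameters, launch the script, collect its messages."""
    env = {**os.environ, "PIPELINE_PARAMS": json.dumps(params)}
    proc = subprocess.run(
        [sys.executable, "-c", EXTERNAL_SCRIPT],
        env=env,
        capture_output=True,
        text=True,
        check=True,  # a failing external script fails the shim as well
    )
    logs, metadata = [], {}
    for line in proc.stdout.splitlines():
        msg = json.loads(line)
        if msg["type"] == "log":
            logs.append(msg["message"])
        elif msg["type"] == "metadata":
            metadata.update({k: v for k, v in msg.items() if k != "type"})
    return {"logs": logs, "metadata": metadata}
```

Calling `run_external({"partition": "2024-01", "n_rows": 5})` returns the log lines and the reported metadata to the caller, which is what keeps the external script observable from the orchestrator side.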
Results & Demo
Demo: youtube.com/watch?v=W27C5LpdEkE
Partitioned UI
Implementation time of DBR is lower
Implementation complexity of DBR is lower: more, and more frequent, commits for the EMR integration
Median cost of DBR is higher than EMR
Variability of execution time of DBR is lower
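The two comparisons above (median cost and variability of execution time) can be computed from per-run measurements as sketched here. The slides do not state how the figures were derived, so the formulas are assumptions (relative saving based on medians, variability as coefficient of variation), and the numbers used in the usage note are invented for illustration only.

```python
from statistics import mean, median, pstdev


def relative_saving(baseline_costs: list[float], alternative_costs: list[float]) -> float:
    """Share of the baseline's median cost saved by the alternative."""
    return 1 - median(alternative_costs) / median(baseline_costs)


def coefficient_of_variation(durations: list[float]) -> float:
    """Population standard deviation relative to the mean (unitless)."""
    return pstdev(durations) / mean(durations)
```

For made-up per-run costs `[100.0, 110.0, 90.0]` (baseline) and `[60.0, 50.0, 70.0]` (alternative), `relative_saving` compares the medians 100 and 60. A lower `coefficient_of_variation` for one platform's run times means its execution time is more predictable.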
Implementation lessons
• Complexity of AWS EMR: many low-level details about AWS, spot instances, and networking are required (master on spot instance => 💥💥)
• Abstracting the PaaS platforms requires a deep understanding of their APIs
Tips
• maximizeResourceAllocation
• LZO
• Delta zorder on partition
• spark.databricks.delta.vacuum.parallelDelete.enabled=true
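Some of these tips can be written down as configuration. The sketch below is an illustration, not the speaker's setup: `maximizeResourceAllocation` is shown as a property of EMR's `spark` configuration classification, the vacuum key is copied from the slide, and the Z-order helper assumes the tip refers to Delta's `OPTIMIZE ... WHERE ... ZORDER BY` statement restricted to one partition. Table, column, and filter values are hypothetical, and LZO is left out because the slide does not say where it is applied.

```python
from typing import Optional

# EMR: configuration list handed over when the cluster is created.
EMR_CONFIGURATIONS = [
    {
        "Classification": "spark",
        "Properties": {"maximizeResourceAllocation": "true"},
    }
]

# Databricks: Spark setting named on the slide.
DATABRICKS_SPARK_CONF = {
    "spark.databricks.delta.vacuum.parallelDelete.enabled": "true",
}


def zorder_statement(table: str, column: str, partition_filter: Optional[str] = None) -> str:
    """Build a Delta OPTIMIZE statement, optionally limited to one partition."""
    where = f" WHERE {partition_filter}" if partition_filter else ""
    return f"OPTIMIZE {table}{where} ZORDER BY ({column})"
```

For example, `zorder_statement("events", "event_type", "date = '2024-01-01'")` produces a statement that rewrites only the files of that one partition.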
Summary
• Money saved – 43%
• Bring back software engineering best practices for data
• Flexibility
• Data PaaS as a commodity
• Take back control
• Best in breed
• Single pane of glass for pipelines
Takeaway – if you have a small data problem
• Pipes allows you to quickly bring in existing scripts whilst retaining observability
• High-code engineering practices scale well
• Full control
• Compute technology can easily be changed (e.g. duckdb, daft, …)
data-engineering.expert/2023/12/11/dagster-dbt-duckdb-as-new-local-mds
COST EFFICIENCY FOR DATA
Georg Heiler
bit.ly/efficient-spark
data-engineering.expert/2024/06/21/cost-efficient-alternative-to-databricks-lock-in
arxiv.org/abs/2408.11635
github.com/ascii-supply-networks/ascii-hydra/tree/main/src/pipelines/ascii_library_demo

[DSC DACH 24] Cost efficient alternative to databricks lock-in - Georg Heiler

Editor's Notes

  • #1: Mention the talk modality: interactive, ask questions during the talk.
  • #2: Supply Chain, Text analytics & data architecture & pipelines, graphs, spatial time series
  • #4: Physical goods, software; 400 TiB (Commoncrawl, AIS, satellite, OSINT, …)
  • #5: Easy GPU/accelerator prototyping. AWS is migrating to daft (https://www.getdaft.io/, https://aws.amazon.com/de/blogs/opensource/amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-amazon-ec2/). As we are in control of how the individual steps in the pipeline relate to each other, we can relatively easily switch out the compute framework (like AWS did).
  • #9: PaaS solutions offer big benefits, such as easy scalability. However: they become the single centerpiece of the data engineering strategy, not just an implementation detail; runaway expenses due to usage-based pricing; high CI costs for spinning up resources; simplicity is prioritized over best practices; all-notebook environments (limited code reuse, limited testability, limited VCS integration); developer productivity hampered by VM spin-up times; dependence on a single central platform.
  • #10: PaaS as an implementation detail. Containerization, CI/CD, testability, flexibility.
  • #11: Image: https://intellipaat.com/blog/tutorial/spark-tutorial/spark-architecture/
  • #13: Single pane of glass for operative monitoring of opaque SaaS tools. It allows us to apply standard software best practices like testing, modularity, DRY, ... and maintain these in the SaaS tools. It even allows us to abstract SaaS vendors (Databricks vs. EMR) and substitute one for the other to save money on large-scale pipelines where we just need compute without wanting to pay for all the enterprise extra features. 1) bootstrap of the remote execution environments 2) centralized logging 3) single/simple start/stop of all pipelines in dagster 4) integration of upstream/downstream pipeline steps in one place.
  • #20: The volume of trial runs required to achieve stability on EMR is high. It shows the complex setup and optimization demands of these platforms. Yet, once set up, it proves hugely beneficial for us.
  • #21: EMR was labor-intensive. This is shown by the higher number of trials, both failed and successful, that happened before the product was ready. In fact, we required almost twice as many trial runs for EMR as for Databricks. The increase was mainly due to the complexity of setting up EMR systems that handle large datasets well and safely. This setup needs lots of customization and tuning, whereas Databricks provided these features out of the box. EMR also demanded more frequent, and sometimes extensive, code changes. This reflected a steeper learning curve and higher complexity, but it was in exchange for lower costs.
  • #24: Teams need to learn the specifics of each platform. This includes API differences, setup quirks, and best practices. This learning curve can delay initial deployment and require additional training.
  • #25: For small-ish to medium workloads, EMR works fine out of the box, so we can save money (cheaper compute). For very large workloads, we use EMR with fine-tuning, which lets us save money with cheaper compute. For special workloads, Photon is very profitable; there we again use Databricks to save money and time.
  • #26: https://aws.amazon.com/de/blogs/opensource/amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-amazon-ec2/
  • #27: Mention the talk modality: interactive, ask questions during the talk.