Best Practices for
Development on
Azure Databricks
Dustin Vannoy
Kimberly Mahoney
#PASSDataSummit
Guiding Best Practices Through the Full
Developer Lifecycle on Databricks
• Version Control
• Writing Code
• Testing
• Getting to production
Agenda
Dustin Vannoy
Specialist Solutions Architect, Databricks
he/him
in/dustinvannoy
youtube.com/DustinVannoy
dustinvannoy.com
Kimberly Mahoney
Solutions Architect, Databricks
she/her
in/mahoneykimberly
kimberly.mahoney@databricks.com
The Developer Lifecycle
Inner Dev Loop
● Develop
● Test
● Debug
Outer Dev Loop
● Test
● Integrate
● Deploy
Why do we need best practices?
1. Improve reliability
2. Enhance collaboration and team efficiency
3. Greater flexibility to adapt to change
4. Faster releases
What is Databricks?
First….
Databricks Data Intelligence Platform
• ETL & Real-time Analytics, Orchestration, Data Warehousing, Data Science & AI (Delta Live Tables, Workflows, Databricks SQL, Mosaic AI)
• Unity Catalog: unified security, governance, and cataloging
• Delta Lake / UniForm: unified data storage for reliability and sharing
• DatabricksIQ: an AI-powered data intelligence engine that understands the semantics of your data
• Open Data Lake: all raw data (logs, text, audio, video, images)
Version Control
Why:
• Track from the beginning
• Backup & Recovery
• Quicker and cleaner iteration
• Improved collaboration
Best practice: Use Version Control
Without Version Control
With Version Control
(Diagram: each developer checks out their own branch, makes commits c1, c2, c3, and merges back only after peer review, so parallel work never overwrites shared code.)
Benefits of Version Control
• Track incremental changes
• Avoid conflicting edits
• Maintain a complete project history
• Easily revert to earlier versions
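As a minimal illustration of that loop (all file and commit names here are hypothetical), the checkout-commit-review cycle looks like this in plain git:

```shell
# Minimal sketch of the branch-and-commit loop (hypothetical names).
set -e
repo=$(mktemp -d)                        # throwaway repo for illustration
cd "$repo"
git init -q
git checkout -q -b my-feature            # isolate work on a branch
echo "print('hello')" > etl.py
git add etl.py
git -c user.email=dev@example.com -c user.name=dev \
    commit -q -m "c1: add etl script"    # incremental, reviewable change
git log --oneline                        # complete project history
```

On Databricks, the same flow runs through Git Folders in the workspace UI, with the branch merged back only after peer review.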
Databricks Git Folders
Integrate git repositories into workspace filesystem
Inspect, commit, push and pull
code changes. Create and work
with different branches.
New: resolve merge conflicts.
Git Folders' Git support lets users collaborate on the same code without interfering with each other's work.
Protected branches and pull
requests provide guardrails for
what code moves to production
Supported providers include GitHub, GitLab, Azure DevOps, and AWS CodeCommit.
Demo
Writing Code
Development Options (Code Editors)

Databricks-native
● Databricks Files and Notebooks
● Multi-language: R, Python, SQL, Scala
● Collaborative editing
● AI-assisted authoring

IDE Support
● Databricks VS Code Extension
● PyCharm plugin by JetBrains
Best practice: Write clean code
Clean Code - readable, modular, structured, simple
The why
• Improves readability and maintainability
• Simplifies Debugging and Testing
• Enhances Coding Assistant Suggestions
Best practice: Write clean code
Clean Code - readable, modular, structured, simple
The how
• Adopt a style guide
• Follow language conventions
• Start small
• Use meaningful names
• Break up long blocks of logic into smaller chunks
Clean and Readable Code: The Foundation of Good Software

Before:
• Poor naming
• Hardcoded values
• No comments
• No reusability

After:
• Meaningful names
• No hardcoded values
• Comments & documentation
• Reusable modules
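A small Python sketch of that contrast (a hypothetical example, not taken from the slides): the first version hides intent behind cryptic names and a hardcoded threshold; the second names things, documents itself, and is reusable.

```python
# Before: poor naming, hardcoded values, no comments, no reusability
def f(d):
    r = []
    for x in d:
        if x[1] > 100:
            r.append(x)
    return r

# After: meaningful names, no hardcoded values, documented, reusable
def filter_large_orders(orders, min_amount=100):
    """Return (order_id, amount) pairs whose amount exceeds min_amount."""
    return [order for order in orders if order[1] > min_amount]

orders = [("a1", 50), ("a2", 150)]
print(filter_large_orders(orders))  # [('a2', 150)]
```

Both functions compute the same result; only the second one tells the next reader (and a coding assistant) what that result means.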
Improving code quality

Syntax Highlighting
Automatically highlights potential issues in real time.

Databricks Assistant
Provides an interactive assistant to guide you through fixing identified errors.

Modularize Code
Import R and Python modules stored in workspace files alongside your Databricks notebooks. This protects against out-of-order variable references in notebooks.
And more…
• Debugger
• Code Formatter
• Autocomplete
• Variable Inspection
• Go to definition
Seamless Workflow
Integrates error detection,
correction, and testing into
a single process.
Testing
5 reasons for automated tests
1. Reduce repetition
2. Ensure quality early
3. Avoid fear of code changes
4. Clarify purpose of functions
5. Speed up development
The test pyramid
• End-to-end: test full pipeline (source to target)
• Integration: test one step in pipeline
• Unit: test a single function
Difference in test types

|                 | Unit Testing      | Integration Testing        | End-to-end                                |
|-----------------|-------------------|----------------------------|-------------------------------------------|
| Code / Resource | Python functions  | Notebook / Python script   | Databricks Workflow + external services   |
| Framework       | pytest / unittest | Workflow + validation code | Workflow + run job task + validation code |
| Test trigger    | Developer         | CI pipeline or Developer   | Release pipeline                          |
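A minimal unit test in the pytest style (the function and values are hypothetical); the same file runs under `pytest` locally or in a CI pipeline:

```python
# A small, pure transformation function is the easiest thing to unit test.
def standardize_email(email: str) -> str:
    """Trim whitespace and lowercase an email address."""
    return email.strip().lower()

# pytest discovers any function named test_*; plain asserts are enough.
def test_standardize_email():
    assert standardize_email("  Ada@Example.COM ") == "ada@example.com"
    assert standardize_email("x@y.z") == "x@y.z"

if __name__ == "__main__":
    test_standardize_email()
    print("all tests passed")
```

Keeping transformations in importable functions like this, rather than inline notebook cells, is what makes the unit-testing column of the table above practical.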
Demo
Getting to Production
Best practice: Automate and Orchestrate
Workflows
• Connect dependent tasks
• Automate Error Handling and Retries
• Monitor performance
• Trigger on time-based and event-based schedules
Unified orchestration for your data projects
• Orchestrate many different task types
• Flexible triggers and scheduling
• Built in alerting and monitoring
• Support for parameters
Enter Databricks Workflows
Example Workflow
Workflow Task Types
Many task types to choose from,
all focused around running on
Databricks.
If you can do it on Databricks,
you can do it with a workflow.
Is that it?
What about these other best practices..
• Version Workflow Configurations
• Track configurations to maintain consistency.
• Automate Workflow Deployments
• Streamline updates and reduce manual effort.
• Run Tests in the Outer Dev Loop
• Validate workflows in production-like conditions.
• Separate Environments for Deployments
• Use dev, staging, and prod to ensure stability.
MAPPING TO DATABRICKS
● Databricks Notebooks
● Libraries (e.g. Python Wheels)
● SQL files
● JARs
● …
Code
“I swear this cell was working
yesterday… Did someone change it?”
MAPPING TO DATABRICKS
● Databricks Workflows
● Delta Live Table Pipelines
● Model Serving Endpoints
● Lakehouse Monitoring
● …
Resource configuration
“The refresh job failed this morning…
Was it updated to use the most recent
set of expected parameters?”
MAPPING TO DATABRICKS
● Databricks Workspaces
● Unity Catalog (e.g. catalogs)
● Execution identity
○ User or service principal
Environment separation
“We need a larger cluster in prod, who
has access to update it?”
<catalog>.<schema>.<table> (e.g., separate dev and prod catalogs)
FROM DEVELOPMENT TO PRODUCTION
(Diagram: a my-feature branch merges to main via pull request, then to a release branch; version control drives a deploy at each stage: Development to a Dev Catalog, Staging to a Staging Catalog, and Production to a Prod Catalog.)
PROPERTIES WE’RE LOOKING FOR
● Single source of truth for code and resource configuration
● Define resources using an easy to understand format (e.g., YAML)
● Specialize resources based on target environment (dev/staging/prod)
● Ensure deployment isolation
● Easily automate deployment (CI/CD)
Enter Databricks Asset Bundles
Databricks Asset Bundles
What?
YAML files that specify the artifacts, resources, and
configurations of a Databricks project - databricks.yml
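A sketch of what such a `databricks.yml` can look like (the project name, job, notebook paths, and workspace hosts are hypothetical placeholders; consult the bundle documentation for the full schema):

```yaml
bundle:
  name: my_project

targets:
  dev:
    mode: development
    workspace:
      host: https://adb-1111111111111111.1.azuredatabricks.net
  prod:
    mode: production
    workspace:
      host: https://adb-2222222222222222.2.azuredatabricks.net

resources:
  jobs:
    nightly_etl:
      name: nightly-etl
      tasks:
        - task_key: ingest
          notebook_task:
            notebook_path: ./notebooks/ingest
        - task_key: transform
          depends_on:
            - task_key: ingest
          notebook_task:
            notebook_path: ./notebooks/transform
```

One definition covers both the code artifacts and the workflow configuration, with per-target overrides giving the dev/staging/prod separation described earlier.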
Databricks Asset Bundles
How?
The bundle subcommand of the Databricks CLI lets you interact with this definition: validate it (`databricks bundle validate`), deploy the bundle (`databricks bundle deploy`), or run resources defined in it (`databricks bundle run`).
Databricks Asset Bundles
Where?
These commands are executed client-side, so effectively anywhere: you can develop on your laptop and/or run automated production deployments from CI/CD systems such as GitHub Actions or Azure DevOps.
Demo
Summary
• Use version control + peer review
• Write clean, readable code
• Test your code thoroughly with automated tests
• Automate workflows & deployments
Summary
• Databricks Git Folders
• Rich native editor and support for IDEs
• Build out testing and deployment pipelines with
Databricks Asset Bundles
Try it out yourself
Evaluate this session at:
www.PASSDataCommunitySummit.com/evaluation
Your feedback is
important to us
Thank you
● youtube.com/DustinVannoy
Develop and Deploy Code Easily With IDEs
● How to Get the Most Out of Databricks Notebooks
● Databricks Asset Bundles: A Unifying Tool for Deployment on
Databricks
● Best Practices for Unit Testing PySpark
To learn more…
Resources + Repo at dustinvannoy.com