Andre Mesarovic
Sr. Specialist Solutions Architect
23 November 2022
Databricks MLflow Object Relationships
● Databricks MLflow objects (runs, experiments, registered models and their
versions, notebooks) form a complex web of relationships.
● Objects live in different places: workspace objects, DBFS (cloud) and MySQL.
○ A run’s metadata lives in MySQL, its artifacts in cloud and its notebook in the workspace and/or git.
● Experiments have zero or more runs.
● Registered models have zero or more versions that point to a run’s MLflow model.
● Code that generated a run’s MLflow model:
○ MLflow runs have pointers to a notebook revision that generated the model.
○ Runs should also have pointers to the git version of the notebook that generated the model.
Overview
● Model is an overloaded term with three meanings:
○ Native model artifact - this is the lowest level and is simply the native flavor’s serialized format. For
sklearn it’s a pickle file; for Keras it’s a directory with files in TensorFlow’s native SavedModel format.
○ MLflow model - a wrapper around the native model artifact with metadata in the MLmodel file and
environment information in conda.yaml and requirements.txt files.
○ Registered model - a bucket of model versions. A model version contains one MLflow model that is
cached in the model repository. A version has the following links (expressed as tags):
■ run_id - points to the run that generated the version’s model.
■ source - points to the path of MLflow model in the run that corresponds to the version’s model.
■ workspace_uri - currently missing. Needed if using shared model registry. ML-19472.
Model terminology
Model relationships
● Runs
○ A run contains one or more MLflow models
● Experiments
○ Notebook experiments
○ Workspace experiments
● Registered models
○ A registered model contains versions
○ A version points to one run’s MLflow model
○ Native model artifacts - the actual bits that execute predictions, packaged as part of the MLflow model
● Notebooks
Databricks MLflow object relationships
● Diagram uses the UML modeling language.
○ *: indicates a many relationship
○ 1: indicates a required one relationship.
○ 0..1: indicates an optional one relationship.
● This is a logical diagram; some nuances are omitted for simplicity.
● The diagram represents a notebook experiment.
● A workspace experiment is not represented in the diagram.
Diagram legend
● A registered model is a bucket for model versions.
● A version has one MLflow model which is linked to the run that generated it.
● The Production and Staging stages each have one "latest" version.
● Registered model versions are cached in the model registry.
● The cached copy is a clone of the run's MLflow model that the version points to.
● If the source run is in a different workspace, we have a lineage reachability problem.
See ML-19472 - Add workspace URI field in ModelVersion for a registered
model to make the run reachable.
Registered models
● An experiment has zero or more runs.
● Two types of experiments:
○ Notebook experiment
■ Relationship of experiment to notebook is one-to-one.
■ Workspace path of the experiment is the same as its notebook.
○ Workspace experiment
■ Relationship of experiment to notebook is one-to-many.
■ Explicitly specify the experiment path with the set_experiment method.
■ Different notebooks can create runs in the same experiment.
Experiments
● A run belongs to only one experiment.
● A run is linked to one notebook revision. MLflow notebook tags:
○ mlflow.databricks.notebookRevisionID
○ mlflow.databricks.notebookID
○ mlflow.databricks.notebookPath
● Optionally, a run’s notebook can be linked to a git reference.
○ See discussion on Notebook below for details.
● A run can have one or more MLflow models (flavors) such as Sklearn and ONNX.
● Every run has a default Pyfunc flavor, which is a wrapper around the native model.
Runs
MLflow Run Details
● An MLflow run has three basic components:
○ Metadata (params, metrics, tags) residing in a MySQL database.
○ MLflow model artifact, which lives in DBFS (cloud). Note you can also log arbitrary custom
artifacts.
○ Link to code:
■ For Databricks, the run points to either:
● Workspace notebook revision
● A Repos notebook, which points to git.
■ For open-source MLflow, the link points to git.
MLflow Run Details Legend
● A notebook has many revisions.
● Optionally, a notebook revision can be checked into git with Databricks Repos.
● We need to capture a git reference analogous to the open-source MLflow tags:
○ mlflow.source.git.commit
○ mlflow.source.git.repoURL
○ mlflow.gitRepoURL
● See ML-19473 - Add git reference tags to a Databricks run if its notebook is synced with
Repos.
● There are two sources of truth for a notebook snapshot, which can be confusing:
○ Databricks notebook revision
○ Git version
Notebooks
Happy journey!