MLOps pipelines using MLFlow - From training to production
Dr. Andreas Weiden, skillbyte
CAIML#24, April 13th, 2023
Problem Description
Team A, a (majority) Data Scientist team, creates many machine learning models
Team B, a (majority) Data Engineer team, needs to deploy these models into production and use them for a recommender system
So two problems:
1. Technical
Need to deploy these models to multiple targets (→ this talk)
2. Organizational
Need to make two teams work together (→ not this talk)
MLOps
Got its name from DevOps and GitOps
Continuous training and deployment for machine learning systems
ml-ops.org
Pipelines
Created and maintained by Team A
Daily training runs, since fresh data is constantly coming in
Output various artifacts which are needed by the prediction services, run by Team B
Examples of what the Data Science Magic™ can be:
popularity of items
user embeddings from user-item interactions
item embeddings from item descriptions
Need somewhere to store the outputs of those pipelines
And deploy them, too
MLFlow
Manage the end-to-end machine learning lifecycle
Open source: GitHub
Four pillars:
Tracking
Log parameters, code versions, metrics, artifacts
Projects
Models
Registry
Basic unit is a Run
Whenever your pipeline runs, a new Run is created
Runs can be grouped under Experiments
You can add arbitrary data to a Run as well as the output artifacts
import mlflow
import datetime, numpy as np, pickle, random
from tempfile import TemporaryDirectory

mlflow.set_experiment("Pipeline A")
run_name = f"Pipeline A {datetime.datetime.now().isoformat()}"
tags = {"version": "0.0.1"}

with mlflow.start_run(run_name=run_name, tags=tags) as run:
    # Log hyperparameters and evaluation metrics for this run
    mlflow.log_param("ndims", 1024)
    mlflow.log_metric("recall", random.random())
    # Attach arbitrary output artifacts (here: dummy embeddings) to the run
    with TemporaryDirectory() as temp_dir:
        with open(f"{temp_dir}/out.pickle", "wb") as f:
            pickle.dump([np.random.rand(1024) for _ in range(100)], f)
        mlflow.log_artifacts(temp_dir)
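The prediction side can later read those runs back through the same tracking API. A minimal sketch (the experiment name matches the example above; error handling omitted):

import mlflow

# Find the most recent run of the experiment and fetch its artifacts
runs = mlflow.search_runs(experiment_names=["Pipeline A"],
                          order_by=["start_time DESC"], max_results=1)
run_id = runs.iloc[0]["run_id"]
local_dir = mlflow.artifacts.download_artifacts(run_id=run_id)  # contains out.pickle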
Projects
Package Data Science code including dependencies
Git and containerization already do this, if you lock your dependencies (which you should)
Models
Package ML models and deploy them
Containers and/or model artifacts
A standard format for packaging machine learning models that can be used in a variety of downstream tools, e.g.
real-time serving through a REST API
batch inference on Apache Spark
Saves model-specific data and environment data:
# Directory written by mlflow.sklearn.save_model(model, "my_model")
my_model/
├── MLmodel
├── model.pkl
├── conda.yaml
├── python_env.yaml
└── requirements.txt
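To make this concrete, here is a minimal sketch of producing and consuming such a model directory with the standard MLFlow model APIs (the scikit-learn model is just a placeholder, not the one from the talk):

import mlflow
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

# Writes the my_model/ directory shown above (MLmodel, model.pkl, env files)
mlflow.sklearn.save_model(model, "my_model")

# Downstream tools reload it through the generic pyfunc interface
loaded = mlflow.pyfunc.load_model("my_model")
print(loaded.predict(X[:5]))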
Registry
Model storage and lifecycle (versioning, stage transitions)
Can associate runs with a Model:
import mlflow

with mlflow.start_run() as run:
    ...  # train and log the model under the artifact path "model"
    # Register the logged model as a new version of "Model A" in the registry
    mlflow.register_model(f"runs:/{run.info.run_id}/model", "Model A")
Sounds interesting, but only very rudimentary
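For reference, such a stage transition looks roughly like this through the client API (a sketch; the model name and version number are placeholders):

from mlflow import MlflowClient

client = MlflowClient()
# Promote version 3 of "Model A" to Production, archiving the previous one
client.transition_model_version_stage(
    name="Model A", version=3, stage="Production",
    archive_existing_versions=True,
)
# Consumers can then resolve the current Production version
for mv in client.get_latest_versions("Model A", stages=["Production"]):
    print(mv.version, mv.run_id)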
Deployment options
ml-ops.org
Allow deploying arbitrary machine learning artifacts
Full control of the API
Not so fast…
… managed vector stores are also a thing
They give you e.g.
pre-filtering (see the query sketch below)
a full-blown query syntax
Quite a few options exist nowadays
Elasticsearch
Google Vertex AI
RediSearch
Milvus
…
→ Need a generic way to deploy machine learning artifacts to multiple targets
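As an illustration of such pre-filtering, a filtered vector query against Elasticsearch might look like this (a sketch assuming an Elasticsearch 8.x index with the fields used later in this deck, via the official Python client):

from elasticsearch import Elasticsearch

es = Elasticsearch("https://my-cluster:9200")  # placeholder URL
resp = es.search(
    index="items",  # placeholder index name
    knn={
        "field": "vector",
        "query_vector": [1, 2, 3],
        "k": 10,
        "num_candidates": 100,
        # Pre-filter: only consider documents of a given type
        "filter": {"term": {"type": "bar"}},
    },
)
print([hit["_id"] for hit in resp["hits"]["hits"]])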
Watcher
Simple microservice that periodically polls the MLFlow registry
Pushes the updated artifacts to all ML deployment options
Uploads embeddings to managed databases
Updates ConfigMap definitions with the correct Run ID per model and environment
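A minimal sketch of what such a watcher loop can look like (the model name, polling interval and push_to_targets helper are illustrative assumptions, not the actual service):

import time
import mlflow
from mlflow import MlflowClient

def push_to_targets(run_id, local_dir):
    # Hypothetical helper: upload embeddings to Elasticsearch / Vertex AI
    # and patch the Kubernetes ConfigMaps with the new run ID
    ...

client = MlflowClient()
last_seen = None

while True:
    # Ask the registry for the newest Production version of the model
    versions = client.get_latest_versions("Model A", stages=["Production"])
    if versions and versions[0].run_id != last_seen:
        run_id = versions[0].run_id
        # Download the run's artifacts (embeddings etc.) to a local directory
        local_dir = mlflow.artifacts.download_artifacts(run_id=run_id)
        push_to_targets(run_id, local_dir)
        last_seen = run_id
    time.sleep(300)  # poll every five minutes (interval is an assumption)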
Deployment targets
Elasticsearch, Google Vertex AI, Kubernetes ConfigMap
POST /${ES_INDEX}/_doc/ HTTP/1.1
Host: ${ES_URL}
Content-Type: application/json

{
  "id": "foo",
  "vector": [1, 2, 3],
  "popularity": 123,
  "type": "bar"
}
POST /v1/${INDEX_URL}:upsertDatapoints HTTP/1.1
Host: ${VERTEX_ENDPOINT}
Content-Type: application/json
Authorization: Bearer `gcloud auth print-access-token`

{
  "datapoints": [
    {
      "datapoint_id": "foo",
      "feature_vector": [1, 2, 3],
      "restricts": [
        {
          "namespace": "type",
          "allow_list": ["bar"]
        }
      ]
    }
  ]
}
apiVersion: v1
kind: ConfigMap
metadata:
  name: foo-configmap
data:
  RUN_ID: "1234abcde"
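One way the watcher could write that ConfigMap entry is the official Kubernetes Python client (an illustrative sketch, not the actual service code; namespace and names are placeholders):

from kubernetes import client, config

config.load_incluster_config()  # use load_kube_config() outside the cluster
v1 = client.CoreV1Api()

# Patch only the RUN_ID key; the Reloader (next slide) restarts consumers
v1.patch_namespaced_config_map(
    name="foo-configmap",
    namespace="default",
    body={"data": {"RUN_ID": "1234abcde"}},
)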
Configmap reloader
https://github.com/stakater/Reloader
Ensures that all deployments that rely on a ConfigMap get restarted whenever that ConfigMap changes
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    configmap.reloader.stakater.com/reload: "foo-configmap"
spec:
  template:
    spec:
      containers:
        - name: foo
          image: foo:0.0.1
          env:
            - name: MLFLOW_RUN_ID
              valueFrom:
                configMapKeyRef:
                  name: foo-configmap
                  key: RUN_ID
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: foo-configmap
data:
  RUN_ID: "1234abcde"
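On the consuming side, the service can then resolve its artifacts from that environment variable at startup, roughly like this (a sketch; the artifact name reuses the tracking example above and is an assumption):

import os
import pickle
import mlflow

# The Reloader restarts the pod whenever the ConfigMap (and thus RUN_ID) changes
run_id = os.environ["MLFLOW_RUN_ID"]
local_dir = mlflow.artifacts.download_artifacts(run_id=run_id)

with open(f"{local_dir}/out.pickle", "rb") as f:
    embeddings = pickle.load(f)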
Alternatives
Possible alternatives that we considered:
Model deployment directly through MLFlow
Seldon Core
AWS SageMaker
(Got more? Let me know!)
However, they all have the same drawbacks:
No control over final images
Image size not optimised
Custom logging, metrics, tracing, … difficult
No control over API
Only deploy to REST APIs, but we also want other targets
Summary
Embeddings and models are centrally produced → Need some central model storage
Each of the targets supports training and/or deploying ML models individually, but none of them supports doing so for all targets
→ If your needs are diverse enough, you may need to roll your own (ML deployment)
→ But use existing tools where applicable
Questions