How to deploy Jupyter notebooks as components of a Kubeflow ML pipeline (Part 2)
An easy way to run your Jupyter notebook on a Kubernetes cluster
In Part 1, I showed you how to create and deploy a Kubeflow ML pipeline using Docker components. In Part 2, I will show you how to make a Jupyter notebook a component of a Kubeflow ML pipeline. Whereas the Docker components are aimed at the folks operationalizing machine learning models, being able to run a Jupyter notebook on arbitrary hardware is more suitable for data scientists.
I’ll assume that you already have a Kubeflow pipelines cluster up and running as explained in the previous article. Typically, the “ML Platform Team” (part of the IT department) would manage the cluster for use by a team of data scientists.
Step 1. Start JupyterHub
Note: When I wrote this article, you had to run a notebook on the cluster itself. Now, though, a much better way is to use AI Platform Notebooks and submit code to the cluster remotely. Follow the instructions in this README file, making sure to start a TensorFlow 1.10 Notebook VM. If you do that, skip this step and start from Step 2.
From the Kubeflow pipelines user interface (http://localhost:8085/pipeline if you followed the instructions in the previous post), click on the link for Notebooks:
This will prompt you to start JupyterHub if this is the first time. Use your GCP username/password to log in. Then, select the version of TensorFlow that you want:
I chose TensorFlow v1.10 on a CPU.
Step 2. Clone git repo
Then, open up a Terminal window and git clone my repo:
git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
Switch back to the Jupyter notebook listing, navigate to data-science-on-gcp/updates/cloudml, and open flights_model.ipynb.
Step 3. Predicting flight delays using TensorFlow
The actual TensorFlow code (See full notebook here: flights_model.ipynb) isn’t important, but I want you to notice a few things. One is that I developed this notebook mostly in Eager mode, for easy debugging:
if EAGER_MODE:
    dataset = load_dataset(TRAIN_DATA_PATTERN)
    for n, data in enumerate(dataset):
        numpy_data = {k: v.numpy() for k, v in data.items()}  # .numpy() works only in eager mode
        print(numpy_data)
        if n > 3: break
Then, I trained the model for a few steps and specified more steps if “not in develop mode”:
num_steps = 10 if DEVELOP_MODE else (1000000 // train_batch_size)
Finally, I deployed it to Cloud ML Engine as a web service:
gcloud ml-engine versions create ${MODEL_VERSION} --model ${MODEL_NAME} --origin ${MODEL_LOCATION} --runtime-version 1.10
and made sure that I could send JSON to the deployed model:
{"dep_delay": 14.0, "taxiout": 13.0, "distance": 319.0, "avg_dep_delay": 25.863039, "avg_arr_delay": 27.0, "carrier": "WN", "dep_lat": 32.84722, "dep_lon": -96.85167, "arr_lat": 31.9425, "arr_lon": -102.20194, "origin": "DAL", "dest": "MAF"}
{"dep_delay": -9.0, "taxiout": 21.0, "distance": 301.0, "avg_dep_delay": 41.050808, "avg_arr_delay": -7.0, "carrier": "EV", "dep_lat": 29.984444, "dep_lon": -95.34139, "arr_lat": 27.544167, "arr_lon": -99.46167, "origin": "IAH", "dest": "LRD"}
to get back, for each instance, the probability that the flight will be late.
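For illustration, here is a minimal sketch (not the notebook's own code) of how such instances could be sent to the deployed model from Python using the Google API client library; the project, model, and version names below are placeholders you would replace with your own.
from googleapiclient import discovery

# Placeholder values -- substitute your own project, model name, and version.
PROJECT = 'your-project-id'
MODEL_NAME = 'your-model'
MODEL_VERSION = 'your-version'

instances = [
    {"dep_delay": 14.0, "taxiout": 13.0, "distance": 319.0,
     "avg_dep_delay": 25.863039, "avg_arr_delay": 27.0, "carrier": "WN",
     "dep_lat": 32.84722, "dep_lon": -96.85167,
     "arr_lat": 31.9425, "arr_lon": -102.20194,
     "origin": "DAL", "dest": "MAF"}
]

# Call the Cloud ML Engine online prediction REST API.
service = discovery.build('ml', 'v1')
name = 'projects/{}/models/{}/versions/{}'.format(PROJECT, MODEL_NAME, MODEL_VERSION)
response = service.projects().predict(name=name, body={'instances': instances}).execute()
print(response['predictions'])  # one prediction per instance; exact keys depend on the serving signature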
Step 4. Deploying the notebook as a component
So, I have a full-fledged notebook that carries out an ML workflow. Can I execute this notebook as a component of a Kubeflow pipeline? Recall from Part 1 that all it takes for something to be a component is for it to be a self-contained container that takes a few parameters and writes its outputs to files, either on the Kubeflow cluster or on Cloud Storage.
In order to deploy the flights_model notebook as a component:
- I have a cell at the top of my notebook whose tag is “parameters”. In this cell, I define any variables that I will want to re-execute the notebook with. In particular, I set up a variable called DEVELOP_MODE. In develop mode, I will read small datasets; when not in develop mode, I’ll train on the full dataset. Because I want you to be able to change them easily, I also make the PROJECT (to be billed) and the BUCKET (to store outputs) parameters.
- I then build a Docker image that is capable of executing my notebook. To execute a notebook, I will use the Python package papermill. My notebook uses Python 3, gcloud, and TensorFlow, so my Dockerfile captures all those dependencies:
FROM google/cloud-sdk:latest
RUN apt-get update -y && apt-get install --no-install-recommends -y -q ca-certificates python3-dev python3-setuptools python3-pip
RUN python3 -m pip install tensorflow==1.10 jupyter papermill
COPY run_notebook.sh ./
ENTRYPOINT ["bash", "./run_notebook.sh"]
- The entry point to the Docker image is run_notebook.sh, which uses papermill to execute the notebook:
gsutil cp $IN_NB_GCS input.ipynb
gsutil cp $PARAMS_GCS params.yaml
papermill input.ipynb output.ipynb -f params.yaml --log-output
gsutil cp output.ipynb $OUT_NB_GCS
- Essentially, the script copies the notebook to be run from Google Cloud Storage to the Kubeflow pod, runs the notebook with papermill, and copies the resulting output back to Google Cloud Storage (a rough Python equivalent of the papermill step is sketched after this list).
- But what’s params.yaml? It holds the configurable parameters for the notebook. For example, it could be:
---
BUCKET: cloud-training-demos-ml
PROJECT: cloud-training-demos
DEVELOP_MODE: False
- That’s it! When this Docker image is run, it will execute the supplied notebook and copy the output notebook (with plots plotted, models trained, etc.) to GCS.
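To make the papermill step concrete, here is a minimal Python sketch of what run_notebook.sh does from the command line, using papermill’s library API instead; the file names mirror the script above, and the parameters come from the example params.yaml.
import papermill as pm
import yaml

# Parameters that papermill injects into the notebook cell tagged "parameters"
# (these mirror the example params.yaml above).
with open('params.yaml') as f:
    params = yaml.safe_load(f)

# Execute the notebook; papermill writes an output notebook with all cell
# outputs (trained models, plots, logs, etc.) filled in.
pm.execute_notebook(
    'input.ipynb',      # the notebook copied down from GCS
    'output.ipynb',     # the executed copy, to be copied back to GCS
    parameters=params,
    log_output=True
)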
Step 5. Launch the notebook component as part of a pipeline
The point of running the notebook as one step of the pipeline is so that it can be orchestrated and reused in other pipelines. But just to show you how it can be done, this is how you would create a pipeline that executes only this notebook:
import kfp.components as comp
import kfp.dsl as dsl

# a single-op pipeline that runs the flights pipeline on the pod
@dsl.pipeline(
    name='FlightsPipeline',
    description='Trains, deploys flights model'
)
def flights_pipeline(
    inputnb=dsl.PipelineParam('inputnb'),
    outputnb=dsl.PipelineParam('outputnb'),
    params=dsl.PipelineParam('params')
):
    notebookop = dsl.ContainerOp(
        name='flightsmodel',
        image='gcr.io/cloud-training-demos/submitnotebook:latest',
        arguments=[
            inputnb,
            outputnb,
            params
        ]
    )
Nothing fancy — I’m creating a container, telling it to use my image that has TensorFlow, papermill, etc., and giving it the input and output notebooks and the params file. As the pipeline runs, logs get streamed to the pipelines log and show up in Stackdriver.
In my GitHub repo, creating and deploying the pipeline is shown in launcher.ipynb.
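For completeness, the gist of compiling and submitting such a pipeline looks something like the sketch below; the compiler and client calls are typical of the Kubeflow Pipelines SDK of that era, and the host URL, experiment name, and GCS paths are placeholders to replace with your own.
import kfp
import kfp.compiler as compiler

# Compile the pipeline function defined above into a deployable package...
compiler.Compiler().compile(flights_pipeline, 'flights_pipeline.tar.gz')

# ...and submit it to the Kubeflow Pipelines endpoint (placeholder host).
client = kfp.Client(host='http://localhost:8085/pipeline')
experiment = client.create_experiment('flights')
run = client.run_pipeline(
    experiment.id, 'flights-notebook-run', 'flights_pipeline.tar.gz',
    params={
        'inputnb': 'gs://your-bucket/notebooks/flights_model.ipynb',
        'outputnb': 'gs://your-bucket/notebooks/flights_model_out.ipynb',
        'params': 'gs://your-bucket/notebooks/params.yaml',
    })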
Try it out
If you haven’t done so already, please read and walk through Part 1 on how to create and deploy a Kubeflow ML pipeline using Docker images.
Then, try out this article’s approach of deploying a Jupyter notebook as a component in a Kubeflow pipeline:
- Start a cluster as explained in Part 1.
- On the cluster, open flights_model.ipynb, change the PROJECT and BUCKET to be something you own, and run the notebook, making sure it works.
- Open launcher.ipynb and walk through the steps of running flights_model.ipynb as a Kubeflow pipelines component.
The launcher notebook also includes the ability to launch the flights_model notebook on the Deep Learning VM, but ignore it for now — I’ll cover that in Part 3 of this series.
The notebook can be a unit of composability and reusability — but for this to happen, you have to take care to write small, single-purpose notebooks. What I did in this article — a huge, monolithic notebook — is not a good idea. The tradeoff is that if you use smaller notebooks, dependency tracking becomes difficult.
When to use what
- Use the Deep Learning VM for development and automation if you are a small team and don’t have anyone maintaining ML infrastructure like Kubeflow clusters for you. I will cover this in Part 3 of this series.
- If you work in a large organization where a separate ML Platform team manages your ML infrastructure (i.e., a Kubeflow cluster), this article (Part 2) shows you how to develop in Jupyter notebooks and deploy to Kubeflow pipelines. (The IT team will probably help you with the Docker parts if you show them this article).
- While notebooks will help you be agile, you will also be building up a lot of technical debt. Monolithic notebooks make reusability hard, and single-purpose notebooks make it hard to track dependencies. In addition, even though your logs will go to GCP’s logging platform (Stackdriver), they are probably unstructured cell output, which makes it hard to monitor the pipeline and react to failures. Plan, therefore, on moving mature code and models out of notebooks into separate pipeline components, each of which is a container. This is what I showed you in Part 1.
In other words: use the Deep Learning VM for small teams, Jupyter components for experimental work, and container-ops for mature models.