MLlib with MLFlow.pdf

MLlib with MLFlow
Michelle Hoogenhout
July 17th 2021

What I’ll cover
Use MLlib with MLflow end-to-end to:
● Prepare data in PySpark for use with MLlib
● Train and evaluate several classifier models
● Log model performance with MLflow Tracking

What you’ll need
● PySpark / Docker
● MLflow
https://github.com/michellehoog/mllib-example

Why pyspark?
Enables scalable analysis (without having to know Scala!)
Allows distributed processing
Creates ML pipelines
Interacts with Pandas

What algorithms are available in PySpark MLlib?
● Variety of classification and regression models, incl.
○ Linear & Logistic Regression
○ Tree-based models
○ Multilayer Perceptron
○ Naive Bayes
● Clustering
● Collaborative filtering
● Frequent pattern mining

Spark workflow
● DataFrame: Spark ML uses DataFrame from Spark SQL as an ML dataset, which
can hold a variety of data types. E.g., a DataFrame could have different columns
storing text, feature vectors, true labels, and predictions.
● Transformer: A Transformer is an algorithm which can transform one DataFrame
into another DataFrame. E.g., an ML model is a Transformer which transforms a
DataFrame with features into a DataFrame with predictions.
● Estimator: An Estimator is an algorithm which can be **fit** on a DataFrame to
produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a
DataFrame and produces a model.
● Pipeline: A Pipeline chains multiple Transformers and Estimators together to
specify an ML workflow.
● Parameter: All Transformers and Estimators share a common API for
specifying parameters.

Things to note
Data format
● Dense format
● Labels numeric and zero-indexed (feature values non-negative for Naive Bayes)
● Columns named ‘label’ and ‘features’ (the MLlib defaults)
Pipelines

MLFlow
pip install mlflow[extras]
https://www.mlflow.org/docs/latest/tutorials-and-examples/tutorial.html
Open source tracking and deployment of ML models
Not specific to Spark / MLlib

MLFlow can log:
● Git commit hash
● Start & end time
● Source
● Parameters
● Metrics
● Artifacts (output)

MLflow tracking overview
Step 1. Create experiment
Step 2. Add runs to your code
Step 3. View logs

MLflow tracking overview
All MLflow runs are logged to the active experiment, which can be set using any of the
following ways:
● Use the mlflow.set_experiment() command.
● Use the experiment_id parameter in the mlflow.start_run() command.
● Set one of the MLflow environment variables MLFLOW_EXPERIMENT_NAME or
MLFLOW_EXPERIMENT_ID.
If no active experiment is set, runs are logged to the notebook experiment (on Databricks)
or to the Default experiment (in open-source MLflow).

Viewing the Tracking MLflow UI
By default, the tracking API writes data to a local ./mlruns directory.
To view:
Start the MLflow UI with mlflow ui
MLflow’s Tracking UI: http://localhost:5000/#/
