MLlib with MLFlow
Michelle Hoogenhout
July 17th 2021
What I’ll cover
Use MLlib with MLflow end to end to:
● Prepare data in PySpark for use with MLlib
● Train and evaluate several classifier models
● Log model performance with MLflow Tracking
What you’ll need
● PySpark / Docker
● MLflow
https://github.com/michellehoog/mllib-example
Why PySpark?
Enables scalable analysis (without having to know Scala!)
Allows distributed processing
Creates ML pipelines
Interacts with Pandas
What algorithms are available in PySpark MLlib?
● Variety of classification and regression models, incl.
○ Linear & Logistic Regression
○ Tree-based models
○ Multilayer Perceptron
○ Naive Bayes
● Clustering
● Collaborative filtering
● Frequent pattern mining
Spark workflow
● DataFrame: Spark ML uses DataFrame from Spark SQL as an ML dataset, which
can hold a variety of data types. E.g., a DataFrame could have different columns
storing text, feature vectors, true labels, and predictions.
● Transformer: A Transformer is an algorithm which can transform one DataFrame
into another DataFrame. E.g., an ML model is a Transformer which transforms a
DataFrame with features into a DataFrame with predictions.
● Estimator: An Estimator is an algorithm which can be **fit** on a DataFrame to
produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a
DataFrame and produces a model.
● Pipeline: A Pipeline chains multiple Transformers and Estimators together to
specify an ML workflow.
● Parameter: All Transformers and Estimators now share a common API for
specifying parameters.
Things to note
Data format
● Dense format
● Labels numeric and zero-indexed (features non-negative for Naive Bayes)
● Columns named ‘label’ and ‘features’
Pipelines
MLflow
pip install mlflow[extras]
https://www.mlflow.org/docs/latest/tutorials-and-examples/tutorial.html
Open source tracking and deployment of ML models
Not specific to Spark / MLLib
MLflow can log:
● Git commit hash
● Start & end time
● Source
● Parameters
● Metrics
● Artifacts (output)
MLflow tracking overview
Step 1. Create experiment
Step 2. Add runs to your code
Step 3. View logs
MLflow tracking overview
All MLflow runs are logged to the active experiment, which can be set using any of the
following ways:
● Use the mlflow.set_experiment() command.
● Use the experiment_id parameter in the mlflow.start_run() command.
● Set one of the MLflow environment variables MLFLOW_EXPERIMENT_NAME or
MLFLOW_EXPERIMENT_ID.
If no active experiment is set, runs are logged to the notebook experiment on Databricks, or to the Default experiment elsewhere.
Viewing the Tracking MLflow UI
The tracking API writes data to a local ./mlruns directory.
To view:
Start the MLflow UI by running mlflow ui from the directory that contains ./mlruns
Then open MLflow’s Tracking UI: http://localhost:5000/#/
