Dr. Jim Dowling (1,2)
Slides together with Alexandru A. Ormenisan (1,2), Mahmoud Ismail (1,2)
PROVENANCE
FOR MACHINE LEARNING PIPELINES
KTH - Royal Institute of Technology (1)
Logical Clocks AB (2)
Growing Consensus on how to manage complexity of AI
Data validation
Distributed
ENGINEER
Model
Serving
A/B
Testing
Monitoring
Pipeline Management
HyperParameter
Tuning
Feature Engineering
Data
Collection
Hardware
Management
Data Model Prediction
φ(x)
ML PLATFORM
TRAIN and SERVE
FEATURE
STORE
What is provenance for ML Pipelines?
ML Pipeline
Feature
engineering
Training Serving
Raw Data Features Models
Governance
…search:
Discovery
Debug &
Analyse ?
Serving
Feature
Engineering
Training &
Validating
?
Pipeline
Automation
Integrity &
Garbage
Collection
Traceability &
Compliance
Why track provenance?
End-to-End Machine Learning (ML) Pipelines
▪ Logical Clocks – Hopsworks (world’s first/only fully open source)
▪ Uber Michelangelo
▪ Airbnb – Bighead/Zipline
▪ Comcast
▪ Twitter
▪ GO-JEK Feast
▪ Conde Nast
▪ Facebook FB Learner
▪ Netflix
▪ Reference: www.featurestore.org
Feature Stores in Production
Event Data
Raw Data
Data Lake
Data
Pipelines
BI
Platforms
SQL Data
Feature
Pipelines
Feature
Store
FEATURES FOR MODEL TRAINING
SERVE RT FEATURES TO ONLINE MODELS
FEATURES FOR ANALYTICAL MODELS (BATCH)
Feature Stores make existing Data Infrastructure available to Data Scientists and Online Apps
Click features every 10 secs
CDC data every 30 secs
User profile updates every hour
Featurized weblogs data every day
Online
Feature
Store
Offline
Feature
Store
SQL DW
S3, HDFS
SQL
Event Data
Real-Time Data
User-Entered Features (<2 secs)
Online
App
Low
Latency
Features
High
Latency
Features
Train,
Batch App
Feature Store
No existing database is both scalable (PBs) and low latency (<10ms). Hence, online + offline Feature Stores.
<10ms
TBs/PBs
Feature Pipelines update the Feature Store (2 Databases!) with data from backend Platforms
Feature Store
ClickFeatureGroup
TableFeatureGroup
UserFeatureGroup
LogsFeatureGroup
Event Data
SQL DW
S3, HDFS
SQL
DataFrameAPI
Kafka Input
Flink
RTFeatureGroup
Online
App
Train,
Batch App
User Clicks
DB Updates
User Profile Updates
Weblogs
Real-time features
Kafka Output
The FeatureGroup abstraction hides the complexity of dealing with 2 databases
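The idea of one abstraction in front of two databases can be sketched as follows. This is only an illustration with in-memory stand-ins for the offline and online stores; the class and method names (`ToyFeatureGroup`, `insert`, `get_online`, `read_offline`) are hypothetical and are not the Hopsworks API.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFeatureGroup:
    """Hypothetical stand-in for a feature group backed by two stores."""
    name: str
    primary_key: tuple
    offline_rows: list = field(default_factory=list)  # append-only history (scalable store)
    online_rows: dict = field(default_factory=dict)   # latest row per key (low-latency store)

    def insert(self, row: dict) -> None:
        key = tuple(row[k] for k in self.primary_key)
        self.offline_rows.append(dict(row))  # keep every version for training
        self.online_rows[key] = dict(row)    # overwrite with the newest for serving

    def get_online(self, *key):
        return self.online_rows.get(tuple(key))

    def read_offline(self) -> list:
        return list(self.offline_rows)


clicks = ToyFeatureGroup("clicks", primary_key=("user_id",))
clicks.insert({"user_id": 1, "clicks_10s": 3})
clicks.insert({"user_id": 1, "clicks_10s": 5})
```

The caller writes once; the online lookup returns only the newest row while the offline read keeps both versions.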
Features name Pclass Sex Survive Name Balance
Train / Test
Datasets
Survive name PClass Sex Balance
Join key
Feature
Groups
Titanic
Passenger List
Passenger
Bank Account
File format
.tfrecord
.npy
.csv
.hdf5,
.petastorm,
etc
Storage
GCS
Amazon S3
HopsFS
Features, FeatureGroups, and Train/Test Datasets are all versioned
Feature Store Concepts in Hopsworks
Example Ingestion of data into a FeatureGroup
https://docs.hopsworks.ai/
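dataframe = spark.read.json("s3://dataset/rain.json")
# do feature engineering on your dataframe: min-max scale 'val' into 'precipitation'
# (min_val and max_val are assumed to be computed beforehand, e.g. with an aggregation)
dataframe = dataframe.withColumn('precipitation',
    (dataframe.val - min_val) / (max_val - min_val))
fg = fs.create_feature_group("rain",
    version=1,
    description="Rain features",
    primary_key=['date', 'location_id'],
    online_enabled=True)
fg.save(dataframe)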
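The feature-engineering step in this example is a min-max normalization, (val - min) / (max - min). A minimal plain-Python sketch of the same formula, independent of Spark; the function name `min_max_scale` and the handling of constant input are assumptions made for illustration:

```python
def min_max_scale(values):
    """Scale values linearly into [0, 1]; constant input maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Avoid dividing by zero when every value is identical.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_scale([2.0, 4.0, 10.0])
```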
Example Creation of Train/Test Data from a Feature Store
https://docs.hopsworks.ai/
# Join features across FeatureGroups. Use "on=[..]" to explicitly enter the JOIN key.
feature_join = (rain_fg.select_all()
    .join(temperature_fg.select_all(), on=["date", "location_id"])
    .join(location_fg.select_all()))
td = fs.create_training_dataset("training_dataset",
    version=1,
    data_format="tfrecords",
    description="Training dataset, TfRecords format",
    splits={'train': 0.7, 'test': 0.2, 'validate': 0.1})
# The train/test/validation files are now saved to the filesystem (S3, HDFS, etc)
td.save(feature_join)
# Use the training data as follows:
df = td.read(split="train")
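What a `splits` dictionary such as `{'train': 0.7, 'test': 0.2, 'validate': 0.1}` means can be sketched in plain Python. This is not how Hopsworks materializes splits; `split_rows` and its seeded shuffle are hypothetical, chosen only to show the fractions being applied.

```python
import random


def split_rows(rows, splits, seed=0):
    """Shuffle rows and cut them into named splits by fraction."""
    if abs(sum(splits.values()) - 1.0) > 1e-9:
        raise ValueError("split fractions must sum to 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    out, start = {}, 0
    names = list(splits)
    for i, name in enumerate(names):
        if i == len(names) - 1:
            end = len(shuffled)  # last split takes the remainder
        else:
            end = start + int(round(splits[name] * len(shuffled)))
        out[name] = shuffled[start:end]
        start = end
    return out


parts = split_rows(range(100), {"train": 0.7, "test": 0.2, "validate": 0.1})
```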
Event Data
Raw Data
Feature Pipeline FEATURE STORE TRAIN/VALIDATE MODEL SERVING
MONITOR
Data Lake
The end of the End-to-End ML Pipeline!
ML Pipelines start and stop at the Feature Store
End-to-End ML Pipelines on Hopsworks. Provenance is Collecting Metadata.
HopsFS
Code and
configuration
Data Lake, Warehouse, Kafka
Feature
Store
Model
registry
Prediction Logs
Monitoring Logs
Feature
Engineering
Serving on
Kubernetes
Model
Training
Model
Deploy
Serving and
Monitoring
Experiments/
Development Scaleout
Metadata
Features
Validate
Deploy to
Log
Artifact (File)
Artifact
Metadata
Elasticsearch
Sync
Metadata is data that describes other data.
What is Metadata?
Artifacts and Metadata in End-to-End ML Pipelines
File System (S3, HopsFS, etc)
Metastore (Database)
Provenance queries
● SQL or Free-Text or Graph?
● Update Throughput?
● Latency of queries?
● Size of Metadata?
https://www.dataplatformschool.com/blog/w0y8g0-the-data-governance-zoo
Metadata Cataloging Systems - a whole industry
3 Mechanisms for Metadata Collection. Polyglot Metadata Storage for Efficient Querying.
File Systems, Databases, Data Warehouses, Message Bus, etc
Metastore (Database)
Crawler
Job
Pull
(REST) API
Push
Change Data
Capture(CDC) API
Application
Instrumented API
Job
Graph DB Search (Elastic)
Metadata Query API
Artifacts and Metadata in End-to-End ML Pipelines
File System (S3, HopsFS, etc)
Metastore
Consistency issues
Synchronization
?
Metadata is data that describes other data.
Unspoken Assumption:
Why are Data and Metadata always separate stores?
What is Metadata Revisited?
Artifacts and Metadata in End-to-End ML Pipelines
Raw Data Features
Experiments
(Progs, Logs,
Checkpoints)
Models
Artifacts
Metadata
File System (S3, HopsFS, etc)
Metastore (Database)
Experiments
(HParams, Env,
Results, Graphs)
Feature Stats
(Min,Max,Std,
Mean, Distrib.)
Governance
(Privileges,Audit,
Retention, etc)
Model Desc
(Privileges, Perf,
Provenance, etc)
Mechanism 4: Artifacts and Metadata in the same system - a Unified Metadata Layer (Hopsworks)
Features
Experiments
(Progs, Logs,
Checkpoints)
Models
Artifacts
Metadata
HopsFS
Metastore (NDB)
Experiments
(HParams, Env,
Results, Graphs)
Feature Stats
(Min,Max,Std,
Mean, Distrib.)
Governance
(Privileges,Audit,
Retention, etc)
Model Desc
(Privileges, Perf,
Provenance, etc)
Metastore (NDB)
Raw Data
extends extends extends extends
Libraries
Application
Data
platform
Metadata
store
Explicit
Top-down tracking of provenance.
Push/Pull, CDC, or instrumented
application or library code.
Standalone Metadata Store.
Implicit
Bottom-up tracking of provenance.
Requires redesigning the platform.
Conventions link files to artifacts.
Metadata is strongly consistent
with storage platform.
Mechanism 4: Implicit Provenance
ePipe: Near Real-Time Polyglot Persistence of HopsFS Metadata
Mahmoud Ismail (1), Mikael Ronström (2), Seif Haridi (1), Jim Dowling (1)
(1) KTH - Royal Institute of Technology, (2) Oracle
19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (IEEE/ACM CCGrid 2019), May 15th
Tightly coupled Metadata and Data - replicating Metadata to External Systems
HopsFS
Scaleout
Metadata
Artifact (File)
Artifact
Metadata
Elasticsearch
Sync
DN1 DN2 DN3 DN4 DN5
● Highly scalable next-generation distribution of HDFS
mkdir /Images
write /Images/cat.png
Images
cat.png
/
DN1, DN3, DN5
What is HopsFS?
inodeID name parentID
Block
storage
(Datanodes)
Metadata
storage
NDB
mkdir /Images
1 / 0
2 Images 1
3 cat.png 2
write /Images/cat.png
What is HopsFS?
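The inode table above stores each file or directory as a row that points to its parent. A minimal sketch of how a path could be resolved against such a table, using a Python dictionary in place of NDB; the function `resolve` is hypothetical and only mirrors the three rows shown on the slide.

```python
# Toy inode table mirroring the slide: inodeID -> (name, parentID)
INODES = {
    1: ("/", 0),
    2: ("Images", 1),
    3: ("cat.png", 2),
}


def resolve(path, inodes=INODES):
    """Return the inodeID for an absolute path, or None if it does not exist."""
    current = 1  # start at the root inode
    for part in [p for p in path.split("/") if p]:
        matches = [
            inode_id
            for inode_id, (name, parent) in inodes.items()
            if parent == current and name == part
        ]
        if not matches:
            return None
        current = matches[0]
    return current
```

Resolving `/Images/cat.png` walks root, then `Images`, then `cat.png`, one parent lookup per path component.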
● Drop-in replacement distribution of HDFS
● 16X - 37X the throughput of HDFS
● 37 times larger clusters than HDFS
● 10 times lower latency
What is HopsFS?
HopsFS
Get all images with 1 cat and 1 guitar
1 cat and 1 guitar
Full-text search is not supported by NDB
Search in HopsFS
Store
X
Store
Y
HopsFS
App A
App B
?
Polyglot Persistence - Replicating Metadata to External Systems for Efficient Querying
● ePipe is a databus that provides replicated metadata as a service for HopsFS
● ePipe internally
• creates a consistent and correctly ordered change stream for HopsFS metadata
• and eventually delivers the change stream to consumers with low (sub-second) latency, i.e. in near real time
ePipe
● Extend HopsFS with a logging table to log file system changes
● Leverage the NDB events API to live stream changes on the logging table to ePipe
● ePipe enriches the file system events with appropriate data and publishes the enriched events to the consumers
ePipe: Design Decisions
HopsFS Namenodes: Create /f1, Create /f2, Delete /f2, Delete /f1

Inodes table (NDB)
inodeID  name  parentID
1        /     0
2        f1    1
3        f2    1

Logging table (NDB)
name  operation
f1    CREATE
f2    CREATE
f2    DELETE
f1    DELETE

Inodes table and logging table updated in the same Transaction to ensure Consistency/Integrity
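Updating both tables in one transaction can be sketched with Python's built-in `sqlite3` module. SQLite stands in for NDB here purely for illustration, and the table and function names (`inodes`, `fs_log`, `create`, `delete`) are assumptions, not the HopsFS schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE inodes (inode_id INTEGER PRIMARY KEY, name TEXT, parent_id INTEGER,
                         UNIQUE (parent_id, name));
    CREATE TABLE fs_log (seq INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, operation TEXT);
    INSERT INTO inodes VALUES (1, '/', 0);
""")


def create(name, parent_id=1):
    # "with conn" commits both statements together, or rolls both back on error.
    with conn:
        conn.execute("INSERT INTO inodes (name, parent_id) VALUES (?, ?)", (name, parent_id))
        conn.execute("INSERT INTO fs_log (name, operation) VALUES (?, 'CREATE')", (name,))


def delete(name, parent_id=1):
    with conn:
        conn.execute("DELETE FROM inodes WHERE name = ? AND parent_id = ?", (name, parent_id))
        conn.execute("INSERT INTO fs_log (name, operation) VALUES (?, 'DELETE')", (name,))


create("f1")
create("f2")
delete("f2")
delete("f1")
log = conn.execute("SELECT name, operation FROM fs_log ORDER BY seq").fetchall()
remaining = conn.execute("SELECT name FROM inodes").fetchall()
```

Because the log row is written in the same transaction as the inode change, the log can never describe a change that did not happen, and no change can go unlogged.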
HopsFS
NDB
NDB
log fs
changes
ePipe
Change
stream
Store
X
Store
Y
App A
App B
enrichment
subscribe
for changes
ePipe
HopsFS
NDB
NDB
ePipe
Create f1
Append f1
Create f2
Delete f1
Delete f2
Create f1
Append f1
Create f2
Delete f1
Delete f2
Epoch1
Epoch2
Epoch3
Order across epochs Order within epoch
Delete f1 after Create f1 Create f1 ?? Append f1
Delete f2 after Create f2 Create f2 ?? Delete f1
…..
Inconsistencies
100 ms
Ordering of Log Entries
● Property 1: The epochs are totally ordered.
● Property 2: The changes within the same transaction happen in the same epoch.
● Property 3: The changes on files are ordered only if they are in different epochs, that is, no ordering is guaranteed within the same epoch.
NDB Ordering Properties
HopsFS
NDB
NDB
ePipe
Epoch1
Epoch2
Epoch3
Delete f2 ,2
Create f1 ,1
Append f1 ,2
Create f2 ,1
Delete f1 ,3
Delete f2 ,2
Create f1 ,1
Append f1 ,2
Create f2 ,1
Delete f1 ,3
We introduced a version number per inode, which is incremented whenever a change occurs to that inode.
Append f1 after Create f1
Create f2 ?? Delete f1
Strengthening NDB Ordering Properties
● Property 1 & 2 & 3
● Property 4 & 5: The version number ensures the serializability of the changes on the same file/directory within epochs.
● Property 6: The order of changes for different files/directories within the same epoch doesn't matter.
ePipe ordering Properties
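The combined effect of these properties can be sketched as a sort: epochs give the coarse order, and the per-inode version number orders changes to the same file inside an epoch. The `LogEntry` type, the `order_entries` function and the epoch assignments below are hypothetical; this is not ePipe's implementation.

```python
from typing import NamedTuple


class LogEntry(NamedTuple):
    epoch: int    # totally ordered across the change stream
    inode: str    # file or directory the change applies to
    version: int  # per-inode counter, incremented on every change
    op: str


def order_entries(entries):
    """Order by epoch, then by per-inode version inside each epoch."""
    return sorted(entries, key=lambda e: (e.epoch, e.inode, e.version))


# Entries as they might arrive: unordered within an epoch.
received = [
    LogEntry(1, "f1", 2, "Append"),
    LogEntry(1, "f2", 1, "Create"),
    LogEntry(1, "f1", 1, "Create"),
    LogEntry(2, "f2", 2, "Delete"),
    LogEntry(2, "f1", 3, "Delete"),
]
ordered = order_entries(received)
f1_ops = [e.op for e in ordered if e.inode == "f1"]
f2_ops = [e.op for e in ordered if e.inode == "f2"]
```

The relative order of `f1` and `f2` entries inside one epoch is arbitrary in this sketch, which is acceptable because the order of changes to different files within an epoch does not matter.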
Logging overhead on HopsFS (figure)
Notifications Throughput (figure, log scale base 10)
Latency: average Lag Time (figure, log scale base 10)
● Supports failure recovery thanks to the persistent logging table
• The log entries are deleted only once the associated events are successfully replicated to the downstream consumers.
• At least once delivery semantics.
● Pluggable architecture
• For example, filter events based on file name or any other attribute.
● Not Limited to HopsFS
• Can be extended to watch for other logging tables for different purposes.
More about ePipe
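At-least-once delivery backed by a persistent log can be sketched as below: an entry is removed only after the consumer call returns successfully, so a failure leaves it in place for a retry and the consumer may see it twice. The names `replicate` and `flaky_consumer` and the dictionary used as the log table are hypothetical.

```python
def replicate(log_table, deliver):
    """Deliver each pending entry; delete it from the log only after success."""
    for entry_id in sorted(log_table):
        try:
            deliver(log_table[entry_id])
        except ConnectionError:
            break  # keep this entry and all later ones for the next attempt
        del log_table[entry_id]


received = []
attempts = {"count": 0}


def flaky_consumer(event):
    """Records the event, but loses the acknowledgement on the second call."""
    attempts["count"] += 1
    received.append(event)
    if attempts["count"] == 2:
        raise ConnectionError("acknowledgement lost")


log_table = {1: "CREATE f1", 2: "CREATE f2", 3: "DELETE f2"}
replicate(log_table, flaky_consumer)  # stops at entry 2
replicate(log_table, flaky_consumer)  # retries entry 2, then continues
```

The duplicate `CREATE f2` in `received` is the price of at-least-once semantics; consumers need to apply events idempotently.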
● A databus that provides replicated metadata as a service for HopsFS
● Low overhead on HopsFS
● Low replication lag (sub-second)
● High throughput
● Pluggable architecture
ePipe Properties
What is provenance - ML Pipeline
ML Pipeline
Feature
engineering
Training Serving
Raw Data Features Models
MLFlow Metadata - Explicit API calls
import mlflow

def train(data_path, max_depth, min_child_weight, estimators, model_name):
    X_train, X_test, y_train, y_test = build_data(..)
    # MLflow expects a SQLAlchemy-style database URI, not a JDBC one
    mlflow.set_tracking_uri("mysql://username:password@host:3306/database")
    mlflow.set_experiment("My Experiment")
    with mlflow.start_run() as run:
        ...
        mlflow.log_param("max_depth", max_depth)
        mlflow.log_param("min_child_weight", min_child_weight)
        mlflow.log_param("estimators", estimators)
        with open("test.txt", "w") as f:
            f.write("hello world!")
        # log_artifact takes a single file; log_artifacts takes a directory
        mlflow.log_artifact("test.txt")
        ...
        model.fit(X_train, y_train) # auto-logging
        ...
        mlflow.tensorflow.log_model(model, "tensorflow-model",
            registered_model_name=model_name)
Hopsworks Metadata - Implicit Metadata
def train(data_path, max_depth, min_child_weight, estimators):
    X_train, X_test, y_train, y_test = build_data(..)
    ...
    print("hello world") # monkeypatched - prints in notebook
    ...
    model.fit(X_train, y_train) # auto-logging
    ...
    # Saves model to "hopsfs://Projects/myProj/models/.."
    hops.export_model(model, "tensorflow",..,model_name)
    ...
    # maggy makes an API call to track this dict
    return {'accuracy': accuracy, 'loss': loss, 'diagram': 'diagram.png'}

from maggy import experiment
experiment.lagom(train, name="My Experiment", ...)
Metadata
In [ ]:
add(fg_eng, raw_data, features)
…
add(training, features, model)
<fg_eng, raw_data, features>
Pipeline code
What is provenance - Metadata
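The triples above, of the form <activity, input, output>, are enough to answer lineage questions. A minimal sketch, assuming the `add` call from the slide simply appends a triple to a list; the `upstream` query is hypothetical and not part of any Hopsworks API.

```python
links = []  # (activity, input_artifact, output_artifact)


def add(activity, source, target):
    """Record that an activity consumed source and produced target."""
    links.append((activity, source, target))


def upstream(artifact):
    """Return every artifact that the given artifact was derived from."""
    found, frontier = set(), [artifact]
    while frontier:
        node = frontier.pop()
        for _, src, dst in links:
            if dst == node and src not in found:
                found.add(src)
                frontier.append(src)
    return found


add("fg_eng", "raw_data", "features")
add("training", "features", "model")
```

Asking for the upstream of `model` returns both `features` and `raw_data`, which is the traceability question a compliance audit would pose.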
Feature
engineering
Training Serving
Raw
Data
Features Models
ePipe (with ML Provenance)
Distributed File System (HopsFS)
Full Text Search (Elastic)
Feature
engineering
Training Serving
Raw
Data
Features Models
Let the platform manage the metadata!
ML Artifacts
Features, Feature Metadata,
Train/Test Datasets
Models, Model Metadata
Possibly thousands of files
Distributed File System
Generate thousands of operations
Change Data Capture (CDC)
Capture only relevant operations
Systems Challenges - Operations
More context for file system operations?
user: John user: Alex
Are any of these operations related?
user: John,
app1
user: John,
app3
user: Alex,
app2
Certificates (with AppId) enabled FS Operation
Order of operations
Richer provenance information
Distributed File System
Read/Write/Create/Delete/XAttr/Metadata
Resource Manager - Yarn (Application Context)
Application X
Job Manager - Hopsworks (Job Context)
Workflow Manager - Airflow (Pipeline Context)
Link input/output files via Apps
Different Executions of the same Job
Jobs as Stages of the same Pipeline
Additional Context
Richer provenance information
<file, op, user_id, app_id, job_id, pipeline_id>
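With the application context attached to every file system operation, inputs and outputs can be linked through the application that touched them. A minimal sketch using the tuple fields above; the `FsEvent` type, the operation names and the sample paths are assumptions for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class FsEvent(NamedTuple):
    file: str
    op: str
    user_id: str
    app_id: str
    job_id: str
    pipeline_id: str


def link_inputs_to_outputs(events):
    """Group files read and files written by the same application."""
    by_app = defaultdict(lambda: {"inputs": set(), "outputs": set()})
    for e in events:
        if e.op == "READ":
            by_app[e.app_id]["inputs"].add(e.file)
        elif e.op in ("CREATE", "APPEND"):
            by_app[e.app_id]["outputs"].add(e.file)
    return dict(by_app)


events = [
    FsEvent("/training_datasets/td_1", "READ", "john", "app1", "train_job", "pipeline_a"),
    FsEvent("/models/m_1", "CREATE", "john", "app1", "train_job", "pipeline_a"),
    FsEvent("/logs/out.txt", "CREATE", "alex", "app2", "other_job", "pipeline_b"),
]
lineage = link_inputs_to_outputs(events)
```

Grouping by `job_id` or `pipeline_id` instead would relate different executions of the same job, or the stages of one pipeline.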
Hopsworks Conventions
/training_datasets
/models
/logs
/notebooks
/featurestore
CDC API - Filtering Mechanisms
/training_datasets
/models
/featurestore
Project
Example
CDC API - Filtering Mechanisms
Path based filtering
Tag based filtering
Example:
Custom metadata based on HDFS XAttr.
Tag: <tutorial>, <debug>
Tags can enable logging of all operations,
if path based filtering is not easy to set
CDC API - Filtering Mechanisms
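Path-based and tag-based filtering can be combined in a single predicate. The sketch below uses the directory names and tags from the slides, but the function `should_log` and the way tags are passed in are hypothetical; in the platform the tags would come from HDFS extended attributes.

```python
TRACKED_PREFIXES = ("/training_datasets", "/models", "/featurestore")
TRACKED_TAGS = {"tutorial", "debug"}


def should_log(path, xattr_tags=()):
    """Keep an operation if its path or any of its tags is tracked."""
    in_tracked_dir = any(
        path == prefix or path.startswith(prefix + "/") for prefix in TRACKED_PREFIXES
    )
    has_tracked_tag = bool(TRACKED_TAGS.intersection(xattr_tags))
    return in_tracked_dir or has_tracked_tag
```

A file under `/logs` is ignored by default but is logged once it carries the `debug` tag, which covers cases where a path rule is hard to write.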
Path based filtering
Tag based filtering
Coalesce FS Operations
Example:
Read file1
Read file2
…
Read filen
Access1
Training Dataset
CDC API - Filtering Mechanisms
Parent Create Artifact Create
Parent Delete Artifact Delete
Children Read Artifact Access
Children
Create/Delete/
Append/Truncate
Artifact Mutation
Namenodes
NDB
ePipe
Cache
per namenode
Log table
With duplicates
Remove duplicates
In [ ]:
hops.load_training_dataset(
    "/Projects/LC/Training_Datasets/ImageNet")
…
hops.save_model("/Projects/LC/Models/ResNet")
Optimization - FS Operation Coalesce
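Coalescing can be sketched as mapping each file-level operation to an artifact-level event and emitting each event only once. The mapping follows the parent/children table above; the function `coalesce`, the event names and the generated file names are assumptions for illustration.

```python
def coalesce(operations, artifact_dirs):
    """Collapse per-file operations into at most one event per (artifact, kind)."""
    child_kinds = {"READ": "ACCESS", "CREATE": "MUTATION", "DELETE": "MUTATION",
                   "APPEND": "MUTATION", "TRUNCATE": "MUTATION"}
    parent_kinds = {"CREATE": "CREATE", "DELETE": "DELETE"}
    seen, events = set(), []
    for op, path in operations:
        for artifact in artifact_dirs:
            if path == artifact:
                kind = parent_kinds.get(op)      # operation on the artifact directory itself
            elif path.startswith(artifact + "/"):
                kind = child_kinds.get(op)       # operation on a file inside the artifact
            else:
                continue
            if kind and (artifact, kind) not in seen:
                seen.add((artifact, kind))
                events.append((artifact, kind))
    return events


dataset = "/Projects/LC/Training_Datasets/ImageNet"
model = "/Projects/LC/Models/ResNet"
ops = [("READ", f"{dataset}/part-{i}") for i in range(1000)] + [("CREATE", model)]
events = coalesce(ops, [dataset, model])
```

One thousand reads of the dataset's files collapse into a single access event, so the metadata store records "this application used ImageNet" rather than a thousand rows.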
Path based filtering
Tag based filtering
Coalesce FS Operations
Filtered Operations
Filesystem Op Metadata Stored
Create/Delete Artifact existence
XAttr Add metadata to artifact
Read Artifact used by ..
Children Files
Create/Delete
Artifact mutation
Append/Truncate Artifact mutation
Permissions/ACL Artifact metadata mutation
CDC API - Filtering Mechanisms
DataOps
CI/CD Platform
Feature Store
...
Commit-0002
Commit-0001
Commit-0097
Model Training &
Model Validation
MLOps
CI/CD Platform
Model Repository
Model Serving
& Monitoring
Data
1. Data
2. Develop/Test Feature Pipelines
3. Develop Model
4. Train/Validate Model
5. Deploy/Monitor
Hopsworks ML Pipelines
Metadata Store
CDC events CDC events
CDC events
API calls
CDC events
API calls
Bias Detected
!
?
Provenance example
What do I do
DeltaCommit
10/01/20@10:10:01
DeltaCommit
10/01/20@10:12:01
DeltaCommit
10/01/20@12:10:01
… …
DeltaCommit
20/07/20@02:10:01
Hudi Feature Timeline
Claim of Model Bias!
Can we determine the exact features used?
Provenance + Time travel
Feature
engineering
Training Serving
Raw Data Features Models
Training
timestamp
Application
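Combining provenance with time travel amounts to looking up the feature commit that was current when training ran. A minimal sketch, assuming the commit timestamps on the Hudi timeline above are in day/month/year order; the function `commit_as_of` and the training timestamp used below are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime


def commit_as_of(commit_times, training_time):
    """Return the latest commit at or before training_time, or None."""
    ordered = sorted(commit_times)
    idx = bisect_right(ordered, training_time)
    return ordered[idx - 1] if idx else None


commits = [
    datetime(2020, 1, 10, 10, 10, 1),
    datetime(2020, 1, 10, 10, 12, 1),
    datetime(2020, 1, 10, 12, 10, 1),
    datetime(2020, 7, 20, 2, 10, 1),
]
# Hypothetical training timestamp recovered from provenance metadata.
used = commit_as_of(commits, datetime(2020, 1, 10, 11, 0, 0))
```

Provenance supplies the training timestamp and the feature group that was read; the timeline then pins down which commit, and therefore which exact feature values, the model saw.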
● Provenance improves understanding of complex ML Pipelines.
● Provenance should not change the core ML pipeline code.
● Provenance facilitates Debugging, Analyzing, Automating and Cleaning
of ML Pipelines.
● Provenance and Time Travel facilitate reproducibility of experiments.
● In Hopsworks, we introduced a new mechanism for provenance based
on embedded metadata in a scale-out consistent metadata layer.
Summary
● Ormenisan et al, Time-travel and Provenance for ML Pipelines, USENIX OpML 2020
● Niazi et al, HopsFS, USENIX FAST 2017
● Ismail et al, ePipe, CCGrid 2019
● Small Files in HopsFS, ACM Middleware 2018
● Ismail et al, HopsFS-S3, ACM Middleware 2020
● Meister et al, Oblivious Training Functions, 2020
● Hopsworks
References
@hopsworks
http://github.com/logicalclocks/hopsworks

Metadata and Provenance for ML Pipelines with Hopsworks

  • 1. Dr. Jim Dowling1,2 Slides together with Alexandru A. Ormenisan1,2 , Mahmoud Ismail1,2 PROVENANCE FOR MACHINE LEARNING PIPELINES KTH - Royal Institute of Technology (1) Logical Clocks AB (2)
  • 2. Growing Consensus on how to manage complexity of AI Data validation Distributed ENGINEER Model Serving A/B Testing Monitoring Pipeline Management HyperParameter Tuning Feature Engineering Data Collection Hardware Management Data Model Prediction φ(x) 2
  • 3. Growing Consensus on how to manage complexity of AI Data validation Distributed ENGINEER Model Serving A/B Testing Monitoring Pipeline Management HyperParameter Tuning Feature Engineering Data Collection Hardware Management Data Model Prediction φ(x) ML PLATFORM TRAIN and SERVE FEATURE STORE
  • 4. What is provenance for ML Pipelines? ML Pipeline Feature engineering Training Serving Raw Data Features Models
  • 5. Governance …search: Discovery Debug & Analyse ? Serving Feature Engineering Training & Validating ? Pipeline Automation Integrity & Garbage Collection Traceability & Compliance Why track provenance?
  • 7. ▪ Logical Clocks – Hopsworks (world’s first/only fully open source) ▪ Uber Michelangelo ▪ Airbnb – Bighead/Zipline ▪ Comcast ▪ Twitter ▪ GO-JEK Feast ▪ Conde Nast ▪ Facebook FB Learner ▪ Netflix ▪ Reference: www.featurestore.org Feature Stores in Production
  • 8. Event DataRaw Data Data Lake Data Pipelines BI Platforms SQL Data Feature Pipelines Feature Store FEATURES FOR MODEL TRAINING SERVE RT FEATURES TO ONLINE MODELS FEATURES FOR ANALYTICAL MODELS (BATCH) Feature Stores make existing Data Infrastructure available to Data Scientists and Online Apps
  • 9. Click features every 10 secs CDC data every 30 secs User profile updates every hour Featurized weblogs data every day Online Feature Store Offline Feature Store SQL DW S3, HDFS SQL Event Data Real-Time Data User-Entered Features (<2 secs) Online App Low Latency Features High Latency Features Train, Batch App Feature Store No existing database is both scalable (PBs) and low latency (<10ms). Hence, online + offline Feature Stores. <10ms TBs/PBs Feature Pipelines update the Feature Store (2 Databases!) with data from backend Platforms
  • 10. Feature Store ClickFeatureGroup TableFeatureGroup UserFeatureGroup LogsFeatureGroup Event Data SQL DW S3, HDFS SQL DataFrameAPI Kafka Input Flink RTFeatureGroup Online App Train, Batch App User Clicks DB Updates User Profile Updates Weblogs Real-time features Kafka Output The FeatureGroup abstraction hides the complexity of dealing with 2 databases
  • 11. Features name Pclass Sex Survive Name Balance Train / Test Datasets Survivename PClass Sex Balance Join key Feature Groups Titanic Passenger List Passenger Bank Account File format .tfrecord .npy .csv .hdf5, .petastorm, etc Storage GCS Amazon S3 HopsFS Features, FeatureGroups, and Train/Test Datasets are all versioned The FeatureGroup abstraction hides the complexity of dealing with 2 databasesFeature Store Concepts in Hopsworks
  • 12. Example Ingestion of data into a FeatureGroup https://docs.hopsworks.ai/ dataframe = spark.read.json("s3://dataset/rain.json") # do feature engineering on your dataframe df.withColumn('precipitation', (df.val-min)/(max-min)) fg = fs.create_feature_group("rain", version=1, description="Rain features", primary_key=['date', 'location_id'], online_enabled=True) fg.save(dataframe)
  • 13. Example Creation of Train/Test Data from a Feature Store https://docs.hopsworks.ai/ # Join features across FeatureGroups. Use “on=[..]” to explicitly enter the JOIN key. feature_join = rain_fg.select_all() .join(temperature_fg.select_all(), on=["date", "location_id"]) .join(location_fg.select_all())) td = fs.create_training_dataset("training_dataset", version=1, data_format="tfrecords", description="Training dataset, TfRecords format", splits={'train': 0.7, 'test': 0.2, 'validate': 0.1}) # The train/test/validation files are now saved to the filesystem (S3, HDFS, etc) td.save(feature_join) # Use the training data as follows: df = td.read(split="train")
  • 14. Event DataRaw Data Feature Pipeline FEATURE STORE TRAIN/VALIDATE MODEL SERVING MONITOR Data Lake The end of the End-to-End ML Pipeline! ML Pipelines start and stop at the Feature Store
  • 15. End-to-End ML Pipelines on Hopsworks. Provenance is Collecting Metadata. HopsFS Code and configuration Data Lake, Warehous e, Kafka Feature Store Model registry Prediction Logs Monitoring Logs Feature Engineering Serving on Kubernetes Model Training Model Deploy Serving and Monitoring Experiments/ Development Scaleout Metadata Features Validate Deploy to Log Artifact (File) Artifact Metadata Elasticsearch Sync
  • 16. Metadata is data that describes other data. Artifacts and Metadata in End-to-End ML PipelinesWhat is Metadata?
  • 17. Artifacts and Metadata in End-to-End ML Pipelines File System (S3, HopsFS, etc) Metastore (Database) Provenance queries ● SQL or Free-Text or Graph? ● Update Throughput? ● Latency of queries? ● Size of Metadata?
  • 19. 3 Mechanisms for Metadata Collection. Polyglot Metadata Storage for Efficient Querying. [Diagram labels: File Systems, Databases, Data Warehouses, Message Bus, etc; Crawler Job; Pull (REST) API; Push; Change Data Capture (CDC) API; Instrumented Application; API; Job; Metastore (Database); Graph DB; Search (Elastic); Metadata Query API]
  • 20. Artifacts and Metadata in End-to-End ML Pipelines. File System (S3, HopsFS, etc) and Metastore: Consistency issues? Synchronization?
  • 21. What is Metadata Revisited? Metadata is data that describes other data. Unspoken Assumption: Why are Data and Metadata always separate stores? (Artifacts and Metadata in End-to-End ML Pipelines)
  • 22. Artifacts and Metadata in End-to-End ML Pipelines
        Artifacts, in the File System (S3, HopsFS, etc): Raw Data; Features; Experiments (Progs, Logs, Checkpoints); Models
        Metadata, in the Metastore (Database): Experiments (HParams, Env, Results, Graphs); Feature Stats (Min, Max, Std, Mean, Distrib.); Governance (Privileges, Audit, Retention, etc); Model Desc (Privileges, Perf, Provenance, etc)
  • 23. Mechanism 4: Artifacts and Metadata in the same system - a Unified Metadata Layer (Hopsworks)
        Artifacts, in HopsFS: Raw Data; Features; Experiments (Progs, Logs, Checkpoints); Models
        Metadata, in the Metastore (NDB), each extending the HopsFS metadata that is also held in NDB: Experiments (HParams, Env, Results, Graphs); Feature Stats (Min, Max, Std, Mean, Distrib.); Governance (Privileges, Audit, Retention, etc); Model Desc (Privileges, Perf, Provenance, etc)
  • 24. Mechanism 4: Implicit Provenance
        Explicit: Top-down tracking of provenance. Push/Pull, CDC, or instrumented application or library code. Standalone Metadata Store.
        Implicit: Bottom-up tracking of provenance. Requires redesigning the platform. Conventions link files to artifacts. Metadata is strongly consistent with storage platform.
        [Diagram layers: Libraries, Application, Data platform, Metadata store]
  • 25. ePipe: Near Real-Time Polyglot Persistence of HopsFS Metadata. Mahmoud Ismail (1), Mikael Ronström (2), Seif Haridi (1), Jim Dowling (1). (1) KTH - Royal Institute of Technology, (2) Oracle. 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (IEEE/ACM CCGrid 2019), May 15th. Tightly coupled Metadata and Data - replicating Metadata to External Systems. [Diagram labels: HopsFS, Scaleout Metadata, Artifact (File), Artifact Metadata, Sync, Elasticsearch]
  • 26. What is HopsFS?
        ● Highly scalable next-generation distribution of HDFS
        [Diagram: mkdir /Images and write /Images/cat.png build the namespace / -> Images -> cat.png; the blocks of cat.png are stored on DN1, DN3, DN5 out of datanodes DN1 to DN5]
  • 27. What is HopsFS? Metadata storage is in NDB; block storage is on the Datanodes. After mkdir /Images and write /Images/cat.png, the inodes table (inodeID, name, parentID) holds: (1, /, 0), (2, Images, 1), (3, cat.png, 2).
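The (inodeID, name, parentID) rows above are enough to resolve a path by walking from the root one component at a time. The following is only a minimal sketch of that idea: the dictionary `INODES` and the function `resolve` are hypothetical stand-ins for illustration, not the HopsFS implementation or its NDB schema.

```python
# Hypothetical in-memory stand-in for the inodes table: inodeID -> (name, parentID).
INODES = {
    1: ("/", 0),
    2: ("Images", 1),
    3: ("cat.png", 2),
}

ROOT_ID = 1  # inodeID of "/" in this toy table


def resolve(path, inodes=INODES):
    """Return the inodeID for an absolute path, or None if a component is missing."""
    current = ROOT_ID
    for part in [p for p in path.split("/") if p]:
        # Look for a child of `current` with the requested name.
        matches = [inode_id for inode_id, (name, parent) in inodes.items()
                   if name == part and parent == current]
        if not matches:
            return None
        current = matches[0]
    return current


print(resolve("/Images/cat.png"))
```

A real metadata store would index on (parentID, name) rather than scan every row; the scan here only keeps the sketch short.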
  • 28. What is HopsFS?
        ● Drop-in replacement distribution of HDFS
        ● 16X - 37X the throughput of HDFS
        ● 37X larger clusters than HDFS
        ● 10 times lower latency
  • 30. Search in HopsFS. Query: get all images with 1 cat and 1 guitar. Full-text search is not supported by NDB.
  • 31. Polyglot Persistence - Replicating Metadata to External Systems for Efficient Querying. [Diagram labels: HopsFS, ?, Store X, Store Y, App A, App B]
  • 33. ePipe
        ● ePipe is a databus that provides replicated metadata as a service for HopsFS
        ● ePipe internally
          • creates a consistent and correctly ordered change stream for HopsFS metadata
          • and eventually delivers the change stream to consumers with low (sub-second) latency, i.e. in near real-time
  • 34. ePipe: Design Decisions
        ● Extend HopsFS with a logging table to log file system changes
        ● Leverage the NDB events API to live stream changes on the logging table to ePipe
        ● ePipe enriches the file system events with appropriate data and publishes the enriched events to the consumers
  • 35. Inodes table and logging table updated in the same Transaction to ensure Consistency/Integrity. The HopsFS Namenodes apply Create /f1, Create /f2, Delete /f2, Delete /f1 against NDB.
        Inodes table (inodeID, name, parentID): (1, /, 0), (2, f1, 1), (3, f2, 1)
        Logging table (name, operation): (f1, CREATE), (f2, CREATE), (f2, DELETE), (f1, DELETE)
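Updating the inodes table and the logging table in one transaction means a change is never visible without its log entry, or the other way round. The sketch below illustrates that pattern with Python's built-in `sqlite3` as a stand-in for NDB; the table layout and the helpers `open_store`, `create_file` and `delete_file` are assumptions made for illustration only.

```python
import sqlite3


def open_store():
    """Create a toy metadata store with an inodes table and a logging table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE inodes (inode_id INTEGER PRIMARY KEY, name TEXT, parent_id INTEGER)")
    conn.execute("CREATE TABLE log (seq INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, operation TEXT)")
    conn.execute("INSERT INTO inodes VALUES (1, '/', 0)")
    conn.commit()
    return conn


def create_file(conn, inode_id, name, parent_id=1):
    # `with conn` commits both statements together, or rolls both back on error.
    with conn:
        conn.execute("INSERT INTO inodes VALUES (?, ?, ?)", (inode_id, name, parent_id))
        conn.execute("INSERT INTO log (name, operation) VALUES (?, 'CREATE')", (name,))


def delete_file(conn, name):
    with conn:
        conn.execute("DELETE FROM inodes WHERE name = ?", (name,))
        conn.execute("INSERT INTO log (name, operation) VALUES (?, 'DELETE')", (name,))


conn = open_store()
create_file(conn, 2, "f1")
create_file(conn, 3, "f2")
delete_file(conn, "f2")
delete_file(conn, "f1")
print(conn.execute("SELECT name, operation FROM log ORDER BY seq").fetchall())
```

Because the log row is written in the same transaction as the inode change, a downstream reader of the log never sees an operation that was rolled back.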
  • 37. Ordering of Log Entries. HopsFS writes Create f1, Append f1, Create f2, Delete f1, Delete f2 to NDB, and ePipe receives them grouped into epochs (Epoch1, Epoch2, Epoch3; 100 ms).
        Order across epochs: Delete f1 after Create f1; Delete f2 after Create f2
        Order within epoch: Create f1 ?? Append f1; Create f2 ?? Delete f1; …
        Result: Inconsistencies
  • 38. NDB Ordering Properties
        ● Property 1: The epochs are totally ordered.
        ● Property 2: The changes within the same transaction happen in the same epoch.
        ● Property 3: The changes on files are ordered only if they are in different epochs, that is, no ordering is guaranteed within the same epoch.
  • 39. Strengthening NDB Ordering Properties. We introduced a version number per inode, which is incremented whenever the inode changes. Log entries now carry (operation, version): (Create f1, 1), (Append f1, 2), (Delete f1, 3), (Create f2, 1), (Delete f2, 2). Within an epoch (Epoch1, Epoch2, Epoch3), the version fixes the order of changes to the same file (Append f1 after Create f1), while changes to different files stay unordered (Create f2 ?? Delete f1).
  • 40. ePipe Ordering Properties
        ● Properties 1, 2 and 3
        ● Properties 4 and 5: The version number ensures the serializability of the changes on the same file/directory within epochs.
        ● Property 6: The order of changes for different files/directories within the same epoch doesn't matter.
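Taken together, the ordering properties suggest that a consumer can rebuild a consistent change stream by ordering on the epoch first and on the per-inode version number second. The sketch below shows this under stated assumptions: the `Event` tuple and `order_events` are hypothetical names, and the assignment of events to epochs is made up for the example rather than taken from the slides.

```python
from collections import namedtuple

# Hypothetical event shape: `epoch` comes from NDB, `version` from the per-inode counter.
Event = namedtuple("Event", "epoch inode op version")


def order_events(events):
    """Order events by epoch, then by version number within an epoch.

    Epochs are totally ordered, and versions order the changes to one inode.
    Ties between different inodes in the same epoch may fall either way,
    which is acceptable because that order does not matter.
    """
    return sorted(events, key=lambda e: (e.epoch, e.version))


received = [
    Event(2, "f1", "Delete", 3),
    Event(1, "f1", "Append", 2),
    Event(2, "f2", "Delete", 2),
    Event(1, "f1", "Create", 1),
    Event(2, "f2", "Create", 1),
]
for event in order_events(received):
    print(event)
```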
  • 45. More about ePipe
        ● Supports failure recovery thanks to the persistent logging table
          • The log entries are deleted only once the associated events are successfully replicated to the downstream consumers.
          • At-least-once delivery semantics.
        ● Pluggable architecture
          • For example, filter events based on file name or any other attribute.
        ● Not limited to HopsFS
          • Can be extended to watch other logging tables for different purposes.
  • 46. ePipe Properties
        ● A databus that provides replicated metadata as a service for HopsFS
        ● Low overhead on HopsFS
        ● Low replication lag (sub-second)
        ● High throughput
        ● Pluggable architecture
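Deleting a log entry only after the consumer has confirmed it is what yields at-least-once delivery: a lost acknowledgement causes a resend, never a loss. The following is a minimal sketch of that loop, assuming a hypothetical `deliver` callback that returns True once the downstream consumer has confirmed the event; it is not ePipe's actual code.

```python
from collections import deque


def drain(log, deliver):
    """Deliver entries in order; delete each one only after deliver() confirms it.

    Returns True if the log was fully drained, False if delivery stopped early.
    """
    while log:
        if not deliver(log[0]):
            return False   # entry stays in the log and is retried on the next call
        log.popleft()      # acknowledged, so it is now safe to delete
    return True


received = []
lost_acks = {"b"}  # simulate one acknowledgement that gets lost


def deliver(entry):
    received.append(entry)        # the consumer applies the event ...
    if entry in lost_acks:
        lost_acks.discard(entry)
        return False              # ... but the confirmation never arrives
    return True


log = deque(["a", "b", "c"])
drain(log, deliver)   # stops at "b"
drain(log, deliver)   # retries "b", then delivers "c"
print(received)
```

In this run the consumer sees "b" twice, which is why consumers of an at-least-once stream generally need to apply events idempotently.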
  • 47. What is provenance - ML Pipeline. [Diagram: ML Pipeline stages Feature engineering, Training, Serving; artifacts Raw Data, Features, Models]
  • 48. MLflow Metadata - Explicit API calls
        def train(data_path, max_depth, min_child_weight, estimators, model_name):
            X_train, X_test, y_train, y_test = build_data(..)
            # MLflow expects an SQLAlchemy-style database URI, not a JDBC URL
            mlflow.set_tracking_uri("mysql://username:password@host:3306/database")
            mlflow.set_experiment("My Experiment")
            with mlflow.start_run() as run:
                ...
                mlflow.log_param("max_depth", max_depth)
                mlflow.log_param("min_child_weight", min_child_weight)
                mlflow.log_param("estimators", estimators)
                with open("test.txt", "w") as f:
                    f.write("hello world!")
                mlflow.log_artifact("test.txt")  # log_artifact logs a single file
                ...
                model.fit(X_train, y_train) # auto-logging
                ...
                mlflow.tensorflow.log_model(model, "tensorflow-model", registered_model_name=model_name)
  • 49. Hopsworks Metadata - Implicit Metadata
        def train(data_path, max_depth, min_child_weight, estimators):
            X_train, X_test, y_train, y_test = build_data(..)
            ...
            print("hello world") # monkeypatched - prints in notebook
            ...
            model.fit(X_train, y_train) # auto-logging
            ...
            # Saves model to "hopsfs://Projects/myProj/models/.."
            hops.export_model(model, "tensorflow",..,model_name)
            ...
            # maggy makes an API call to track this dict
            return {'accuracy': accuracy, 'loss': loss, 'diagram': 'diagram.png'}

        from maggy import experiment
        experiment.lagom(train, name="My Experiment", ...)
  • 50. What is provenance - Metadata. Pipeline code records Metadata entries such as <fg_eng, raw_data, features>:
        In [ ]: add(fg_eng, raw_data, features)
                …
                add(training, features, model)
        [Diagram: Feature engineering, Training, Serving; Raw Data, Features, Models]
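Once provenance is recorded as entries of the form <activity, input, output>, lineage questions reduce to a graph walk over those entries. The sketch below assumes a hypothetical list `TRIPLES` and helper `upstream`; neither is a Hopsworks API, and the two entries simply mirror the `add(...)` calls on the slide.

```python
# Hypothetical provenance log: (activity, input artifact, output artifact).
TRIPLES = [
    ("fg_eng", "raw_data", "features"),
    ("training", "features", "model"),
]


def upstream(artifact, triples=TRIPLES):
    """Return every artifact that `artifact` was directly or transitively derived from."""
    found, frontier = set(), [artifact]
    while frontier:
        node = frontier.pop()
        for _activity, src, dst in triples:
            if dst == node and src not in found:
                found.add(src)
                frontier.append(src)
    return found


print(sorted(upstream("model")))
```

The same walk in the opposite direction (from `raw_data` forward) would answer which features and models are affected when a raw dataset changes.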
  • 51. Let the platform manage the metadata! [Diagram labels: Feature engineering, Training, Serving; Raw Data, Features, Models; Distributed File System (HopsFS); ePipe (with ML Provenance); Full Text Search (Elastic)]
  • 52. Systems Challenges - Operations
        ML Artifacts: Features, Feature Metadata, Train/Test Datasets, Models, Model Metadata. Possibly thousands of files.
        Distributed File System: generates thousands of operations.
        Change Data Capture (CDC): capture only relevant operations.
  • 53. Richer provenance information. More context for file system operations? With only the user (user: John, user: Alex) and the order of operations: are any of these operations related? With Certificates (with AppId) enabled, each FS Operation carries the user and the application (user: John, app1; user: John, app3; user: Alex, app2) as well as the order of operations.
  • 54. Richer provenance information: Additional Context
        Distributed File System: Read/Write/Create/Delete/XAttr/Metadata (issued by Application X)
        Resource Manager - Yarn (Application Context): Link input/output files via Apps
        Job Manager - Hopsworks (Job Context): Different Executions of the same Job
        Workflow Manager - Airflow (Pipeline Context): Jobs as Stages of the same Pipeline
        Resulting record: <file, op, user_id, app_id, job_id, pipeline_id>
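With the application context attached to every file system operation, input and output files can be linked simply by grouping on the application id. The following sketch assumes a hypothetical `Op` record with the six fields named on the slide and the operation names "READ" and "CREATE"; the paths and ids are invented for the example.

```python
from collections import defaultdict, namedtuple

# Hypothetical record mirroring <file, op, user_id, app_id, job_id, pipeline_id>.
Op = namedtuple("Op", "file op user_id app_id job_id pipeline_id")


def link_by_app(ops):
    """Group files into inputs (READ) and outputs (CREATE) per application."""
    links = defaultdict(lambda: {"inputs": set(), "outputs": set()})
    for o in ops:
        if o.op == "READ":
            links[o.app_id]["inputs"].add(o.file)
        elif o.op == "CREATE":
            links[o.app_id]["outputs"].add(o.file)
    return dict(links)


ops = [
    Op("/Projects/LC/Training_Datasets/ImageNet", "READ", "John", "app1", "job1", "p1"),
    Op("/Projects/LC/Models/ResNet", "CREATE", "John", "app1", "job1", "p1"),
    Op("/Projects/LC/Models/ResNet", "READ", "Alex", "app2", "job2", "p1"),
]
print(link_by_app(ops)["app1"])
```

Grouping on `job_id` or `pipeline_id` instead would relate different executions of the same job, or jobs that are stages of the same pipeline.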
  • 56. CDC API - Filtering Mechanisms. Path based filtering. Example: within a Project, /featurestore, /training_datasets, /models
  • 57. CDC API - Filtering Mechanisms. Path based filtering; Tag based filtering. Example: Custom metadata based on HDFS XAttr. Tag: <tutorial>, <debug>. Tags can enable logging of all operations, if path based filtering is not easy to set.
  • 58. CDC API - Filtering Mechanisms. Path based filtering; Tag based filtering; Coalesce FS Operations. Example: Read file1, Read file2, …, Read filen are coalesced into Access 1 Training Dataset.
  • 59. Optimization - FS Operation Coalesce
        Parent Create -> Artifact Create
        Parent Delete -> Artifact Delete
        Children Read -> Artifact Access
        Children Create/Delete/Append/Truncate -> Artifact Mutation
        Namenodes keep a cache per namenode; the Log table in NDB may contain duplicates, which ePipe removes.
        In [ ]: hops.load_training_dataset("/Projects/LC/Training_Datasets/ImageNet")
                …
                hops.save_model("/Projects/LC/Models/ResNet")
  • 60. CDC API - Filtering Mechanisms. Path based filtering; Tag based filtering; Coalesce FS Operations. Filtered Operations (Filesystem Op -> Metadata Stored):
        Create/Delete -> Artifact existence
        XAttr -> Add metadata to artifact
        Read -> Artifact used by ..
        Children Files Create/Delete -> Artifact mutation
        Append/Truncate -> Artifact mutation
        Permissions/ACL -> Artifact metadata mutation
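Path based filtering and coalescing can be combined in one pass: drop operations outside the tracked directories, map each remaining file to its artifact, and keep one entry per (artifact, operation, application). This is a minimal sketch of that idea only; the `TRACKED` prefixes, the convention that an artifact is the first directory below a tracked prefix, and the helper names are assumptions for illustration, not the Hopsworks CDC API.

```python
# Hypothetical tracked prefixes; an artifact is the first directory below one of them.
TRACKED = ("/Projects/LC/Training_Datasets/", "/Projects/LC/Models/")


def artifact_of(path):
    """Map a file path to its artifact path, or None if the path is not tracked."""
    for prefix in TRACKED:
        if path.startswith(prefix):
            name = path[len(prefix):].split("/", 1)[0]
            if name:
                return prefix + name
    return None


def coalesce(ops):
    """Drop untracked paths and collapse repeated (artifact, op, app_id) entries."""
    seen, out = set(), []
    for path, op, app_id in ops:
        artifact = artifact_of(path)
        if artifact is None:
            continue  # path based filtering
        key = (artifact, op, app_id)
        if key not in seen:  # coalescing: many child reads become one artifact access
            seen.add(key)
            out.append(key)
    return out


ops = [
    ("/Projects/LC/Training_Datasets/ImageNet/file1", "READ", "app1"),
    ("/Projects/LC/Training_Datasets/ImageNet/file2", "READ", "app1"),
    ("/tmp/scratch", "READ", "app1"),
    ("/Projects/LC/Models/ResNet/saved_model.pb", "CREATE", "app1"),
]
print(coalesce(ops))
```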
  • 61. Hopsworks ML Pipelines. Stages: (1) Data, (2) Develop/Test Feature Pipelines, (3) Develop Model, (4) Train/Validate Model, (5) Deploy/Monitor. [Diagram labels: DataOps CI/CD Platform; Feature Store (Commit-0001, Commit-0002, ..., Commit-0097); Model Training & Model Validation; MLOps CI/CD Platform; Model Repository; Model Serving & Monitoring; Metadata Store, fed by CDC events and API calls]
  • 63. Provenance + Time travel. Claim of Model Bias! Can we determine the exact features used? Hudi Feature Timeline: DeltaCommit 10/01/20@10:10:01, DeltaCommit 10/01/20@10:12:01, DeltaCommit 10/01/20@12:10:01, …, DeltaCommit 20/07/20@02:10:01. [Diagram labels: Training timestamp, Application; Feature engineering, Training, Serving; Raw Data, Features, Models]
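Answering "which features were used?" amounts to finding the last commit on the feature timeline at or before the training timestamp recorded by provenance. The sketch below does this with a binary search over sorted commit times; `commit_as_of` is a hypothetical helper rather than a Hudi or Hopsworks call, the slide's dates are read as day/month/year, and the training timestamp is invented for the example.

```python
from bisect import bisect_right
from datetime import datetime


def commit_as_of(commits, training_time):
    """Return the latest commit timestamp <= training_time, or None if there is none.

    `commits` must be sorted in ascending order.
    """
    i = bisect_right(commits, training_time)
    return commits[i - 1] if i else None


# Commit times from the timeline, read as DD/MM/YY.
timeline = [
    datetime(2020, 1, 10, 10, 10, 1),
    datetime(2020, 1, 10, 10, 12, 1),
    datetime(2020, 1, 10, 12, 10, 1),
    datetime(2020, 7, 20, 2, 10, 1),
]

# Hypothetical training timestamp taken from the provenance record of a training run.
trained_at = datetime(2020, 1, 10, 11, 0, 0)
print(commit_as_of(timeline, trained_at))
```

The returned commit identifies the snapshot of the feature data to read back when reproducing or auditing the model.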
  • 64. Summary
        ● Provenance improves understanding of complex ML Pipelines.
        ● Provenance should not change the core ML pipeline code.
        ● Provenance facilitates Debugging, Analyzing, Automating and Cleaning of ML Pipelines.
        ● Provenance and Time Travel facilitate reproducibility of experiments.
        ● In Hopsworks, we introduced a new mechanism for provenance based on embedded metadata in a scale-out consistent metadata layer.
  • 65. References
        ● Ormenisan et al, Time-travel and Provenance for ML Pipelines, USENIX OpML 2020
        ● Niazi et al, HopsFS, USENIX FAST 2017
        ● Ismail et al, ePipe, CCGrid 2019
        ● Small Files in HopsFS, ACM Middleware 2018
        ● Ismail et al, HopsFS-S3, ACM Middleware 2020
        ● Meister et al, Oblivious Training Functions, 2020
        ● Hopsworks