Metadata and Provenance for ML Pipelines with Hopsworks

Dr. Jim Dowling (1,2)
Slides together with Alexandru A. Ormenisan (1,2), Mahmoud Ismail (1,2)
PROVENANCE
FOR MACHINE LEARNING PIPELINES
KTH - Royal Institute of Technology (1)
Logical Clocks AB (2)

Growing Consensus on how to manage complexity of AI
Data validation
Distributed
ENGINEER
Model
Serving
A/B
Testing
Monitoring
Pipeline Management
HyperParameter
Tuning
Feature Engineering
Data
Collection
Hardware
Management
Data Model Prediction
φ(x)

Growing Consensus on how to manage complexity of AI
[Same diagram as the previous slide, now annotated with two labels: ML PLATFORM (TRAIN and SERVE) and FEATURE STORE]

What is provenance for ML Pipelines?
ML Pipeline
Feature
engineering
Training Serving
Raw Data Features Models

Governance
…search:
Discovery
Debug &
Analyse ?
Serving
Feature
Engineering
Training &
Validating
?
Pipeline
Automation
Integrity &
Garbage
Collection
Traceability &
Compliance
Why track provenance?

End-to-End Machine Learning (ML) Pipelines

▪ Logical Clocks – Hopsworks (world’s first/only fully open source)
▪ Uber Michelangelo
▪ Airbnb – Bighead/Zipline
▪ Comcast
▪ Twitter
▪ GO-JEK Feast
▪ Conde Nast
▪ Facebook FB Learner
▪ Netflix
▪ Reference: www.featurestore.org
Feature Stores in Production

Event Data, Raw Data
Data Lake
Data
Pipelines
BI
Platforms
SQL Data
Feature
Pipelines
Feature
Store
FEATURES FOR MODEL TRAINING
SERVE RT FEATURES TO ONLINE MODELS
FEATURES FOR ANALYTICAL MODELS (BATCH)
Feature Stores make existing Data Infrastructure available to Data Scientists and Online Apps

Click features every 10 secs
CDC data every 30 secs
User profile updates every hour
Featurized weblogs data every day
Online
Feature
Store
Offline
Feature
Store
SQL DW
S3, HDFS
SQL
Event Data
Real-Time Data
User-Entered Features (<2 secs)
Online
App
Low
Latency
Features
High
Latency
Features
Train,
Batch App
Feature Store
No existing database is both scalable (PBs) and low latency (<10ms). Hence, online + offline Feature Stores.
<10ms
TBs/PBs
Feature Pipelines update the Feature Store (2 Databases!) with data from backend Platforms

Feature Store
ClickFeatureGroup
TableFeatureGroup
UserFeatureGroup
LogsFeatureGroup
Event Data
SQL DW
S3, HDFS
SQL
DataFrameAPI
Kafka Input
Flink
RTFeatureGroup
Online
App
Train,
Batch App
User Clicks
DB Updates
User Profile Updates
Weblogs
Real-time features
Kafka Output
The FeatureGroup abstraction hides the complexity of dealing with 2 databases
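To illustrate what "hiding 2 databases" can look like, here is a minimal sketch. `ToyFeatureGroup` and its methods are hypothetical names invented for this example, not the Hopsworks API: a single `save()` call appends to an offline history and, if enabled, also updates an online key-value view.

```python
class ToyFeatureGroup:
    """Hypothetical stand-in (not the Hopsworks API): one save() feeds two stores."""

    def __init__(self, name, primary_key, online_enabled=False):
        self.name = name
        self.primary_key = primary_key
        self.online_enabled = online_enabled
        self.offline = []  # append-only history, stands in for S3/HDFS tables
        self.online = {}   # latest row per key, stands in for a low-latency DB

    def save(self, rows):
        for row in rows:
            self.offline.append(row)
            if self.online_enabled:
                key = tuple(row[k] for k in self.primary_key)
                self.online[key] = row

    def get_online(self, *key):
        """Low-latency lookup of the latest row for a primary key."""
        return self.online.get(tuple(key))


fg = ToyFeatureGroup("rain", ["date", "location_id"], online_enabled=True)
fg.save([
    {"date": "2020-01-10", "location_id": 1, "precipitation": 0.2},
    {"date": "2020-01-10", "location_id": 1, "precipitation": 0.4},
])
latest = fg.get_online("2020-01-10", 1)
```

The offline list keeps both rows for training, while the online view keeps only the most recent value per key for serving.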

Features name Pclass Sex Survive Name Balance
Train / Test
Datasets
Survive name PClass Sex Balance
Join key
Feature
Groups
Titanic
Passenger List
Passenger
Bank Account
File format
.tfrecord
.npy
.csv
.hdf5,
.petastorm,
etc
Storage
GCS
Amazon S3
HopsFS
Features, FeatureGroups, and Train/Test Datasets are all versioned
Feature Store Concepts in Hopsworks

Example Ingestion of data into a FeatureGroup
https://docs.hopsworks.ai/
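# assumes an active SparkSession `spark` and a feature store handle `fs`
from pyspark.sql import functions as F

df = spark.read.json("s3://dataset/rain.json")
# do feature engineering on your dataframe: min-max normalize 'val'
stats = df.agg(F.min("val").alias("lo"), F.max("val").alias("hi")).first()
df = df.withColumn('precipitation', (df.val - stats.lo) / (stats.hi - stats.lo))
fg = fs.create_feature_group("rain",
    version=1,
    description="Rain features",
    primary_key=['date', 'location_id'],
    online_enabled=True)
fg.save(df)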

Example Creation of Train/Test Data from a Feature Store
https://docs.hopsworks.ai/
# Join features across FeatureGroups. Use "on=[..]" to explicitly enter the JOIN key.
feature_join = (rain_fg.select_all()
    .join(temperature_fg.select_all(), on=["date", "location_id"])
    .join(location_fg.select_all()))
td = fs.create_training_dataset("training_dataset",
    version=1,
    data_format="tfrecords",
    description="Training dataset, TfRecords format",
    splits={'train': 0.7, 'test': 0.2, 'validate': 0.1})
# Save the train/test/validation files to the filesystem (S3, HDFS, etc)
td.save(feature_join)
# Use the training data as follows:
df = td.read(split="train")

Event Data, Raw Data
Feature Pipeline FEATURE STORE TRAIN/VALIDATE MODEL SERVING
MONITOR
Data Lake
The end of the End-to-End ML Pipeline!
ML Pipelines start and stop at the Feature Store

End-to-End ML Pipelines on Hopsworks. Provenance is Collecting Metadata.
HopsFS
Code and configuration
Data Lake, Warehouse, Kafka
Feature
Store
Model
registry
Prediction Logs
Monitoring Logs
Feature
Engineering
Serving on
Kubernetes
Model
Training
Model
Deploy
Serving and
Monitoring
Experiments/
Development Scaleout
Metadata
Features
Validate
Deploy to
Log
Artifact (File)
Artifact
Metadata
Elasticsearch
Sync

Metadata is data that describes other data.
What is Metadata?

Artifacts and Metadata in End-to-End ML Pipelines
File System (S3, HopsFS, etc)
Metastore (Database)
Provenance queries
● SQL or Free-Text or Graph?
● Update Throughput?
● Latency of queries?
● Size of Metadata?

https://www.dataplatformschool.com/blog/w0y8g0-the-data-governance-zoo
Metadata Cataloging Systems - a whole industry

3 Mechanisms for Metadata Collection. Polyglot Metadata Storage for Efficient Querying.
File Systems, Databases, Data Warehouses, Message Bus, etc
Crawler
Job
Pull
(REST) API
Push
Change Data Capture (CDC) API
Application
Instrumented API
Job
Graph DB Search (Elastic)
Metadata Query API

Metastore
Consistency issues
Synchronization
?

Metadata is data that describes other data.
Unspoken Assumption:
Why are Data and Metadata always separate stores?
What is Metadata Revisited?

Raw Data Features
Experiments
(Progs, Logs,
Checkpoints)
Models
Artifacts
Metadata
Experiments
(HParams, Env,
Results, Graphs)
Feature Stats
(Min,Max,Std,
Mean, Distrib.)
Governance
(Privileges,Audit,
Retention, etc)
Model Desc
(Privileges, Perf,
Provenance, etc)

Mechanism 4: Artifacts and Metadata in the same system - a Unified Metadata Layer (Hopsworks)
Features
Experiments
(Progs, Logs,
Checkpoints)
Models
Artifacts
Metadata
HopsFS
Metastore (NDB)
Experiments
(HParams, Env,
Results, Graphs)
Feature Stats
(Min,Max,Std,
Mean, Distrib.)
Governance
(Privileges,Audit,
Retention, etc)
Model Desc
(Privileges, Perf,
Provenance, etc)
Metastore (NDB)
Raw Data
extends extends extends extends

Libraries
Application
Data
platform
Metadata
store
Explicit
Top-down tracking of provenance.
Push/Pull, CDC, or instrumented
application or library code.
Standalone Metadata Store.
Implicit
Bottom-up tracking of provenance.
Requires redesigning the platform.
Conventions link files to artifacts.
Metadata is strongly consistent
with storage platform.
Mechanism 4: Implicit Provenance

ePipe: Near Real-Time Polyglot Persistence of HopsFS Metadata
Mahmoud Ismail (1), Mikael Ronström (2), Seif Haridi (1), Jim Dowling (1)
KTH - Royal Institute of Technology (1)
Oracle (2)
19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (IEEE/ACM CCGrid 2019), May 15th
Tightly coupled Metadata and Data - replicating Metadata to External Systems
HopsFS
Scaleout
Metadata
Artifact (File)
Artifact
Metadata
Elasticsearch
Sync

DN1 DN2 DN3 DN4 DN5
● Highly scalable next-generation distribution of HDFS
mkdir /Images
write /Images/cat.png
Images
cat.png
/
DN1, DN3, DN5
What is HopsFS?

inodeID name parentID
Block
storage
(Datanodes)
Metadata
storage
NDB
mkdir /Images
1 / 0
2 Images 1
3 cat.png 2
write /Images/cat.png
What is HopsFS?

● Drop-in replacement distribution of HDFS
● 16X - 37X the throughput of HDFS
● 37X larger clusters than HDFS
● 10 times lower latency
What is HopsFS?


HopsFS
Get all images with 1 cat and 1 guitar
1 cat and 1 guitar
Full-text search is not supported by NDB
Search in HopsFS

Store
X
Store
Y
HopsFS
App A
App B
?
Polyglot Persistence - Replicating Metadata to External Systems for Efficient Querying


● ePipe is a databus that provides replicated metadata as a service for HopsFS
● ePipe internally
• creates a consistent and correctly ordered change stream for HopsFS metadata
• and eventually delivers the change stream to consumers with low latency (sub-second, near real-time)
ePipe

● Extend HopsFS with a logging table to log file system changes
● Leverage the NDB events API to live stream changes on the logging table to ePipe
● ePipe enriches the file system events with appropriate data and publishes the enriched events to the consumers
ePipe: Design Decisions

File system operations: Create /f1, Create /f2, Delete /f2, Delete /f1
NDB tables, as written by the HopsFS Namenodes:

Inodes table
inodeID  name  parentID
1        /     0
2        f1    1
3        f2    1

Logging table
name  operation
f1    CREATE
f2    CREATE
f2    DELETE
f1    DELETE

Inodes table and logging table updated in the same Transaction to ensure Consistency/Integrity
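A minimal sketch of the same-transaction idea, using SQLite in place of NDB purely for illustration. The table layout and the helpers `create_file` and `delete_file` are assumptions made for this example, not HopsFS code: each file system operation updates the inodes table and appends to the log in one transaction, so the log cannot diverge from the namespace.

```python
import sqlite3

# SQLite stands in for NDB here; schema and helper names are illustrative only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE inodes (inode_id INTEGER PRIMARY KEY, name TEXT, parent_id INTEGER)")
db.execute("CREATE TABLE fs_log (seq INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, operation TEXT)")
db.execute("INSERT INTO inodes VALUES (1, '/', 0)")
db.commit()


def create_file(db, inode_id, name, parent_id):
    # 'with db' commits both statements together, or rolls both back on error.
    with db:
        db.execute("INSERT INTO inodes VALUES (?, ?, ?)", (inode_id, name, parent_id))
        db.execute("INSERT INTO fs_log (name, operation) VALUES (?, 'CREATE')", (name,))


def delete_file(db, name):
    with db:
        db.execute("DELETE FROM inodes WHERE name = ?", (name,))
        db.execute("INSERT INTO fs_log (name, operation) VALUES (?, 'DELETE')", (name,))


create_file(db, 2, "f1", 1)
create_file(db, 3, "f2", 1)
delete_file(db, "f2")
delete_file(db, "f1")
log = db.execute("SELECT name, operation FROM fs_log ORDER BY seq").fetchall()
```

If the inode update fails, no log row is written, so downstream consumers never see an event for a change that did not happen.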

HopsFS
NDB
NDB
log fs
changes
ePipe
Change
stream
Store
X
Store
Y
App A
App B
enrichment
subscribe
for changes
ePipe

HopsFS
NDB
NDB
ePipe
Create f1
Append f1
Create f2
Delete f1
Delete f2
Create f1
Append f1
Create f2
Delete f1
Delete f2
Epoch1
Epoch2
Epoch3
Order across epochs: Delete f1 after Create f1; Delete f2 after Create f2
Order within epoch: Create f1 ?? Append f1; Create f2 ?? Delete f1
…..
Inconsistencies
100 ms
Ordering of Log Entries

● Property 1: The epochs are totally ordered.
● Property 2: The changes within the same transaction happen in the same epoch.
● Property 3: The changes on files are ordered only if they are in different epochs, that is, no ordering is guaranteed within the same epoch.
NDB Ordering Properties

HopsFS
NDB
NDB
ePipe
Epoch1
Epoch2
Epoch3
Delete f2 ,2
Create f1 ,1
Append f1 ,2
Create f2 ,1
Delete f1 ,3
Delete f2 ,2
Create f1 ,1
Append f1 ,2
Create f2 ,1
Delete f1 ,3
We introduced a version number per inode
which we will increment whenever
a change occurs to an inode.
Append f1 after Create f1
Create f2 ?? Delete f1
Strengthening NDB Ordering Properties

● Property 1 & 2 & 3
● Property 4 & 5: The version number ensures the serializability of the changes on the same file/directory within epochs.
● Property 6: The order of changes for different files/directories within the same epoch doesn't matter.
ePipe ordering Properties
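The effect of the per-inode version number can be sketched as follows. The `Change` record and `order_changes` function are hypothetical, and the assignment of operations to epochs is made up for illustration: sorting by (epoch, version) recovers a serial order for each file even when delivery within an epoch is arbitrary.

```python
from collections import defaultdict
from typing import NamedTuple


class Change(NamedTuple):
    """Hypothetical log entry: the epoch comes from NDB, the version from the inode."""
    epoch: int
    inode: str
    version: int
    op: str


def order_changes(changes):
    """Return, per inode, its operations ordered by (epoch, version)."""
    per_inode = defaultdict(list)
    for change in sorted(changes, key=lambda c: (c.epoch, c.version)):
        per_inode[change.inode].append(change.op)
    return dict(per_inode)


# Entries arrive in arbitrary order within each epoch (epoch numbers are assumed).
received = [
    Change(1, "f1", 2, "Append"),
    Change(1, "f2", 1, "Create"),
    Change(1, "f1", 1, "Create"),
    Change(2, "f1", 3, "Delete"),
    Change(2, "f2", 2, "Delete"),
]
ordered = order_changes(received)
```

No order is imposed between f1 and f2 inside the same epoch, which matches Property 6.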

[Plot: Notifications Throughput (log scale, base 10)]

[Plot: Latency: average Lag Time (log scale, base 10)]

● Supports failure recovery thanks to the persistent logging table
• The log entries are deleted only once the associated events are successfully replicated to the downstream consumers.
• At-least-once delivery semantics.
● Pluggable architecture
• For example, filter events based on file name or any other attribute.
● Not limited to HopsFS
• Can be extended to watch other logging tables for different purposes.
More about ePipe
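A small sketch of why "delete only after successful replication" yields at-least-once delivery. `LoggingTable`, `replicate` and `FlakyConsumer` are hypothetical names for this illustration, not ePipe code: a failed delivery leaves the entry in the table, so it is retried, and the consumer must tolerate duplicates.

```python
class LoggingTable:
    """In-memory stand-in for the persistent logging table (illustrative only)."""

    def __init__(self, entries):
        self.entries = dict(entries)  # seq -> event

    def pending(self):
        return sorted(self.entries.items())

    def delete(self, seq):
        self.entries.pop(seq, None)


def replicate(table, deliver):
    """Deliver pending entries in order; delete each one only after it succeeds."""
    for seq, event in table.pending():
        try:
            deliver(seq, event)
        except ConnectionError:
            break  # keep this entry and all later ones for the next attempt
        table.delete(seq)


class FlakyConsumer:
    """Idempotent consumer whose first acknowledgement is lost."""

    def __init__(self):
        self.store = {}
        self.calls = 0

    def __call__(self, seq, event):
        self.calls += 1
        self.store[seq] = event  # keyed by seq, so a redelivery is harmless
        if self.calls == 1:
            raise ConnectionError("ack lost")


table = LoggingTable({1: "CREATE f1", 2: "CREATE f2"})
consumer = FlakyConsumer()
replicate(table, consumer)  # entry 1 is stored downstream but stays in the table
replicate(table, consumer)  # entry 1 is delivered again, then entry 2
```

Entry 1 reaches the consumer twice, which is exactly what at-least-once permits.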

● A databus that provides replicated metadata as a service for HopsFS
● Low overhead on HopsFS
● Low replication lag (sub-second)
● High throughput
● Pluggable architecture
ePipe Properties

What is provenance - ML Pipeline
ML Pipeline
Feature
engineering
Training Serving

MLFlow Metadata - Explicit API calls
import mlflow

def train(data_path, max_depth, min_child_weight, estimators, model_name):
    X_train, X_test, y_train, y_test = build_data(..)
    # a database-backed tracking store takes a SQLAlchemy-style URI
    mlflow.set_tracking_uri("mysql://username:password@host:3306/database")
    mlflow.set_experiment("My Experiment")
    with mlflow.start_run() as run:
        ...
        mlflow.log_param("max_depth", max_depth)
        mlflow.log_param("min_child_weight", min_child_weight)
        mlflow.log_param("estimators", estimators)
        with open("test.txt", "w") as f:
            f.write("hello world!")
        # log_artifact logs a single file (log_artifacts takes a directory)
        mlflow.log_artifact("test.txt")
        ...
        model.fit(X_train, y_train) # auto-logging
        ...
        mlflow.tensorflow.log_model(model, "tensorflow-model",
                                    registered_model_name=model_name)

Hopsworks Metadata - Implicit Metadata
def train(data_path, max_depth, min_child_weight, estimators):
    X_train, X_test, y_train, y_test = build_data(..)
    ...
    print("hello world") # monkeypatched - prints in notebook
    ...
    model.fit(X_train, y_train) # auto-logging
    ...
    # Saves model to "hopsfs://Projects/myProj/models/.."
    hops.export_model(model, "tensorflow",..,model_name)
    ...
    # maggy makes an API call to track this dict
    return {'accuracy': accuracy, 'loss': loss, 'diagram': 'diagram.png'}

from maggy import experiment
experiment.lagom(train, name="My Experiment", ...)

Metadata
In [ ]:
add(fg_eng, raw_data, features)
…
add(training, features, model)
<fg_eng, raw_data, features>
Pipeline code
What is provenance - Metadata
Feature
engineering
Training Serving
Raw
Data
Features Models
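The `add(activity, input, output)` triples on this slide can be read as edges of a provenance graph. A minimal sketch, where `lineage` is a hypothetical helper not shown in the slides and the graph is assumed to be acyclic:

```python
provenance = []  # list of (activity, input, output) triples


def add(activity, input_artifact, output_artifact):
    """Record that an activity consumed one artifact and produced another."""
    provenance.append((activity, input_artifact, output_artifact))


def lineage(artifact):
    """Walk backwards from an artifact to every step it was derived from."""
    steps = []
    frontier = [artifact]
    while frontier:
        current = frontier.pop()
        for activity, src, dst in provenance:
            if dst == current:
                steps.append((activity, src, dst))
                frontier.append(src)
    return steps


add("fg_eng", "raw_data", "features")
add("training", "features", "model")
model_lineage = lineage("model")
```

Here `model_lineage` lists the training step first and the feature engineering step second, which answers "which raw data ended up in this model?".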

ePipe (with ML Provenance)
Distributed File System (HopsFS)
Full Text Search (Elastic)
Feature
engineering
Training Serving
Raw
Data
Features Models
Let the platform manage the metadata!

ML Artifacts
Features, Feature Metadata,
Train/Test Datasets
Models, Model Metadata
Possibly thousands of files
Distributed File System
Generate thousands of operations
Change Data Capture (CDC)
Capture only relevant operations
Systems Challenges - Operations

More context for file system operations?
user: John user: Alex
Are any of these operations related?
user: John,
app1
user: John,
app3
user: Alex,
app2
Certificates (with AppId) enabled FS Operation
Order of operations
Order of operations
Richer provenance information

Distributed File System
Read/Write/Create/Delete/XAttr/Metadata
Resource Manager - Yarn (Application Context)
Application X
Job Manager - Hopsworks (Job Context)
Workflow Manager - Airflow (Pipeline Context)
Link input/output files via Apps
Different Executions of the same Job
Jobs as Stages of the same Pipeline
Additional Context
Richer provenance information
<file, op, user_id, app_id, job_id, pipeline_id>
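One way to picture the enriched record and the "link input/output files via Apps" idea is the sketch below. The `FsOp` dataclass mirrors the tuple on the slide; `link_by_app`, the operation names and the example paths are assumptions for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FsOp:
    """One enriched file system operation, mirroring the tuple on the slide."""
    file: str
    op: str
    user_id: str
    app_id: str
    job_id: str
    pipeline_id: str


def link_by_app(ops):
    """Map app_id -> (files read, files written) to relate inputs to outputs."""
    links = defaultdict(lambda: (set(), set()))
    for o in ops:
        inputs, outputs = links[o.app_id]
        if o.op == "READ":
            inputs.add(o.file)
        elif o.op in {"CREATE", "APPEND"}:
            outputs.add(o.file)
    return dict(links)


# Hypothetical operations; paths and ids are made up for the example.
ops = [
    FsOp("/Projects/LC/Training_Datasets/ImageNet", "READ", "john", "app1", "job1", "pipe1"),
    FsOp("/Projects/LC/Models/ResNet", "CREATE", "john", "app1", "job1", "pipe1"),
    FsOp("/Projects/LC/Logs/out.txt", "CREATE", "alex", "app2", "job2", "pipe1"),
]
links = link_by_app(ops)
```

Because app1 both read the training dataset and created the model, the two artifacts can be related without any explicit API call from user code.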

Hopsworks Conventions
/training_datasets
/models
/logs
/notebooks
/featurestore
CDC API - Filtering Mechanisms

/training_datasets
/models
/featurestore
Project
Example
Path based filtering
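Path-based filtering could look roughly like this. `is_tracked`, the `/Projects/<project>/<dir>/...` layout and the case-insensitive match are assumptions for this sketch, not the actual CDC API: only operations under the conventional artifact directories are kept.

```python
from pathlib import PurePosixPath

# Directory names taken from the slide; the matching rule is an assumption.
TRACKED_DIRS = {"training_datasets", "models", "featurestore"}


def is_tracked(path, project):
    """True if path lies under a tracked top-level directory of the project."""
    parts = PurePosixPath(path).parts
    # Assumed layout: /Projects/<project>/<dir>/...
    return (
        len(parts) > 3
        and parts[1] == "Projects"
        and parts[2] == project
        and parts[3].lower() in TRACKED_DIRS
    )


kept = [
    p for p in (
        "/Projects/LC/Models/ResNet",
        "/Projects/LC/Logs/run.txt",
        "/Projects/LC/Training_Datasets/ImageNet",
    )
    if is_tracked(p, project="LC")
]
```

Operations on `/logs` or `/notebooks` are dropped before they reach the logging table, which keeps the volume of captured events down.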

Tag based filtering
Example:
Custom metadata based on HDFS XAttr.
Tag: <tutorial>, <debug>
Tags can enable logging of all operations, if path based filtering is not easy to set

Coalesce FS Operations
Example:
Read file1
Read file2
…
Read filen
Access1
Training Dataset

Parent Create Artifact Create
Parent Delete Artifact Delete
Children Read Artifact Access
Children
Create/Delete/
Append/Truncate
Artifact Mutation
Namenodes
NDB
ePipe
Cache
per namenode
Log table
With duplicates
Remove duplicates
In [ ]:
hops.load_training_dataset(
    "/Projects/LC/Training_Datasets/ImageNet")
…
hops.save_model("/Projects/LC/Models/ResNet")
Optimization - FS Operation Coalesce
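The coalescing rules on this slide (parent create/delete, children read, children mutation) can be sketched as below. The `coalesce` function and the tuple layout are hypothetical: many file-level operations by the same application collapse into one artifact-level event per kind.

```python
# Mapping from the slide: operations on children vs. on the artifact's own directory.
CHILD_OPS = {"READ": "ACCESS", "CREATE": "MUTATION", "DELETE": "MUTATION",
             "APPEND": "MUTATION", "TRUNCATE": "MUTATION"}
PARENT_OPS = {"CREATE": "CREATE", "DELETE": "DELETE"}


def coalesce(ops, artifact_root):
    """Collapse (app_id, op, path) file operations into artifact-level events."""
    seen, events = set(), []
    for app_id, op, path in ops:
        if path == artifact_root:
            kind = PARENT_OPS.get(op)
        elif path.startswith(artifact_root + "/"):
            kind = CHILD_OPS.get(op)
        else:
            kind = None  # outside the artifact: not part of this coalescing
        key = (app_id, kind)
        if kind is not None and key not in seen:
            seen.add(key)  # duplicates of the same event are dropped
            events.append((app_id, artifact_root, kind))
    return events


root = "/Projects/LC/Training_Datasets/ImageNet"
ops = [("app1", "READ", f"{root}/file{i}") for i in range(1, 4)]
ops.append(("app1", "APPEND", f"{root}/file1"))
ops.append(("app2", "READ", "/Projects/LC/Logs/out.txt"))
events = coalesce(ops, root)
```

Three reads become a single ACCESS event for the training dataset, and the append becomes a single MUTATION event.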

Coalesce FS Operations
Filtered Operations
Filesystem Op Metadata Stored
Create/Delete Artifact existence
XAttr Add metadata to artifact
Read Artifact used by ..
Children Files
Create/Delete
Artifact mutation
Append/Truncate Artifact mutation
Permissions/ACL Artifact metadata mutation

DataOps
CI/CD Platform
Feature Store
...
Commit-0002
Commit-0001
Commit-0097
Model Training &
Model Validation
MLOps
CI/CD Platform
Model Repository
Model Serving
& Monitoring
Data
Data (1)
Develop/Test Feature Pipelines (2)
Develop Model (3)
Train/Validate Model (4)
Deploy/Monitor (5)
Hopsworks ML Pipelines
Metadata Store
CDC events CDC events
CDC events
API calls
CDC events
API calls

Bias Detected!
What do I do?
Provenance example

DeltaCommit
10/01/20@10:10:01
DeltaCommit
10/01/20@10:12:01
DeltaCommit
10/01/20@12:10:01
… …
DeltaCommit
20/07/20@02:10:01
Hudi Feature Timeline
Claim of Model Bias!
Can we determine the exact features used?
Provenance + Time travel
Feature
engineering
Training Serving
Training
timestamp
Application
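Combining the training timestamp from provenance with the commit timeline amounts to an "as of" lookup. A minimal sketch using the commit times shown on the slide (read here as dd/mm/yy); the training timestamp and the `commit_as_of` helper are hypothetical, and a real system would query the Hudi timeline instead:

```python
from bisect import bisect_right
from datetime import datetime

FMT = "%d/%m/%y@%H:%M:%S"  # assumed day/month/year reading of the slide's timestamps
commits = [datetime.strptime(s, FMT) for s in (
    "10/01/20@10:10:01",
    "10/01/20@10:12:01",
    "10/01/20@12:10:01",
    "20/07/20@02:10:01",
)]


def commit_as_of(commits, training_time):
    """Latest commit at or before training_time, or None if there is none."""
    i = bisect_right(commits, training_time)
    return commits[i - 1] if i else None


# Hypothetical training timestamp recovered from provenance metadata.
used = commit_as_of(commits, datetime(2020, 1, 10, 11, 0, 0))
```

With the commit identified, the feature data can be read as of that commit to reproduce exactly what the model was trained on.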

● Provenance improves understanding of complex ML Pipelines.
● Provenance should not change the core ML pipeline code.
● Provenance facilitates Debugging, Analyzing, Automating and Cleaning of ML Pipelines.
● Provenance and Time Travel facilitate reproducibility of experiments.
● In Hopsworks, we introduced a new mechanism for provenance based on embedded metadata in a scale-out consistent metadata layer.
Summary

● Ormenisan et al, Time-travel and Provenance for ML Pipelines, USENIX OpML 2020
● Niazi et al, HopsFS, USENIX FAST 2017
● Ismail et al, ePipe, CCGrid 2019
● Small Files in HopsFS, ACM Middleware 2018
● Ismail et al, HopsFS-S3, ACM Middleware 2020
● Meister et al, Oblivious Training Functions, 2020
● Hopsworks
References

@hopsworks
http://github.com/logicalclocks/hopsworks
