ML Platform Q1 Meetup: End-to-end Feature Analysis, Validation and Transformation in TFX

End-to-end feature analysis, validation, and transformation in TFX
Alkis (npolyzotis@google.com)
Ananth (ananthr@google.com)

TFX: A TensorFlow-Based Production-Scale Machine Learning Platform. KDD (2017).
https://youtu.be/fPTwLVCq00U

Focus of this paper
Figure 1: High-level component overview of a machine learning platform.
● Components: Data Ingestion, Data Analysis, Data Transformation, Data Validation, Trainer, Tuner, Model Evaluation and Validation, Serving, Logging
● Pipeline Storage
● Shared Utilities for Garbage Collection, Data Access Controls
● Shared Configuration Framework and Job Orchestration
● Integrated Frontend for Job Management, Monitoring, Debugging, Data/Model/Evaluation Visualization

Focus of this talk
[Figure: component overview: Data Ingestion, Data Analysis, Data Transformation, Data Validation, Trainer, Tuner, Model Evaluation and Validation, Serving, Logging, Pipeline Storage]
“How do I connect my data to training/serving?”
“What is the shape of my data?”
“How do I derive more signals from the raw data?”
“Any errors in the data?”
Goals
Provide turn-key functionality for a variety of use cases
Codify and enforce end-to-end best practices for ML data

Data Ingestion, Analysis, and Validation

Data Ingestion
Data Ingestion | Data Analysis | Data Validation | Schema Validation | Model-driven Validation | Skew Detection
Problem: Diverse data storage systems with different formats
Solution: Data ingestion normalizes data to a standard representation (standardized format, location, GC policy, etc.) for TFX components
When needed, enforces consistent data handling between training and serving

Data Analysis
Problem: Gaining understanding of TB of data with O(1000s) of features is non-trivial
Solution: Scalable data analysis and visualization tools
Google Research Blog: Facets: An Open Source Visualization Tool for Machine Learning Training Data

Data Validation
Problem: Finding errors in TB of data with O(1000s) of features is challenging
● ML data formats have limited semantics
● Not all anomalies are important
● Data errors must be explainable
E.g., “Data distribution changed” vs “Default value for feature lang is too frequent”
Data Management Challenges in Production Machine Learning, tutorial at SIGMOD’17

Schema Example
feature {
  name: 'event'
  presence: REQUIRED
  valency: SINGLE
  type: BYTES
  domain {
    value: 'CLICK'
    value: 'CONVERSION'
  }
}
event is a required feature that takes exactly one bytes value in {“CLICK”, “CONVERSION”}.
Also in the schema:
● Context (training vs serving) where feature appears
● Constraints on value distribution
● + many more ML-related constraints
Schema life cycle:
● TFX infers initial schema by analyzing the data
● TFX proposes changes as the data evolves
● User curates proposed changes
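The first step of the schema life cycle, inferring an initial schema from the data, can be sketched in plain Python. This is only an illustration under stated assumptions and not the TFX implementation: `infer_schema`, the dict-of-lists example format, and the `max_domain_size` cutoff are hypothetical names and choices made for this sketch.

```python
from collections import defaultdict


def infer_schema(examples, max_domain_size=10):
    """Infer a toy schema from a list of examples.

    Each example is a dict mapping feature name -> list of values.
    Hypothetical format; a real schema carries many more constraints.
    """
    seen = defaultdict(lambda: {"count": 0, "lengths": set(),
                                "types": set(), "values": set()})
    for example in examples:
        for name, values in example.items():
            stats = seen[name]
            stats["count"] += 1
            stats["lengths"].add(len(values))
            for v in values:
                stats["types"].add(type(v).__name__)
                stats["values"].add(v)

    schema = {}
    for name, stats in seen.items():
        feature = {
            # REQUIRED if the feature appeared in every example.
            "presence": "REQUIRED" if stats["count"] == len(examples) else "OPTIONAL",
            # SINGLE if every occurrence carried exactly one value.
            "valency": "SINGLE" if stats["lengths"] == {1} else "REPEATED",
            "type": sorted(stats["types"]),
        }
        # Only string features with few distinct values get a domain.
        if stats["types"] == {"str"} and len(stats["values"]) <= max_domain_size:
            feature["domain"] = sorted(stats["values"])
        schema[name] = feature
    return schema


examples = [
    {"event": ["CLICK"], "num_impressions": [3]},
    {"event": ["CONVERSION"], "num_impressions": [7]},
    {"event": ["CLICK"]},
]
schema = infer_schema(examples)
print(schema["event"])
print(schema["num_impressions"])
```

In this toy data `event` appears in every example with one string value, so it is inferred as required, single-valued, with a two-value domain. A user would then curate that proposal, as the life cycle above describes.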

Schema Validation
Schema:
feature {
  name: 'event'
  presence: REQUIRED
  valency: SINGLE
  type: BYTES
  domain {
    value: 'CLICK'
  }
}
feature {
  name: 'num_impressions'
  type: INT
}
Training Example:
feature {
  name: 'event'
  value: 'IMPRESSION'
}
feature {
  name: 'num_impressions'
  value: 0.64
}
TFX Data Validation:
'event': unexpected value
Fix: update domain
feature {
  name: 'event'
  presence: REQUIRED
  valency: SINGLE
  type: BYTES
  domain {
    value: 'CLICK'
+   value: 'IMPRESSION'
  }
}
'num_impressions': wrong type
Fix: deprecate feature
feature {
  name: 'num_impressions'
  type: INT
+ deprecated: true
}
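The check illustrated above, comparing an example against the schema and pairing each anomaly with an explainable fix, might look roughly like the following. This is a minimal sketch: `validate_example` and the dictionary layout of the schema are hypothetical, not the TFX data model.

```python
def validate_example(example, schema):
    """Return a list of (feature, problem, suggested_fix) tuples.

    `schema` maps feature name -> {"type": python type, "domain": set or absent}.
    Hypothetical structures for illustration only.
    """
    anomalies = []
    for name, spec in schema.items():
        if name not in example:
            continue  # presence checks are left out of this sketch
        value = example[name]
        if not isinstance(value, spec["type"]):
            anomalies.append((name, "wrong type", "deprecate feature"))
            continue
        domain = spec.get("domain")
        if domain is not None and value not in domain:
            anomalies.append(
                (name, "unexpected value", f"update domain: add {value!r}"))
    return anomalies


schema = {
    "event": {"type": str, "domain": {"CLICK"}},
    "num_impressions": {"type": int},
}
example = {"event": "IMPRESSION", "num_impressions": 0.64}

anomalies = validate_example(example, schema)
for feature, problem, fix in anomalies:
    print(f"{feature!r}: {problem}. Fix: {fix}")
```

Each anomaly names the feature and a concrete remedy, in the spirit of the "data errors must be explainable" requirement.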

Model-driven Validation
Schema:
feature {
  name: 'event'
  presence: REQUIRED
  valency: SINGLE
  type: BYTES
  domain {
    value: 'CLICK'
  }
}
feature {
  name: 'num_impressions'
  type: INT
}
Data Generator → Synthetic Example:
feature {
  name: 'event'
}
feature {
  name: 'num_impressions'
  value: [0 1 -1 9999999999]
}
TF Training:
10 ...
11 i = tf.log(num_impressions)
12 ...
Line 11: invalid argument for tf.log
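The idea of model-driven validation, generating synthetic examples from the schema and running them through the training code to expose unsafe assumptions, can be sketched as follows. All names here are hypothetical, and `math.log` is only a stand-in for the `tf.log` call on the slide.

```python
import math


def generate_synthetic_values(feature_type):
    """Yield boundary values for a feature type (hypothetical generator)."""
    if feature_type == "INT":
        # Same probe values as the synthetic example above.
        yield from (0, 1, -1, 9999999999)
    else:
        raise ValueError(f"unsupported type: {feature_type}")


def training_step(num_impressions):
    """Stand-in for user training code that takes a log of the feature."""
    return math.log(num_impressions)


def find_failing_inputs(fn, feature_type):
    """Run fn on synthetic values and collect the ones that raise."""
    failures = []
    for value in generate_synthetic_values(feature_type):
        try:
            fn(value)
        except ValueError as err:
            failures.append((value, str(err)))
    return failures


failures = find_failing_inputs(training_step, "INT")
for value, message in failures:
    print(f"num_impressions={value}: {message}")
```

The schema only says the feature is an integer, while the training code silently assumes it is positive. The synthetic values 0 and -1 surface that mismatch before real data does.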

Skew Detection
Is training data on day N “similar” to day N-1?
Is training data “similar” to serving data?
Dataset “similarity” checks:
● Do the datasets conform to the same schema?
● Are the distributions similar?
● Are features exactly the same for the same examples?
Skew problems are common in production and usually easy to fix once detected
⇒ Greatest bang for the buck for data validation
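Two of the similarity checks above can be sketched in a few lines of Python. The slides do not say which distance is used to compare distributions; the L-infinity distance between relative frequencies below is an assumption made for this sketch, and `linf_distance`, `feature_skew`, and the `id` join key are hypothetical names.

```python
from collections import Counter


def normalized_counts(values):
    """Relative frequency of each distinct value."""
    counts = Counter(values)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}


def linf_distance(values_a, values_b):
    """Largest absolute difference in relative frequency of any value."""
    p, q = normalized_counts(values_a), normalized_counts(values_b)
    return max((abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in p.keys() | q.keys()),
               default=0.0)


def feature_skew(training_rows, serving_rows, key):
    """Compare features of the same examples, joined on `key`.

    Returns {example key: sorted names of features that differ}.
    """
    serving_by_key = {row[key]: row for row in serving_rows}
    skewed = {}
    for row in training_rows:
        other = serving_by_key.get(row[key])
        if other is None:
            continue
        diff = sorted(name for name in row.keys() | other.keys()
                      if row.get(name) != other.get(name))
        if diff:
            skewed[row[key]] = diff
    return skewed


# Distribution check: is 'lang' distributed similarly in both datasets?
train_lang = ["en"] * 6 + ["de"] * 4
serve_lang = ["en"] * 9 + ["de"] * 1
distance = linf_distance(train_lang, serve_lang)
print(f"L-infinity distance for 'lang': {distance:.2f}")

# Per-example check: are features identical for the same examples?
training = [{"id": 1, "lang": "en", "clicks": 3}, {"id": 2, "lang": "de", "clicks": 0}]
serving = [{"id": 1, "lang": "en", "clicks": 3}, {"id": 2, "lang": "", "clicks": 0}]
skew = feature_skew(training, serving, key="id")
print(skew)
```

In practice the distance would be compared against a per-feature threshold, and the per-example check would run on a joined sample of training and serving logs.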

[Figure: recommender system loop: Items (Item 1, Item 2, Item 3, ...), User, User Actions, Logs, Learner, Model, Recommender System]

+2% App install rate by fixing training-serving feature skew.

Data Ingestion, Analysis, and Validation in TFX
/ Treat ML data as assets on par with source code and infrastructure
/ Develop processes for testing, monitoring, cataloguing, tracking, …, ML data
/ Consider the end-to-end story from training to serving and back
/ Explore the research problems in the intersection of ML and DB


Motivation: Training/Serving Skew

During training: data → batch processing
During serving: request → live processing

● Need to keep batch and live processing in sync.
● All other tooling (e.g. evaluation) must also be kept in sync with
batch processing.

● Do everything in the training graph.
● Do everything in the training graph + using statistics/vocabs
generated from raw data.

During training: data → tf.Transform batch processing
During serving: request → transform as tf.Graph

● “Analyze” is like scikit-learn “fit”
○ Takes a user-defined pipeline and training data.
○ Produces a TF graph.
● “Transform” is like scikit-learn “transform”
○ Takes the graph produced by “Analyze” and applies it, in a Beam
Map, to the data.
○ “Transform” materializes the transformed data.
● The same Transform TF graph can be used in training and serving.
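The fit/transform analogy above can be sketched in plain Python. This is not tf.Transform code: the user-defined pipeline is hard-coded here as a z-score normalization, and `analyze` and `transform` are hypothetical stand-ins that return and apply an ordinary Python function instead of a TF graph.

```python
import math


def analyze(training_data):
    """Full pass over the training data (like scikit-learn "fit").

    Returns a per-instance function with the statistics baked in as constants.
    """
    n = len(training_data)
    mean = sum(training_data) / n
    stddev = math.sqrt(sum((x - mean) ** 2 for x in training_data) / n)

    def transform_fn(x):
        # Depends only on x and the captured constants.
        return (x - mean) / stddev

    return transform_fn


def transform(transform_fn, data):
    """Apply transform_fn to each instance independently (a map)."""
    return [transform_fn(x) for x in data]


training_data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
transform_fn = analyze(training_data)

# Batch use before training: materialize the transformed data.
transformed_training = transform(transform_fn, training_data)
# Serving use: the same function applied to a single request.
served = transform_fn(6.0)
print(transformed_training)
print(served)
```

Because training and serving call the same `transform_fn`, there is no second implementation that could drift out of sync.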

● tf.Transform works by limiting transformations to those with a serving
equivalent.
○ Similar to scikit-learn analyzers (fit + transform).
○ The serving graph must operate independently on each instance.
○ The serving graph must also be expressible as a TF graph.
● The analysis is not so limited.

[Figure: data → tf.Transform Analyze → saved for use at inference; data → tf.Transform Transform → processed data → trainer]

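Defining a preprocessing function in TFX
def preprocessing_fn(inputs):
    x = inputs['X']
    # y and z are defined in the lines elided on the slide,
    # presumably from inputs['Y'] and inputs['Z'].
    ...
    return {
        # normalize and bucketize rely on full-pass statistics
        # (mean, stddev, quantiles).
        "A": tft.bucketize(
            tft.normalize(x) * y),
        # tensorflow_fn stands for any TensorFlow function.
        "B": tensorflow_fn(y, z),
        "C": tft.ngrams(z)
    }
[Figure: inputs X, Y, Z and outputs A, B, C; A is computed via mean, stddev → normalize → multiply → quantiles → bucketize]
Many operations are available for dealing with text and numeric features; users can also define their own.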

Analyzers (mean, stddev, quantiles)
● Reduce (full pass)
● Implemented by arbitrary Beam code
Transforms (normalize, multiply, bucketize)
● Instance-to-instance (don’t change batch dimension)
● Pure TensorFlow

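Analyze
[Figure: data flows through the graph of analyzers (mean, stddev, quantiles) and transforms (normalize, multiply, bucketize); the resulting graph keeps normalize, multiply, and bucketize, with the analyzer results as constant tensors]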
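How the Analyze step turns analyzers into constants can be illustrated with the normalize → multiply → bucketize chain. This is a plain-Python sketch under assumptions: the function names, the dict-based rows, and the simple quantile rule are all hypothetical and do not reflect how tf.Transform computes quantiles.

```python
import bisect
import math


def analyze(rows, num_buckets=4):
    """Run the analyzers as full passes and return their results as constants."""
    xs = [r["x"] for r in rows]
    mean = sum(xs) / len(xs)
    stddev = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    # The quantiles analyzer consumes the output of normalize and multiply,
    # so that intermediate value is computed over the full dataset first.
    products = sorted((r["x"] - mean) / stddev * r["y"] for r in rows)
    boundaries = [products[len(products) * i // num_buckets]
                  for i in range(1, num_buckets)]
    return {"mean": mean, "stddev": stddev, "boundaries": boundaries}


def transform_instance(row, constants):
    """Instance-to-instance part: normalize -> multiply -> bucketize."""
    normalized = (row["x"] - constants["mean"]) / constants["stddev"]
    product = normalized * row["y"]
    return bisect.bisect_right(constants["boundaries"], product)


rows = [{"x": float(i), "y": 2.0} for i in range(8)]
constants = analyze(rows)
buckets = [transform_instance(r, constants) for r in rows]
print(constants)
print(buckets)
```

After `analyze` runs, `transform_instance` needs no access to the dataset: it only reads the row and the stored constants, which is what lets the same logic run per request at serving time.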

Transform
[Figure: training data → Transform (normalize, multiply, bucketize) → transformed data]

Training: data → Transform (normalize, multiply, bucketize) → transformed data
Serving: instance → Transform (normalize, multiply, bucketize) → transformed instance

● Prerequisite: All your serving-time logic is or can be expressed as TF ops. Pre-computation (analyzers) can be anything.
● If this is possible, tf.Transform will help you to
○ do batch processing prior to training, and do the same processing in the serving graph,
○ do processing that requires full-pass operations (e.g. vocabs, normalization),
○ apply a rich set of pre-built feature transformations and analyzers (normalization, bucketization/quantiles, integerization, principal component analysis, correlation),
○ optionally materialize expensive transformations.

Scale to ...: tft.scale_to_z_score
Bag of Words / N-Grams: tf.string_split, tft.ngrams, tft.string_to_int
Bucketization: tft.quantiles, tft.apply_buckets
Feature Crosses: tf.string_join, tft.string_to_int
Apply another TensorFlow Model: tft.apply_saved_model
...
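As a rough illustration of the bag-of-words / n-grams pattern listed above (split, build n-grams, map strings to integers), here is a plain-Python sketch. It does not call the tf.Transform functions; `ngrams`, `build_vocab`, and `string_to_int` are hypothetical helpers that only mimic the division of labor between a full-pass vocabulary analyzer and a per-instance lookup.

```python
from collections import Counter


def ngrams(tokens, n_min=1, n_max=2, sep=" "):
    """All n-grams of the token list for n in [n_min, n_max]."""
    return [sep.join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def build_vocab(documents):
    """Analyzer: full pass that assigns ids by descending frequency."""
    counts = Counter(g for doc in documents for g in ngrams(doc.split()))
    ordered = sorted(counts, key=lambda g: (-counts[g], g))
    return {g: i for i, g in enumerate(ordered)}


def string_to_int(document, vocab, default=-1):
    """Transform: per-instance lookup against the precomputed vocabulary."""
    return [vocab.get(g, default) for g in ngrams(document.split())]


docs = ["the cat sat", "the cat ran"]
vocab = build_vocab(docs)
ids = string_to_int("the dog sat", vocab)
print(vocab)
print(ids)
```

N-grams that never appeared in the analyzed data, such as those containing "dog" here, fall back to the default id.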

tf.Transform is built on Apache Beam
Apache Beam is an open source,
unified model for defining both
batch and streaming data-parallel
processing pipelines.

tf.Transform is built on Apache Beam
● Beam is the direct successor of MapReduce, Flume, MillWheel, etc.
● Beam provides a unified API that allows for execution on many* different runners (Local, Spark, Flink, IBM Streams, Google Cloud Dataflow, …)
● Beam also runs internally at Google on Borg [1].
[1] https://research.google.com/pubs/pub43438.html
* Work in progress for Python.

Running the pipeline with Beam
● tf.Transform provides a set of operations as Beam PTransforms
● These can be mixed with existing Beam transforms (e.g. reads and writes)

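Running the pipeline as Beam Pipeline
# Imports are assumptions based on tf.Transform releases from the time of
# this talk; module paths may differ in other versions.
import apache_beam as beam
from apache_beam.io import tfrecordio
from tensorflow_transform.beam.impl import AnalyzeDataset, TransformDataset
from tensorflow_transform.beam.tft_beam_io import transform_fn_io
from tensorflow_transform.coders import ExampleProtoCoder
from tensorflow_transform.tf_metadata import dataset_metadata, dataset_schema

# Schema definition for input data.
schema = dataset_schema.Schema(...)
metadata = dataset_metadata.DatasetMetadata(schema)

# Define preprocessing_fn as before.
def preprocessing_fn(inputs):
    ...

# Execute the Beam pipeline.
with beam.Pipeline() as pipeline:
    # Read input.
    train_data = pipeline | tfrecordio.ReadFromTFRecord(
        '/path/to/input*', coder=ExampleProtoCoder(schema))
    # Perform analysis.
    transform_fn = (train_data, metadata) | AnalyzeDataset(preprocessing_fn)
    transform_fn | transform_fn_io.WriteTransformFn('/transform_fn/output/dir')
    # Optional materialization: apply the analyzed transform_fn to the data.
    transformed_data, transformed_metadata = (
        ((train_data, metadata), transform_fn) | TransformDataset())
    transformed_data | tfrecordio.WriteToTFRecord(
        '/output/path', coder=ExampleProtoCoder(transformed_metadata.schema))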


// It doesn’t matter if you can train or serve fast if the data is wrong
/ Data analysis and validation are critical
// Having the right features is critical for model quality
/ Feature transformations are an important part of feature engineering
// End-to-end matters
/ Analysis/validation/transformations need to cover both training and serving
/ Solution packaged in TFX, Google’s end-to-end platform for production ML
