End-to-end feature analysis, validation, and transformation in TFX
Alkis (npolyzotis@google.com)
Ananth (ananthr@google.com)
Introduction
TFX: A TensorFlow-Based Production-Scale Machine Learning Platform. KDD (2017).
https://youtu.be/fPTwLVCq00U
Focus of this paper
Figure 1: High-level component overview of a machine learning platform.
Components: Data Ingestion, Data Analysis, Data Transformation, Data Validation, Trainer, Model Evaluation and Validation, Serving, Logging, Tuner
Shared layers: Integrated Frontend for Job Management, Monitoring, Debugging, Data/Model/Evaluation Visualization; Shared Configuration Framework and Job Orchestration; Shared Utilities for Garbage Collection, Data Access Controls; Pipeline Storage
Focus of this talk
(Same component diagram as Figure 1.)
“How do I connect my data to training/serving?”
“What is the shape of my data?”
“How do I derive more signals from the raw data?”
“Any errors in the data?”
Goals
Provide turn-key functionality for a variety of use cases
Codify and enforce end-to-end best practices for ML data
Data Ingestion, Analysis, and Validation
Stages: Data Ingestion, Data Analysis, Data Validation (Schema Validation, Model-driven Validation, Skew Detection)
Problem: Diverse data storage systems with different formats
Solution: Data ingestion normalizes data to a standard representation
When needed, enforces consistent data handling between training and serving
Diagram: Data Ingestion → Standardized Format, Location, GC Policy, etc. → TFX Components
Google Research Blog: Facets: An Open Source Visualization Tool for Machine Learning Training Data
Problem: Gaining understanding of TB of data with O(1000s) of features is non-trivial
Solution: Scalable data analysis and visualization tools
Problem: Finding errors in TB of data with O(1000s) of features is challenging
● ML data formats have limited semantics
● Not all anomalies are important
● Data errors must be explainable
E.g., “Data distribution changed” vs “Default value for feature lang is too frequent”
Data management challenges in Production Machine Learning tutorial in SIGMOD’17
Schema Example
feature {
  name: 'event'
  presence: REQUIRED
  valency: SINGLE
  type: BYTES
  domain {
    value: 'CLICK'
    value: 'CONVERSION'
  }
}
event is a required feature that takes exactly one bytes value in {“CLICK”, “CONVERSION”}.
Also in the schema:
● Context (training vs serving) where feature appears
● Constraints on value distribution
● + many more ML-related constraints
Schema life cycle:
● TFX infers initial schema by analyzing the data
● TFX proposes changes as the data evolves
● User curates proposed changes
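The first step of the life cycle, inferring an initial schema from the data, can be sketched in plain Python. This is only an illustration of the idea, not how TFX implements it: `infer_schema`, the dictionary layout, and the use of Python type names (`str`, `int`) in place of the schema's BYTES/INT are all assumptions made for the example.

```python
from typing import Any, Dict, List

Example = Dict[str, List[Any]]  # feature name -> list of values


def infer_schema(examples: List[Example]) -> Dict[str, Dict[str, Any]]:
    """Infer presence, valency, type and (for strings) a domain per feature."""
    schema: Dict[str, Dict[str, Any]] = {}
    names = {name for ex in examples for name in ex}
    for name in sorted(names):
        value_lists = [ex[name] for ex in examples if name in ex]
        values = [v for vs in value_lists for v in vs]
        types = {type(v).__name__ for v in values}
        entry: Dict[str, Any] = {
            # REQUIRED only if every example carries the feature.
            "presence": "REQUIRED" if len(value_lists) == len(examples) else "OPTIONAL",
            # SINGLE only if every occurrence has exactly one value.
            "valency": "SINGLE" if all(len(vs) == 1 for vs in value_lists) else "REPEATED",
            "type": types.pop() if len(types) == 1 else "MIXED",
        }
        if entry["type"] == "str":
            # Propose the observed values as the initial domain.
            entry["domain"] = sorted(set(values))
        schema[name] = entry
    return schema


examples = [
    {"event": ["CLICK"], "num_impressions": [3]},
    {"event": ["CONVERSION"], "num_impressions": [7]},
    {"event": ["CLICK"]},
]
schema = infer_schema(examples)
```

The inferred schema is only a proposal; as the life cycle above says, the user still curates it (for example, deciding whether `num_impressions` really is optional).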
Schema
feature {
  name: 'event'
  presence: REQUIRED
  valency: SINGLE
  type: BYTES
  domain {
    value: 'CLICK'
    value: 'CONVERSION'
  }
}
feature {
  name: 'num_impressions'
  type: INT
}
Training Example
feature {
  name: 'event'
  value: 'IMPRESSION'
}
feature {
  name: 'num_impressions'
  value: 0.64
}
TFX Data Validation
'event': unexpected value
Fix: update domain
feature {
  name: 'event'
  presence: REQUIRED
  valency: SINGLE
  type: BYTES
  domain {
    value: 'CLICK'
    value: 'CONVERSION'
+   value: 'IMPRESSION'
  }
}
'num_impressions': wrong type
Fix: deprecate feature
feature {
  name: 'num_impressions'
  type: INT
+ deprecated: true
}
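The check on this slide can be sketched as a small validator. This is a minimal illustration under assumptions, not the TFX validation component: the `SCHEMA` dictionary, the `validate` function, and the anomaly strings are hypothetical stand-ins for the schema and messages shown above.

```python
from typing import Any, Dict, List

# Hypothetical in-memory form of the schema from the slide.
SCHEMA: Dict[str, Dict[str, Any]] = {
    "event": {"type": str, "domain": {"CLICK", "CONVERSION"}},
    "num_impressions": {"type": int},
}


def validate(example: Dict[str, Any], schema: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return one explainable anomaly message per offending feature."""
    anomalies: List[str] = []
    for name, value in example.items():
        spec = schema.get(name)
        if spec is None:
            anomalies.append(f"'{name}': feature not in schema")
        elif spec.get("deprecated"):
            continue  # deprecated features are no longer checked
        elif not isinstance(value, spec["type"]):
            anomalies.append(f"'{name}': wrong type")
        elif "domain" in spec and value not in spec["domain"]:
            anomalies.append(f"'{name}': unexpected value")
    return anomalies


example = {"event": "IMPRESSION", "num_impressions": 0.64}
anomalies = validate(example, SCHEMA)

# Apply the two fixes proposed on the slide, then re-validate.
SCHEMA["event"]["domain"].add("IMPRESSION")
SCHEMA["num_impressions"]["deprecated"] = True
anomalies_after_fix = validate(example, SCHEMA)
```

Each message names the feature and the violated constraint, which is the kind of explainable error the earlier slide asks for.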
Model-driven Validation
Schema
feature {
  name: 'event'
  presence: REQUIRED
  valency: SINGLE
  type: BYTES
  domain {
    value: 'CLICK'
    value: 'CONVERSION'
  }
}
feature {
  name: 'num_impressions'
  type: INT
}
Data Generator
Synthetic Example
feature {
  name: 'event'
  value: 'CONVERSION'
}
feature {
  name: 'num_impressions'
  value: [0 1 -1 9999999999]
}
TF Training
10 ...
11 i = tf.log(num_impressions)
12 ...
Line 11: invalid argument for tf.log
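The idea on this slide, generating synthetic values that the schema allows and feeding them to the training code, can be sketched as follows. This is an illustration under stated assumptions: `synthetic_ints`, `training_step`, and `find_failures` are hypothetical names, and `math.log` stands in for the `tf.log` call on line 11 so that the sketch runs without TensorFlow.

```python
import math
from typing import Callable, Dict, List


def synthetic_ints() -> List[int]:
    """Boundary-style values permitted by a schema that only says `type: INT`."""
    return [0, 1, -1, 9999999999]


def training_step(num_impressions: int) -> float:
    # Stand-in for line 11 of the training code: i = tf.log(num_impressions)
    return math.log(num_impressions)


def find_failures(fn: Callable[[int], float], values: List[int]) -> Dict[int, str]:
    """Run fn on every synthetic value and record the inputs it rejects."""
    failures: Dict[int, str] = {}
    for v in values:
        try:
            fn(v)
        except ValueError as err:
            failures[v] = str(err)
    return failures


failures = find_failures(training_step, synthetic_ints())
```

The failing inputs (0 and -1 here) show that the schema is too permissive for what the training code assumes, so either the schema gains a constraint on `num_impressions` or the code guards its input.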
Is training data in day N “similar” to day N-1?
Is training data “similar” to serving data?
Dataset “similarity” checks:
● Do the datasets conform to the same schema?
● Are the distributions similar?
● Are features exactly the same for the same examples?
Skew problems are common in production and usually easy to fix once detected
⇒ Greatest bang for buck for data validation
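The second check above, "Are the distributions similar?", could be sketched as a comparison of value frequencies between two datasets. The slides do not say which distance measure is used; the L-infinity distance, the `THRESHOLD` value, and the function names below are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def value_distribution(values: Iterable[str]) -> Dict[str, float]:
    """Normalize value counts into relative frequencies."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}


def linf_distance(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Largest absolute difference in frequency over all observed values."""
    return max((abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q)), default=0.0)


# Toy 'event' feature values for training and serving data.
train = ["CLICK"] * 80 + ["CONVERSION"] * 20
serve = ["CLICK"] * 50 + ["CONVERSION"] * 20 + ["IMPRESSION"] * 30

distance = linf_distance(value_distribution(train), value_distribution(serve))
THRESHOLD = 0.1  # hypothetical alerting threshold
skewed = distance > THRESHOLD
```

A single-number distance per feature keeps the alert explainable: the report can name the feature and the value whose frequency moved the most.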
Diagram: Recommender System. Elements: User, Items (Item 1, Item 2, Item 3, ...), User Actions, Logs, Learner, Model.
+2% app install rate by fixing training-serving feature skew.
Data Ingestion, Analysis, and Validation in TFX
/ Treat ML data as assets on par with source code and infrastructure
/ Develop processes for testing, monitoring, cataloguing, tracking, ..., ML data
/ Consider the end-to-end story from training to serving and back
/ Explore the research problems in the intersection of ML and DB
TensorFlow Transform
Motivation: Training/Serving Skew
During training: data → batch processing
During serving: request → live processing
● Need to keep batch and live processing in sync.
● All other tooling (e.g. evaluation) must also be kept in sync with batch processing.
● Do everything in the training graph.
● Do everything in the training graph + using statistics/vocabs generated from raw data.
During training: data → tf.Transform batch processing
During serving: request → transform as tf.Graph
● “Analyze” is like scikit-learn “fit”
○ Takes a user-defined pipeline and training data.
○ Produces a TF graph.
● “Transform” is like scikit-learn “transform”
○ Takes the graph produced by “Analyze” and applies it, in a Beam Map, to the data.
○ “Transform” materializes the transformed data.
● The same Transform TF graph can be used in training and serving.
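The Analyze/Transform split can be sketched in plain Python, with a closure playing the role of the TF graph. This is a conceptual illustration only; `analyze` and `transform_fn` are hypothetical names, and the real library produces a TensorFlow graph rather than a Python function.

```python
from statistics import mean, pstdev
from typing import Callable, List


def analyze(training_data: List[float]) -> Callable[[float], float]:
    """'Analyze': one full pass over the data, returning a per-instance transform."""
    mu = mean(training_data)
    sigma = pstdev(training_data)  # assumes the data is not constant (sigma > 0)

    def transform_fn(x: float) -> float:
        # The statistics are baked in as constants, like constants in a graph.
        return (x - mu) / sigma

    return transform_fn


training_data = [2.0, 4.0, 6.0, 8.0]
transform_fn = analyze(training_data)

# 'Transform': apply the same function to every training instance ...
transformed_training = [transform_fn(x) for x in training_data]
# ... and reuse it unchanged on a single request at serving time.
served = transform_fn(5.0)
```

Because serving calls the very same `transform_fn`, there is no second implementation to keep in sync, which is the point of the last bullet above.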
● tf.Transform works by limiting transformations to those with a serving equivalent.
○ Similar to scikit-learn analyzers (fit + transform).
○ The serving graph must operate independently on each instance.
○ The serving graph must also be expressible as a TF graph.
● The analysis is not so limited.
Diagram: data → tf.Transform Analyze → (save for use at inference)
Diagram: data → tf.Transform Transform → processed data → trainer
Defining a preprocessing function in TFX
def preprocessing_fn(inputs):
    x = inputs['X']
    ...
    return {
        "A": tft.bucketize(
            tft.normalize(x) * y),
        "B": tensorflow_fn(y, z),
        "C": tft.ngrams(z)
    }
Graph for output A: X → normalize (mean, stddev) → multiply → bucketize (quantiles) → A
Inputs: X, Y, Z. Outputs: A, B, C.
Many operations are available for dealing with text and numeric features; users can define their own.
Analyzers: Reduce (full pass). Implemented by arbitrary Beam code.
Transforms: Instance-to-instance (don’t change batch dimension). Pure TensorFlow.
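The analyzer/transform distinction can be illustrated with the quantiles and bucketize pair from the graph above. This is a plain-Python sketch, not the tf.Transform implementation: `analyze_quantiles` and `bucketize` are hypothetical helpers, and the standard library's `statistics.quantiles` stands in for a distributed full-pass computation.

```python
import bisect
from statistics import quantiles
from typing import List


def analyze_quantiles(values: List[float], num_buckets: int) -> List[float]:
    """Analyzer: a full pass that reduces the dataset to bucket boundaries."""
    return quantiles(values, n=num_buckets, method="inclusive")


def bucketize(x: float, boundaries: List[float]) -> int:
    """Transform: instance-to-instance, using only the precomputed constants."""
    return bisect.bisect_right(boundaries, x)


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
boundaries = analyze_quantiles(data, num_buckets=4)
buckets = [bucketize(x, boundaries) for x in data]
```

Only `analyze_quantiles` needs to see the whole dataset; `bucketize` touches one value at a time, so it is the part that can run per request at serving time.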
Analyze: the analyzers (mean, stddev, quantiles) are computed over the data and become constant tensors in the graph normalize → multiply → bucketize
Transform (training): data → normalize → multiply → bucketize → transformed data
Transform (serving): instance → normalize → multiply → bucketize → transformed instance
When to use tf.Transform
● Prerequisite: All your serving-time logic is or can be expressed as TF ops. Pre-computation (analyzers) can be anything.
● If this is possible, tf.Transform will help you to
○ do batch processing prior to training, and do the same processing in the serving graph,
○ do processing that requires full-pass operations (e.g. vocabs, normalization),
○ apply a rich set of pre-built feature transformations and analyzers (normalization, bucketization/quantiles, integerization, principal component analysis, correlation),
○ optionally materialize expensive transformations.
Scale to ...: tft.scale_to_z_score
Bag of Words / N-Grams: tft.ngrams, tft.string_to_int, tf.string_split
Bucketization: tft.apply_buckets, tft.quantiles
Feature Crosses: tft.string_to_int, tf.string_join
Apply another TensorFlow Model: tft.apply_saved_model
...
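As a rough picture of what the Bag of Words / N-Grams combination computes, here is a plain-Python sketch. The helpers `ngrams`, `build_vocab`, and `integerize` are hypothetical stand-ins written for this example; they are not the tft functions listed above and make no claim about their signatures.

```python
from collections import Counter
from typing import Dict, List


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Per-instance transform: sliding windows of n tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocab(corpus: List[List[str]]) -> Dict[str, int]:
    """Full-pass analyzer: assign ids by descending frequency, ties alphabetical."""
    counts = Counter(term for doc in corpus for term in doc)
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    return {term: i for i, term in enumerate(ordered)}


def integerize(terms: List[str], vocab: Dict[str, int], oov_id: int = -1) -> List[int]:
    """Per-instance transform: look up each term, mapping unknown terms to oov_id."""
    return [vocab.get(t, oov_id) for t in terms]


docs = ["the cat sat", "the cat ran"]
bigrams = [ngrams(d.split(), 2) for d in docs]
vocab = build_vocab(bigrams)

# At serving time an unseen bigram falls back to the out-of-vocabulary id.
ids = integerize(ngrams("the cat slept".split(), 2), vocab)
```

Splitting and n-gram extraction are per-instance steps, while the vocabulary requires a full pass, which mirrors the analyzer/transform split described earlier.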
How to use tf.Transform
tf.Transform is built on Apache Beam
Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines.
● Beam is the direct successor of MapReduce, Flume, MillWheel, etc.
● Beam provides a unified API that allows for execution on many* different runners (Local, Spark, Flink, IBM Streams, Google Cloud Dataflow, ...)
● Beam also runs internally at Google on Borg [1].
[1] https://research.google.com/pubs/pub43438.html
* Work in progress for Python.
Running the pipeline with Beam
● tf.Transform provides a set of operations as Beam PTransforms
● These can be mixed with existing Beam transforms (e.g. reads and writes)
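Running the pipeline as a Beam Pipeline
# Imports assume the tf.Transform release contemporary with this talk;
# module paths may differ in later releases.
import apache_beam as beam
from apache_beam.io import tfrecordio
from tensorflow_transform.beam.impl import AnalyzeDataset, TransformDataset
from tensorflow_transform.beam.tft_beam_io import transform_fn_io
from tensorflow_transform.coders.example_proto_coder import ExampleProtoCoder
from tensorflow_transform.tf_metadata import dataset_metadata, dataset_schema

# Schema definition for input data.
schema = dataset_schema.Schema(...)
metadata = dataset_metadata.DatasetMetadata(schema)

# Define preprocessing_fn as before.
def preprocessing_fn(inputs):
    ...

# Execute the Beam pipeline.
with beam.Pipeline() as pipeline:
    # Read input.
    train_data = pipeline | tfrecordio.ReadFromTFRecord(
        '/path/to/input*', coder=ExampleProtoCoder(schema))
    # Perform analysis.
    transform_fn = (train_data, metadata) | AnalyzeDataset(preprocessing_fn)
    transform_fn | transform_fn_io.WriteTransformFn('/transform_fn/output/dir')
    # Optional materialization: TransformDataset takes the dataset and the transform_fn.
    transformed_data, transformed_metadata = (
        ((train_data, metadata), transform_fn) | TransformDataset())
    transformed_data | tfrecordio.WriteToTFRecord(
        '/output/path', coder=ExampleProtoCoder(transformed_metadata.schema))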
ML Platform Q1 Meetup: End to-end Feature Analysis, Validation and Transformation in TFX
// It doesn’t matter if you can train or serve fast if the data is wrong
/ Data analysis and validation are critical
// Having the right features is critical for model quality
/ Feature transformations are an important part of feature engineering
// End-to-end matters
/ Analysis/validation/transformations need to cover both training and serving
/ Solution packaged in TFX, Google’s end-to-end platform for production ML
