Developing Machine Learning Models
Kevin Tsai
Developing a ML model using TF Estimator
GOTO Chicago 2018
Kevin Tsai, Google
Agenda
Machine Learning with tf.estimator
Performance pipelines with TFRecords and tf.data
Distributed Training with GPUs and TPUs
Machine Learning with tf.estimator
Machine Learning Pipeline
[Diagram, built up across four slides]
Training Pipeline: Training Data, Transform & Feature Engineering, Model Building (split(), train(), evaluate(), export()), Model, Logs
Inference Pipeline: Inference Requests, Transform & Feature Engineering, Model Serving, Predictions
Machine Learning Pipeline with tf.estimator
[Diagram, built up across five slides]
tf.estimator: estimator, model_fn
Input functions: train_input_fn, eval_input_fn, predict_input_fn, serving_input_rec_fn
Estimator methods: train(), evaluate(), predict(), export_savedmodel()
Training Pipeline: Training Data, Transform & Feature Engineering, Logs
Inference Pipeline: Inference Requests, Transform & Feature Engineering, Model Serving, Predictions
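The division of labour in this pipeline (an estimator that owns train(), evaluate() and predict(), a model_fn that describes the model, and separate input functions that supply the data) can be illustrated with a toy stand-in in plain Python. This is only a sketch of the calling pattern, not the tf.estimator API: ToyEstimator, the one-weight linear model, the extra weight argument and the sample data are all hypothetical and exist purely for illustration.

```python
# Toy stand-in for the estimator pattern; NOT the real tf.estimator API.
TRAIN, EVAL, PREDICT = "train", "eval", "infer"


def model_fn(features, labels, mode, params, weight):
    """Describe the model once and branch on mode, as a custom model_fn does."""
    predictions = [weight * x for x in features]
    if mode == PREDICT:
        return {"predictions": predictions}
    errors = [p - y for p, y in zip(predictions, labels)]
    loss = sum(e * e for e in errors) / len(errors)
    if mode == EVAL:
        return {"loss": loss}
    # TRAIN: one gradient-descent step on the mean squared error.
    grad = sum(2 * e * x for e, x in zip(errors, features)) / len(errors)
    return {"loss": loss, "weight": weight - params["learning_rate"] * grad}


class ToyEstimator:
    """Owns the model state and drives model_fn with data from an input_fn."""

    def __init__(self, model_fn, params):
        self.model_fn = model_fn
        self.params = params
        self.weight = 0.0

    def train(self, input_fn, steps):
        for _ in range(steps):
            features, labels = input_fn()
            spec = self.model_fn(features, labels, TRAIN, self.params, self.weight)
            self.weight = spec["weight"]
        return self

    def evaluate(self, input_fn):
        features, labels = input_fn()
        return self.model_fn(features, labels, EVAL, self.params, self.weight)

    def predict(self, input_fn):
        features, _ = input_fn()
        spec = self.model_fn(features, None, PREDICT, self.params, self.weight)
        return spec["predictions"]


def train_input_fn():
    return [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]  # hypothetical data, y = 2x


def eval_input_fn():
    return [4.0, 5.0], [8.0, 10.0]


def predict_input_fn():
    return [10.0], None


estimator = ToyEstimator(model_fn, params={"learning_rate": 0.05})
estimator.train(train_input_fn, steps=200)
eval_loss = estimator.evaluate(eval_input_fn)["loss"]
prediction = estimator.predict(predict_input_fn)[0]
```

The point of the pattern is that the model description lives in one function, while the data source is swapped per call by passing a different input function.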
Estimator in TensorFlow
[Diagram: TensorFlow stack, top to bottom]
Canned estimators: models in a box
Estimator, Keras model: train and evaluate models
Layers: build models
Language frontends: Python, C++, Java, Go, ...
TensorFlow Distributed Execution Engine
Devices: CPU, GPU, TPU, Android, iOS, ...
Example: MNIST
MNIST: Yann LeCun
Steps to Run Training on Premade Estimator
Training:
1. Import data
2. Create input functions
3. Create estimator
4. Create train and evaluate specs
5. Run train and evaluate
Test Inference:
6. Test run predict()
7. Export model
Premade and Custom Estimators
Premade Estimators:
DNNClassifier
DNNRegressor
LinearClassifier
LinearRegressor
DNNLinearCombinedClassifier
DNNLinearCombinedRegressor
+ tf.contrib.estimator
or Custom Estimator
Creating Custom Estimator
#1 Inference()
[Diagram, built up across four slides: layer stack with output shapes]
Input: (-1,784)
Reshape: (-1,28,28,1)
Convolution, ReLU, Max Pool: (-1,14,14,64)
Convolution, ReLU, Max Pool: (-1,7,7,64)
Reshape: (-1,3136)
Dense, ReLU: (-1,1024)
Dropout
Dense: (-1,10)
Softmax: (-1,10)
[Diagram: helper layers used by #1 Inference()]
Convolution helper (log): tf.name_scope, tf.nn.conv2d, tf.nn.relu, tf.nn.max_pool, tf.summary.histogram (x3)
Dense helper (log, relu=False): tf.name_scope, tf.matmul, tf.add, tf.nn.relu, tf.summary.histogram (x3)
Hands-on TensorBoard: Dandelion Mané
#### 1 INFERENCE MODEL
input_layer = tf.reshape(features["x"], [-1, 28, 28, 1])
conv1 = _conv(input_layer, kernel=[5, 5, 1, 64], name='conv1', log=params['log'])
conv2 = _conv(conv1, kernel=[5, 5, 64, 64], name='conv2', log=params['log'])
dense = _dense(conv2, size_in=7*7*64, size_out=params['dense_units'],
               name='Dense', relu=True, log=params['log'])
if mode == tf.estimator.ModeKeys.TRAIN:
    # TF 1.x: the second argument is keep_prob, the probability of keeping a unit
    dense = tf.nn.dropout(dense, params['drop_out'])
logits = _dense(dense, size_in=params['dense_units'],
                size_out=10, name='Output', relu=False, log=params['log'])
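The layer shapes used by this model, including the size_in=7*7*64 passed to the dense layer, can be checked with a little arithmetic. The sketch below assumes that each convolution block applies a 'SAME'-padded, stride-1 convolution followed by 2x2 max pooling with stride 2; the slides do not show the helper bodies, so that is an inference from the shapes. conv_pool_shape and trace_shapes are hypothetical names, and dense_units=1024 is taken from the (-1,1024) shape in the diagram.

```python
import math


def conv_pool_shape(height, width, out_channels, pool_stride=2):
    """Output (H, W, C) of a SAME stride-1 convolution followed by SAME max pooling."""
    # A SAME, stride-1 convolution keeps H and W; pooling divides them by the stride.
    return (math.ceil(height / pool_stride),
            math.ceil(width / pool_stride),
            out_channels)


def trace_shapes(dense_units=1024, num_classes=10):
    """Per-example shapes through the network (the leading -1 batch dim is omitted)."""
    shapes = [("input", (784,)), ("reshape", (28, 28, 1))]
    h, w, c = conv_pool_shape(28, 28, 64)
    shapes.append(("conv1", (h, w, c)))
    h, w, c = conv_pool_shape(h, w, 64)
    shapes.append(("conv2", (h, w, c)))
    shapes.append(("flatten", (h * w * c,)))
    shapes.append(("dense", (dense_units,)))
    shapes.append(("logits", (num_classes,)))
    return shapes


for name, shape in trace_shapes():
    print(name, shape)
```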
#### 2 CALCULATIONS AND METRICS
predictions = {"classes": tf.argmax(input=logits, axis=1),
               "logits": logits,
               "probabilities": tf.nn.softmax(logits, name='softmax')}
export_outputs = {'predictions': tf.estimator.export.PredictOutput(predictions)}
if mode in (tf.estimator.ModeKeys.TRAIN, tf.estimator.ModeKeys.EVAL):
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    accuracy = tf.metrics.accuracy(
        labels=labels, predictions=tf.argmax(logits, axis=1))
    metrics = {'accuracy': accuracy}
    tf.summary.scalar('accuracy', accuracy[1])
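To make the quantities in this step concrete, here is a plain-Python sketch of what the probabilities, the loss and the accuracy compute for a small batch. It mirrors the math only: it assumes the loss is averaged over the batch with unit weights, and the logits and labels are made-up values.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one example's logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_softmax_cross_entropy(labels, batch_logits):
    """Mean of -log(softmax(logits)[label]); labels are integer class indices."""
    losses = [-math.log(softmax(row)[label])
              for label, row in zip(labels, batch_logits)]
    return sum(losses) / len(losses)


def accuracy(labels, batch_logits):
    """Fraction of examples whose argmax class equals the label."""
    classes = [max(range(len(row)), key=row.__getitem__) for row in batch_logits]
    return sum(c == y for c, y in zip(classes, labels)) / len(labels)


batch_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]  # hypothetical logits, 3 classes
labels = [0, 1]                                      # hypothetical labels
probabilities = [softmax(row) for row in batch_logits]
loss = sparse_softmax_cross_entropy(labels, batch_logits)
acc = accuracy(labels, batch_logits)
```

"Sparse" here means the labels are class indices rather than one-hot vectors, which is why the MNIST labels can be passed as int32 values.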
#### 3 MODE = PREDICT
if mode == tf.estimator.ModeKeys.PREDICT:
    return tf.estimator.EstimatorSpec(
        mode=mode, predictions=predictions, export_outputs=export_outputs)
#### 4 MODE = TRAIN
if mode == tf.estimator.ModeKeys.TRAIN:
    learning_rate = tf.train.exponential_decay(
        params['learning_rate'], tf.train.get_global_step(),
        decay_steps=100000, decay_rate=0.96)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
    if params['replicate']:
        optimizer = tf.contrib.estimator.TowerOptimizer(optimizer)
    train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())
    tf.summary.scalar('learning_rate', learning_rate)
    return tf.estimator.EstimatorSpec(
        mode=mode, loss=loss, train_op=train_op)
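The learning-rate schedule above follows the formula rate = initial_rate * decay_rate ^ (global_step / decay_steps). A plain-Python sketch of that formula, using the decay_steps and decay_rate from the slide; the initial rate of 0.1 is a made-up value, since params['learning_rate'] is not shown.

```python
def exponential_decay(initial_rate, global_step, decay_steps, decay_rate,
                      staircase=False):
    """Decayed learning rate; with staircase=True the exponent is an integer."""
    if staircase:
        exponent = global_step // decay_steps
    else:
        exponent = global_step / decay_steps
    return initial_rate * decay_rate ** exponent


initial_rate = 0.1  # hypothetical value for params['learning_rate']
for step in (0, 50000, 100000, 200000):
    rate = exponential_decay(initial_rate, step, decay_steps=100000, decay_rate=0.96)
    print(step, rate)
```

With decay_steps=100000 the rate shrinks by only 4% per 100,000 steps, so for short runs the schedule is close to a constant learning rate.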
#### 5 MODE = EVAL
if mode == tf.estimator.ModeKeys.EVAL:
    return tf.estimator.EstimatorSpec(
        mode=mode, loss=loss, eval_metric_ops=metrics)
TFRecords and tf.data
(Inefficient) Pipeline
import numpy as np
import tensorflow as tf  # TensorFlow 1.x APIs

# Load Data
mnist = tf.contrib.learn.datasets.load_dataset("mnist")
train_data = mnist.train.images
train_labels = np.asarray(mnist.train.labels, dtype=np.int32)
eval_data = mnist.test.images
eval_labels = np.asarray(mnist.test.labels, dtype=np.int32)

# Create Input Functions
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": train_data}, y=train_labels, batch_size=100, num_epochs=None, shuffle=True)
eval_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": eval_data}, y=eval_labels, num_epochs=1, shuffle=False)
pred_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": eval_data}, num_epochs=1, shuffle=False)
TFRecord
[Diagram: structure of a TFRecord file and how it is written]
train004.tfrecords: a sequence of tf.train.Example records
tf.train.Example: contains tf.train.Features, which holds the individual tf.train.Feature entries
Writing: tf.train.Feature entries are wrapped in tf.train.Features and a tf.train.Example, then written with tf.python_io.TFRecordWriter
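The slides name shards such as train004.tfrecords, and the summary recommends files of roughly 100-150MB created in parallel. As a rough sketch of the planning step only (it does not write TFRecords), the code below picks a shard count for a target size and assigns example indices to shards so that each shard could be handed to a separate writer process. plan_shards, assign_examples, the 125 MiB target and the 1 GiB dataset size are all hypothetical.

```python
import math


def plan_shards(total_bytes, target_shard_bytes=125 * 1024 * 1024, prefix="train"):
    """Return shard filenames so that each shard is close to the target size."""
    num_shards = max(1, math.ceil(total_bytes / target_shard_bytes))
    return ["{}{:03d}.tfrecords".format(prefix, i) for i in range(num_shards)]


def assign_examples(num_examples, num_shards):
    """Round-robin assignment of example indices to shards, one list per shard."""
    return [list(range(shard, num_examples, num_shards))
            for shard in range(num_shards)]


shard_names = plan_shards(total_bytes=1024 ** 3)  # hypothetical 1 GiB dataset
assignment = assign_examples(num_examples=20, num_shards=len(shard_names))
for name, indices in zip(shard_names, assignment):
    print(name, indices)
```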
TFRecord and tf.data
[Diagram: reading a TFRecord] Each tf.train.Feature inside a tf.train.Example is described by a tf.FixedLenFeature and decoded with tf.parse_single_example
tf.data input_fn()
def dataset_input_fn(c):  # c: dict of pipeline settings (files, threads, mode, ...)
    dataset = tf.data.TFRecordDataset(c['files'], num_parallel_reads=c['threads'])
    # Parse and scale the records in parallel
    dataset = dataset.map(parse_tfrecord, num_parallel_calls=c['threads'])
    dataset = dataset.map(lambda x, y: (image_scaling(x), y), num_parallel_calls=c['threads'])
    # Distort and shuffle only when training
    if c['mode'] == tf.estimator.ModeKeys.TRAIN:
        dataset = dataset.map(lambda x, y: (distort(x), y), num_parallel_calls=c['threads'])
        dataset = dataset.shuffle(buffer_size=c['shuffle_buff'])
    dataset = dataset.repeat()
    dataset = dataset.batch(c['batch'])
    # prefetch follows batch(), so its argument counts batches, not single examples
    dataset = dataset.prefetch(2*c['batch'])
    iterator = dataset.make_one_shot_iterator()
    images, labels = iterator.get_next()
    features = {'images': images}
    return features, labels

train_spec = tf.estimator.TrainSpec(input_fn=lambda: dataset_input_fn(train_params))
eval_spec = tf.estimator.EvalSpec(input_fn=lambda: dataset_input_fn(eval_params))
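The order of the stages in this input function (map, then shuffle while training, then batch) can be mimicked with plain Python generators. This is a toy illustration of the data flow and of how a bounded shuffle buffer behaves, not a replacement for tf.data: there is no parallelism or prefetching, repeat() is left out so the example terminates, and the integer "records" and the division by 255 merely stand in for parse_tfrecord and image_scaling.

```python
import random


def shuffle_buffer(items, buffer_size, rng):
    """Streaming shuffle: emit a random element from a fixed-size buffer."""
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)          # fill the buffer first
            continue
        index = rng.randrange(buffer_size)
        yield buffer[index]              # emit one buffered element...
        buffer[index] = item             # ...and replace it with the new one
    rng.shuffle(buffer)
    yield from buffer                    # drain what is left at the end


def batch(items, batch_size):
    """Group consecutive items into lists of batch_size (the last may be shorter)."""
    current = []
    for item in items:
        current.append(item)
        if len(current) == batch_size:
            yield current
            current = []
    if current:
        yield current


def toy_input_pipeline(records, training, batch_size, buffer_size, seed=0):
    """map -> (shuffle when training) -> batch, in the same order as the slide."""
    scaled = (r / 255.0 for r in records)  # stands in for image_scaling
    if training:
        scaled = shuffle_buffer(scaled, buffer_size, random.Random(seed))
    return batch(scaled, batch_size)


train_batches = list(toy_input_pipeline(range(10), True, batch_size=4, buffer_size=5))
eval_batches = list(toy_input_pipeline(range(10), False, batch_size=4, buffer_size=5))
```

A buffer smaller than the dataset only shuffles locally: the first element emitted always comes from the first buffer_size records, which is why the buffer size matters for large, ordered TFRecord files.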
Distributed Training
Parallelism is Not Automatic
[Diagram: a single graph computes ypred = WX + b from X, W and b using matmul and add]
Data and Model Parallelism
[Diagram, built up across four slides]
Data Parallelism: X is split; gpu:0 and gpu:1 each run a copy of the graph (matmul, add) on their share of X
Model Parallelism: the graph for a single X is placed across gpu:0 and gpu:1
Combined: both splits applied together across gpu:0, gpu:1, gpu:2 and gpu:3
Output Reduction
[Diagram, built up across two slides] X is split across gpu:0 and gpu:1; each device runs the graph (matmul, add) and produces:
[ grad(gpu:0) , ypred(gpu:0) , loss(gpu:0) ]
[ grad(gpu:1) , ypred(gpu:1) , loss(gpu:1) ]
Reduction: avg, concat, sum (for grad, ypred and loss respectively)
[ grad(model) , ypred(model) , loss(model) ]
Update Model Params
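The reduction step can be worked through by hand for a tiny model. The sketch below uses a scalar version of ypred = WX + b, splits a batch of four made-up examples across two simulated devices, and then averages the gradients, concatenates the predictions and sums the losses. It assumes each device computes the gradient of its own mean squared error and reports a summed squared error as its loss; with equally sized shards the averaged gradient then matches the full-batch gradient. The function names are hypothetical.

```python
def shard_outputs(w, b, xs, ys):
    """Per-device outputs for ypred = w*x + b: (grad, ypred, loss)."""
    ypred = [w * x + b for x in xs]
    errors = [p - y for p, y in zip(ypred, ys)]
    n = len(xs)
    grad_w = sum(2 * e * x for e, x in zip(errors, xs)) / n  # d(mean sq. error)/dw
    grad_b = sum(2 * e for e in errors) / n                  # d(mean sq. error)/db
    loss = sum(e * e for e in errors)                        # summed squared error
    return (grad_w, grad_b), ypred, loss


def reduce_outputs(per_device):
    """avg the gradients, concat the predictions, sum the losses."""
    k = len(per_device)
    grad_w = sum(out[0][0] for out in per_device) / k
    grad_b = sum(out[0][1] for out in per_device) / k
    ypred = [p for out in per_device for p in out[1]]
    loss = sum(out[2] for out in per_device)
    return (grad_w, grad_b), ypred, loss


w, b = 0.5, 0.0
xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]  # made-up batch
half = len(xs) // 2
per_device = [shard_outputs(w, b, xs[:half], ys[:half]),   # simulated gpu:0
              shard_outputs(w, b, xs[half:], ys[half:])]   # simulated gpu:1
reduced = reduce_outputs(per_device)
full_batch = shard_outputs(w, b, xs, ys)                   # single-device reference
```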
Output Reduction
[Diagram: Naive Consolidation (gpu:0 to gpu:3 all report to a single cpu/gpu) vs. Ring All-Reduce (gpu:0 to gpu:3 connected in a ring)]
Baidu All-Reduce: Andrew Ng
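The difference between the two reduction schemes can be simulated in plain Python. The sketch below is a simplified model of the standard ring all-reduce algorithm (a scatter-reduce phase followed by an all-gather phase), not code from any of the libraries mentioned in the talk. It assumes n workers that each hold a vector of exactly n scalar chunks, and the gradient values are made up.

```python
def naive_consolidation(worker_values):
    """Every worker sends its whole vector to one device, which sums and broadcasts."""
    total = [sum(column) for column in zip(*worker_values)]
    return [list(total) for _ in worker_values]


def ring_all_reduce(worker_values):
    """Sum vectors across n workers in a ring; each vector has exactly n chunks."""
    n = len(worker_values)
    data = [list(values) for values in worker_values]

    # Phase 1, scatter-reduce: in each step worker i passes one chunk to its
    # right-hand neighbour, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        sends = [(i, (i - step) % n) for i in range(n)]
        payloads = [data[i][chunk] for i, chunk in sends]  # snapshot before updating
        for (i, chunk), value in zip(sends, payloads):
            data[(i + 1) % n][chunk] += value

    # Worker i now holds the fully reduced chunk (i + 1) % n.
    # Phase 2, all-gather: pass the finished chunks around the ring.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n) for i in range(n)]
        payloads = [data[i][chunk] for i, chunk in sends]
        for (i, chunk), value in zip(sends, payloads):
            data[(i + 1) % n][chunk] = value
    return data


gradients = [[1, 2, 3, 4],
             [10, 20, 30, 40],
             [100, 200, 300, 400],
             [1000, 2000, 3000, 4000]]
ring_result = ring_all_reduce(gradients)
naive_result = naive_consolidation(gradients)
```

Both schemes produce the same sums. In the ring, each worker only ever sends one chunk per step to its neighbour, whereas naive consolidation funnels every full vector through a single device.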
Distributed Accelerator Options
replicate_model_fn (TF1.5+; Naive Consolidation; Single Server)
distribute.MirroredStrategy (TF1.8+; All-Reduce; Single Server for now)
TPUEstimator (TF1.4+; All-Reduce; 64-TPU pod)
Horovod (TF1.8+; All-Reduce; Multi Server)
multi_gpu_model (Keras; Naive Consolidation; Single Server)
Meet Horovod: Alex Sergeev, Mike Del Balso
TPU / TPU Pod
TPUs  Batch Size     Time to 90 Epochs  Accuracy
1     256            23:22              76.6%
4     1024           05:48              76.3%
16    4096           01:30              76.5%
32    8192           45 min             76.1%
64    16384          22 min             75.0%
64    8192 -> 16384  29.5 min           76.1%
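One way to read this table is to convert the times into speedup and scaling efficiency relative to the single-TPU run. The sketch below does that arithmetic for the fixed-batch-size rows. It assumes that entries such as 23:22 are hours:minutes, which the slide does not state explicitly.

```python
# (TPUs, minutes to 90 epochs) from the table; "23:22" etc. read as hours:minutes.
RUNS = [(1, 23 * 60 + 22), (4, 5 * 60 + 48), (16, 1 * 60 + 30), (32, 45), (64, 22)]


def scaling(runs):
    """Return (tpus, speedup, efficiency) relative to the first run."""
    base_tpus, base_minutes = runs[0]
    rows = []
    for tpus, minutes in runs:
        speedup = base_minutes / minutes
        efficiency = speedup / (tpus / base_tpus)  # 1.0 would be perfectly linear
        rows.append((tpus, speedup, efficiency))
    return rows


for tpus, speedup, efficiency in scaling(RUNS):
    print("{:>2} TPUs: speedup {:5.1f}x, efficiency {:.2f}".format(
        tpus, speedup, efficiency))
```

Under that reading the training time scales almost linearly with the number of TPUs, while the accuracy column shows the cost of the largest fixed batch size.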
ImageNet is the New MNIST
Summary
tf.estimator
● Premade Estimators for quick and dirty jobs (5-part recipe)
● Custom models w/o the plumbing (5-part function)
● Keras Functional API for Inference() in custom model_fn()
● Use Checkpoints/Preemptible VMs/Cloud Storage to reduce cost
● Scalable model building from laptop to GPUs to TPU Pods
TFRecords & tf.data for pipeline performance
● Parallelize creation of TFRecord files ~ 100-150MB
● Use tf.data input_fn()
Distributed Accelerator options
● Input pipeline before bigger/faster/more accelerators
● Scale Up before Out
Thank you.