HIGH PERFORMANCE DISTRIBUTED TENSORFLOW
IN PRODUCTION WITH GPUS AND KUBERNETES
HPC ADVISORY COUNCIL, FEB 2018
CHRIS FREGLY
FOUNDER @ PIPELINE.AI
KEY TAKE-AWAYS
Optimize Your Models After Training
Validate Models Online in Live Production (Safely!)
Evaluate Model Performance Offline *and* Online
Monitor and Tune Your Model Serving Runtime
INTRODUCTIONS: ME
§ Chris Fregly, Founder & Engineer @PipelineAI
§ Formerly Netflix, Databricks, IBM Spark Tech
§ Advanced Spark and TensorFlow Meetup
§ Please Join Our 60,000+ Global Members!!
Contact Me
chris@pipeline.ai
@cfregly
Global Locations
* San Francisco
* Chicago
* Austin
* Washington DC
* Düsseldorf
* London
INTRODUCTIONS: YOU
§ Data Scientist, Data Engineer, Data Analyst, Data Curious
§ Want to Deploy ML/AI Models Rapidly and Safely
§ Need to Trace or Explain Model Predictions
§ Have a Decent Grasp of Computer Science Fundamentals
PIPELINE.AI IS 100% OPEN SOURCE
§ https://github.com/PipelineAI/pipeline/
§ Please Star this GitHub Repo!
§ VCs Value GitHub Stars @ $1,500 Each (?!)
GitHub Repo Geo Heat Map: http://jrvis.com/red-dwarf/
PIPELINE.AI OVERVIEW
500,000 Docker Downloads
60,000 Registered Users
60,000 Meetup Members
30,000 LinkedIn Followers
2,400 GitHub Stars
15 Enterprise Beta Users
RECENT PIPELINE.AI NEWS
Sept 2017 / Dec 2017 / Jan 2018
PipelineAI Becomes Google ML/AI Expert
Register to Install PipelineAI in Your Own Environment (Starting March 2018)
http://pipeline.ai
Try GPU Community Edition Today!
http://community.pipeline.ai
WHY HEAVY FOCUS ON MODEL SERVING?
Model Training:
§ Batch & Boring
§ Offline in Research Lab
§ Pipeline Ends at Training
§ No Insight into Live Production
§ Small Number of Data Scientists
§ 100's of Training Jobs per Day
§ Optimizations Very Well-Known
Model Serving:
§ Real-Time & Exciting!!
§ Online in Live Production
§ Pipeline Extends into Production
§ Continuous Insight into Live Production
§ Huuuuuuge Number of Application Users
§ 1,000,000's of Predictions per Sec
§ Runtime Optimizations Not Yet Explored
CLOUD-BASED MODEL SERVING OPTIONS
§ AWS SageMaker
§ Released Nov 2017 @ re:Invent
§ Custom Docker Images for Training/Serving (e.g. PipelineAI Images)
§ Distributed TensorFlow Training through Estimator API (sketch below)
§ Traffic Splitting for A/B Model Testing
§ Google Cloud ML Engine
§ Mostly Command-Line Based
§ Driving TensorFlow Open Source API (e.g. Estimator API)
§ Azure ML
PipelineAI Supports Hybrid-Cloud Deployments
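Both SageMaker and Cloud ML Engine drive distributed training through the Estimator API mentioned above. A minimal TF 1.x sketch, assuming a toy model and random data (distribution itself comes from the TF_CONFIG environment variable set on each node):

import numpy as np
import tensorflow as tf

def model_fn(features, labels, mode):
    # Toy linear classifier standing in for a real model.
    logits = tf.layers.dense(features["x"], 10)
    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode, predictions=tf.argmax(logits, axis=1))
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    train_op = tf.train.AdamOptimizer().minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": np.random.rand(1000, 784).astype(np.float32)},
    y=np.random.randint(0, 10, size=1000),
    batch_size=128, num_epochs=None, shuffle=True)

estimator = tf.estimator.Estimator(model_fn=model_fn, model_dir="/tmp/mnist")
# Each node reads its role (chief/worker/ps) from TF_CONFIG;
# the same code runs single-node when TF_CONFIG is unset.
tf.estimator.train_and_evaluate(
    estimator,
    tf.estimator.TrainSpec(input_fn=input_fn, max_steps=1000),
    tf.estimator.EvalSpec(input_fn=input_fn, steps=10))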
BUILD MODEL WITH THE RUNTIME
§ Package Model + Runtime into 1 Docker Image
§ Emphasizes Immutable Deployment and Infrastructure
§ Same Image Across All Environments
§ No Library or Dependency Surprises from Laptop to Production
§ Allows Tuning Model + Runtime Together
Build Local Model Server A:
pipeline predict-server-build --model-name=mnist \
    --model-tag=A \
    --model-type=tensorflow \
    --model-runtime=tfserving \
    --model-chip=gpu \
    --model-path=./tensorflow/mnist/
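The --model-path above points at a standard TensorFlow SavedModel export. A minimal TF 1.x export sketch for the MNIST example (the tiny linear model is illustrative; note the 'x' input and 'add' output tensor names, which reappear in the optimize command later):

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 784], name="x")
w = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
logits = tf.add(tf.matmul(x, w), b, name="add")

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    builder = tf.saved_model.builder.SavedModelBuilder("./tensorflow/mnist/model")
    signature = tf.saved_model.signature_def_utils.predict_signature_def(
        inputs={"x": x}, outputs={"add": logits})
    builder.add_meta_graph_and_variables(
        sess, [tf.saved_model.tag_constants.SERVING],
        signature_def_map={"serving_default": signature})
    builder.save()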
TUNE MODEL + RUNTIME TOGETHER
§ Model Training Optimizations
§ Model Hyper-Parameters (e.g. Learning Rate)
§ Reduced Precision (e.g. FP16 Half Precision)
§ Model Serving (Post-Train) Optimizations
§ Quantize Model Weights + Activations From 32-bit to 8-bit
§ Fuse Neural Network Layers Together
§ Model Runtime Optimizations
§ Runtime Config: Request Batch Size, etc
§ Different Runtime: TensorFlow Serving CPU/GPU, Nvidia TensorRT
SERVING (POST-TRAIN) OPTIMIZATIONS
§ Prepare Model for Serving
§ Simplify Network, Reduce Size
§ Reduce Precision -> Fast Math
§ Some Tools
§ Graph Transform Tool (GTT)
§ tfcompile
(Graph visualizations shown: After Training vs. After Optimizing!)
pipeline optimize --optimization-list=['quantize_weights','tfcompile'] \
    --model-name=mnist \
    --model-tag=A \
    --model-path=./tensorflow/mnist/model \
    --model-inputs=['x'] \
    --model-outputs=['add'] \
    --output-path=./tensorflow/mnist/optimized_model
Linear Regression Model Size: 70MB -> 70KB (!)
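The quantize_weights transform above maps directly onto the Graph Transform Tool (GTT) listed earlier. A minimal GTT sketch, assuming a frozen GraphDef file (the filename is an assumption; 'x' and 'add' match the MNIST export):

import tensorflow as tf
from tensorflow.tools.graph_transforms import TransformGraph

graph_def = tf.GraphDef()
with tf.gfile.GFile("frozen_mnist.pb", "rb") as f:   # assumed frozen graph file
    graph_def.ParseFromString(f.read())

transforms = ["strip_unused_nodes", "fold_constants(ignore_errors=true)",
              "fold_batch_norms", "quantize_weights"]
optimized = TransformGraph(graph_def, ["x"], ["add"], transforms)

with tf.gfile.GFile("optimized_mnist.pb", "wb") as f:
    f.write(optimized.SerializeToString())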
NVIDIA TENSORRT RUNTIME
§ Post-Training Model Optimizations
§ Specific to Nvidia GPUs
§ GPU-Optimized Prediction Runtime
§ Alternative to TensorFlow Serving
§ PipelineAI Supports TensorRT!
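TensorFlow's contrib-level TensorRT integration from roughly this period can rewrite a frozen graph so supported subgraphs run as TensorRT engines. A sketch, assuming tf.contrib.tensorrt is available and reusing the assumed frozen-graph file from the GTT example (batch size and workspace values are illustrative):

import tensorflow as tf
import tensorflow.contrib.tensorrt as trt

graph_def = tf.GraphDef()
with tf.gfile.GFile("frozen_mnist.pb", "rb") as f:   # assumed frozen graph file
    graph_def.ParseFromString(f.read())

# Replaces supported subgraphs with TensorRT-optimized ops (Nvidia GPUs only).
trt_graph = trt.create_inference_graph(
    input_graph_def=graph_def,
    outputs=["add"],
    max_batch_size=128,
    max_workspace_size_bytes=1 << 30,
    precision_mode="FP16")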
TENSORFLOW LITE RUNTIME
§ Post-Training Model Optimizations
§ Currently Supports iOS and Android
§ On-Device Prediction Runtime
§ Low-Latency, Fast Startup
§ Selective Operator Loading
§ 70KB Min - 300KB Max Runtime Footprint
§ Supports Accelerators (GPU, TPU)
§ Falls Back to CPU without Accelerator
§ Java and C++ APIs
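Converting the earlier SavedModel into a .tflite flatbuffer for the on-device runtime; this sketch uses the tf.lite.TFLiteConverter API from later TensorFlow releases (the early-2018 equivalent lived under tf.contrib.lite):

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("./tensorflow/mnist/model")
tflite_model = converter.convert()   # returns the flatbuffer as bytes

with open("mnist.tflite", "wb") as f:
    f.write(tflite_model)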
3 DIFFERENT RUNTIMES, SAME MODEL
Build Local Model Server A:
pipeline predict-server-build --model-name=mnist \
    --model-tag=A \
    --model-type=tensorflow \
    --model-runtime=tfserving \
    --model-chip=cpu \
    --model-path=./tensorflow/mnist/

Build Local Model Server B:
pipeline predict-server-build --model-name=mnist \
    --model-tag=B \
    --model-type=tensorflow \
    --model-runtime=tfserving \
    --model-chip=gpu \
    --model-path=./tensorflow/mnist/

Build Local Model Server C:
pipeline predict-server-build --model-name=mnist \
    --model-tag=C \
    --model-type=tensorflow \
    --model-runtime=tensorrt \
    --model-chip=gpu \
    --model-path=./tensorflow/mnist/

Same Model, Different Runtime
RUN A LOADTEST LOCALLY!
§ Perform Mini-Load Test on Local Model Server
§ Immediate, Local Prediction Performance Metrics
§ Compare to Previous Model + Runtime Variations
§ Gain Intuition Before Push to Prod
Start Local Model Servers:
pipeline predict-server-start --model-name=mnist \
    --model-tag=A \
    --memory-limit=2G

Start Local LoadTest:
pipeline predict-http-test --model-endpoint-url=http://localhost:8080 \
    --test-request-path=test_request.json \
    --test-request-concurrency=1000
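For intuition about what the load test measures, a hand-rolled mini load test in the same spirit as predict-http-test (endpoint URL, payload file, and request count mirror the flags above; a sketch, not the PipelineAI implementation):

import json
from concurrent.futures import ThreadPoolExecutor
import requests   # assumption: the requests library is installed

ENDPOINT = "http://localhost:8080"
with open("test_request.json") as f:
    payload = json.load(f)

def predict(_):
    # One prediction request; returns wall-clock latency in seconds.
    return requests.post(ENDPOINT, json=payload).elapsed.total_seconds()

with ThreadPoolExecutor(max_workers=100) as pool:
    latencies = sorted(pool.map(predict, range(1000)))

print("p50: %.3fs  p99: %.3fs" % (latencies[499], latencies[989]))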
PUSH IMAGE TO DOCKER REGISTRY
§ Supports All Public + Private Docker Registries
§ DockerHub, Artifactory, Quay, AWS, Google, …
§ Or Self-Hosted, Private Docker Registry
Push Images to Docker Registry:
pipeline predict-server-push --model-name=mnist \
    --model-tag=A \
    --image-registry-url=<your-registry> \
    --image-registry-repo=<your-repo>
DEPLOY MODELS SAFELY TO PROD
§ Deploy from CLI or Jupyter Notebook
§ Tear-Down and Rollback Models Quickly
§ Shadow Canary: Deploy to 20% Live Traffic
§ Split Canary: Deploy to 97-2-1% Live Traffic
Start Cluster A:
pipeline predict-kube-start --model-name=mnist \
    --model-tag=A

Start Cluster B:
pipeline predict-kube-start --model-name=mnist \
    --model-tag=B

Start Cluster C:
pipeline predict-kube-start --model-name=mnist \
    --model-tag=C

Route Live Traffic:
pipeline predict-kube-route --model-name=mnist \
    --model-split-tag-and-weight-dict='{"A":97, "B":2, "C":1}' \
    --model-shadow-tag-list='[]'
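The 97-2-1 split canary behaves like a weighted random route decided per request. In miniature (illustrative routing logic, not PipelineAI's actual router):

import random
from collections import Counter

split = {"A": 97, "B": 2, "C": 1}   # mirrors --model-split-tag-and-weight-dict

def route():
    tags = list(split)
    return random.choices(tags, weights=[split[t] for t in tags], k=1)[0]

# Roughly 97% of requests land on A, 2% on B, 1% on C.
print(Counter(route() for _ in range(100000)))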
COMPARE MODELS OFFLINE & ONLINE
§ Offline, Batch Metrics
§ Validation + Training Accuracy
§ CPU + GPU Utilization
§ Online, Live Prediction Values
§ Compare Relative Precision
§ Newly-Seen, Streaming Data
§ Online, Real-Time Metrics
§ Response Time, Throughput
§ Cost ($) Per Prediction
ENSEMBLE PREDICTION AUDIT TRAIL
§ Necessary for Model Explainability
§ Fine-Grained Request Tracing
§ Used for Model Ensembles
REAL-TIME PREDICTION STREAMS
§ Visually Compare Real-time Predictions
(Dashboard shown: Features and Inputs alongside Predictions and Confidences, side by side for Model A, Model B, and Model C)
PREDICTION PROFILING AND TUNING
§ Pinpoint Performance Bottlenecks
§ Fine-Grained Prediction Metrics
§ 3 Steps in Real-Time Prediction
1. transform_request()
2. predict()
3. transform_response()
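A sketch of per-step timing around those three stages (the handler bodies are placeholders for your own transforms and model call):

import time

def transform_request(raw):       # placeholder: e.g. JSON -> tensor
    return raw

def predict(features):            # placeholder: model inference
    return features

def transform_response(pred):     # placeholder: tensor -> JSON
    return pred

def timed(name, fn, arg):
    start = time.time()
    result = fn(arg)
    print("%s: %.1f ms" % (name, (time.time() - start) * 1000.0))
    return result

def handle(raw_request):
    features = timed("transform_request", transform_request, raw_request)
    prediction = timed("predict", predict, features)
    return timed("transform_response", transform_response, prediction)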
SHIFT TRAFFIC TO MAX(REVENUE)
§ Shift Traffic to Winning Model with Multi-armed Bandits
LIVE, ADAPTIVE TRAFFIC ROUTING
§ A/B Tests
§ Inflexible and Boring
§ Multi-Armed Bandits
§ Adaptive and Exciting!
Route Traffic Dynamically:
pipeline predict-kube-route --model-name=mnist \
    --model-split-tag-and-weight-dict='{"A":1, "B":2, "C":97}' \
    --model-shadow-tag-list='[]'
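A multi-armed bandit shifts weight toward the winning model as evidence accumulates instead of waiting for a fixed A/B test to finish. An epsilon-greedy sketch (reward is a stand-in for revenue per prediction; the bookkeeping is illustrative):

import random

EPSILON = 0.05                       # fraction of traffic spent exploring
rewards = {"A": [], "B": [], "C": []}

def choose_model():
    unexplored = [t for t, r in rewards.items() if not r]
    if unexplored or random.random() < EPSILON:
        return random.choice(unexplored or list(rewards))        # explore
    return max(rewards,
               key=lambda t: sum(rewards[t]) / len(rewards[t]))  # exploit

def record_reward(tag, reward):      # e.g. revenue attributed to a prediction
    rewards[tag].append(reward)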
SHIFT TRAFFIC TO MIN(CLOUD CO$T)
§ Based on Cost ($) Per Prediction
§ Cost Changes Throughout Day
§ Lose AWS Spot Instances
§ Google Cloud Becomes Cheaper
§ Shift Across Clouds & On-Prem
PSEUDO-CONTINUOUS TRAINING
§ Identify and Fix Borderline (Unconfident) Predictions (sketch below)
§ Fix Predictions Along Class Boundaries
§ Facilitate "Human in the Loop"
§ Retrain with Newly-Labeled Data
§ Game-ify the Labeling Process
§ Path to Crowd-Sourced Labeling
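The sketch referenced above: flag borderline predictions for the human-in-the-loop labeling queue (threshold and shapes are illustrative):

import numpy as np

def borderline(softmax_batch, threshold=0.6):
    # softmax_batch: (batch, num_classes) class probabilities per prediction
    top = softmax_batch.max(axis=1)
    return np.where(top < threshold)[0]   # indices to send for relabeling

batch = np.array([[0.90, 0.05, 0.05],    # confident -> keep
                  [0.40, 0.35, 0.25]])   # borderline -> relabel
print(borderline(batch))                 # -> [1]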
CONTINUOUS MODEL TRAINING
§ The Holy Grail of Machine Learning!
§ PipelineAI Supports Continuous Model Training!
§ Kafka, Kinesis
§ Spark Streaming, Flink
§ Storm, Heron
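For the ingest half of continuous training, a Spark Structured Streaming sketch that reads newly labeled examples off Kafka (topic name, servers, and sink paths are assumptions; the retraining trigger itself is not shown):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("continuous-training").getOrCreate()

labeled = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "localhost:9092")
           .option("subscribe", "newly-labeled-examples")
           .load()
           .select(col("value").cast("string").alias("example")))

(labeled.writeStream
        .format("parquet")
        .option("path", "/data/retrain")
        .option("checkpointLocation", "/data/retrain-checkpoint")
        .start()
        .awaitTermination())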
THANK YOU!!
§ Please Star this GitHub Repo!
§ All slides, code, notebooks, and Docker images here:
https://github.com/PipelineAI/pipeline
Contact Me
chris@pipeline.ai
@cfregly
