Running Tensorflow In Production: Challenges and Solutions on YARN 3.x

© Hortonworks Inc. 2011–2018. All rights reserved
Yanbo Liang
Wangda Tan

Intro
• Yanbo Liang
  • Staff software engineer at Hortonworks
  • Apache Spark PMC member and committer
  • TensorFlow and XGBoost contributor
  • Works at the intersection of systems and algorithms for machine learning and deep learning
• Wangda Tan
  • Engineering manager at Hortonworks
  • Apache Hadoop PMC member and committer
  • Works on the scheduler, deep learning on YARN, GPUs on YARN, etc.

Agenda
• With the machine learning hat on: what is required?
• With the infra hat on: how can we help?

With Machine Learning Hat

What are the problems?
• Ads CTR Prediction
• Recommendation
• Fraud Detection
• Natural Language Processing
• Object Detection
• Automatic Speech Recognition
• Image & Video Recognition
• Personal Assistant
• Financial Forecasting

Who are we?
• Machine learning engineers or data scientists
• What are we familiar with?
  • Linear algebra, statistics, machine learning algorithms and models, deep neural networks (DNN/CNN/RNN), basic programming skills, etc.
• What are we not familiar with?
  • System environment and programming, resource management and scheduling, networking and storage, etc.

What we use
• Liblinear
• LibFM
• Scikit-learn
• XGBoost/LightGBM
• Spark MLlib
• TensorFlow/PyTorch/MXNet

How do we work?
• Where are the training and test datasets?
  • HDFS / S3
  • Shared between team members
  • Distributed preprocessing with MapReduce/Spark
• How do we run experiments?
  • Sample from the full dataset
  • Choose state-of-the-art models and tune hyperparameters with cross-validation
  • Single node with CPUs
  • Single node with GPUs
• Train with the best parameters on the full dataset
  • Multi-node with CPUs and GPUs
• Push the model into serving
Services on YARN

With Infra Hat

What it takes to support an easy-to-use machine learning platform
What the machine learning engineer sees
What the infra engineer sees

Important things we need to make the ML engineer's life easier
• Packaging / Dependency.
• GPU Isolation.
• Easy shared FS (e.g. HDFS) access.
• Job Tracking.
• Easy to deploy.

Do ML experiments on YARN (1)
• Need to sample or preprocess data from HDFS / S3?
  • Many YARN apps are designed for this!
• No high-end GPUs on your laptop for experiments?
  • No worries! Pack your dependencies into a Docker image and launch it as a YARN service.

Do ML experiments on YARN (2)
• What you need:
  • Pack your dependencies (Docker image).
  • Specify resources, including GPUs, in the YARN service spec.
  • Tip: Need HDFS access? Just install the Hadoop libraries.
  • Launch the notebook service on YARN.
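A minimal sketch of what such a service spec might look like, built as a Python dict and printed as JSON. The field names ("components", "artifact", "resource", and "yarn.io/gpu" under "additional") follow our understanding of the YARN service API in Hadoop 3.1 and should be checked against your version. The service name, component name, resource sizes, and launch command are hypothetical, and the Docker image is left as a placeholder.

```python
import json


def notebook_service_spec(name, docker_image, launch_command,
                          cpus=2, memory_mb=8192, gpus=1):
    """Build a single-container notebook service spec as a plain dict.

    Field names are assumptions based on the Hadoop 3.1 YARN service API.
    """
    resource = {"cpus": cpus, "memory": str(memory_mb)}
    if gpus > 0:
        # GPUs are requested as an additional resource type.
        resource["additional"] = {"yarn.io/gpu": {"value": gpus, "unit": ""}}
    return {
        "name": name,
        "version": "1.0.0",
        "components": [
            {
                "name": "notebook",  # hypothetical component name
                "number_of_containers": 1,
                "artifact": {"id": docker_image, "type": "DOCKER"},
                "launch_command": launch_command,
                "resource": resource,
            }
        ],
    }


spec = notebook_service_spec(
    name="notebook-demo",                 # hypothetical service name
    docker_image="<docker-image>",        # placeholder, fill in your image
    launch_command="jupyter notebook --ip=0.0.0.0 --no-browser",
)
print(json.dumps(spec, indent=2))
```

Generating the spec from code rather than editing JSON by hand makes it easy to request a CPU-only variant (gpus=0) for cheap experiments.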

From experiments to production? YARN can help!
• Once you have your training code ready.
• Deploy the training task to YARN.
• Run it daily, weekly, etc.
• And here's the spec file:
submit-tf-job.py --input-spec <input-spec>
    --docker_image <docker-image>

Running Tensorflow on YARN service spec
Want distributed TensorFlow?
Just add "ps" and "master" components.
Let YARN do the wire-up, TF_CONFIG generation, etc.
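To make the wire-up concrete, here is a hedged sketch of the kind of TF_CONFIG value that has to be generated for each task. The JSON layout ("cluster" plus "task" with "type" and "index") is the standard TF_CONFIG format. The hostname pattern <component>-<index>.<service>.<user>.<domain> is modeled on the TensorBoard address shown under DNS support. The helper names, the port, and the example.com domain are assumptions for illustration, not YARN's actual implementation.

```python
import json


def component_hosts(component, count, service, user, domain, port):
    """List host:port entries for every instance of one component."""
    return [f"{component}-{i}.{service}.{user}.{domain}:{port}"
            for i in range(count)]


def build_tf_config(cluster, task_type, task_index):
    """Return the TF_CONFIG JSON string for one task in the cluster."""
    if task_type not in cluster:
        raise ValueError(f"unknown task type: {task_type}")
    if not 0 <= task_index < len(cluster[task_type]):
        raise IndexError(f"no {task_type} with index {task_index}")
    return json.dumps({
        "cluster": cluster,
        "task": {"type": task_type, "index": task_index},
    })


# Hypothetical layout: 1 master, 1 parameter server, 2 workers.
naming = dict(service="distributed-tf", user="wtan",
              domain="example.com", port=8000)
cluster = {
    "master": component_hosts("master", 1, **naming),
    "ps": component_hosts("ps", 1, **naming),
    "worker": component_hosts("worker", 2, **naming),
}

# Each container would get its own value exported as the TF_CONFIG env var.
print(build_tf_config(cluster, "worker", 1))
```

Every task sees the same "cluster" section and differs only in its "task" entry, which is why the platform can generate it once it knows the component names and instance counts.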

After model training? Deploy online model serving on YARN
• Note:
  • Uses simple_tensorflow_serving (github.com/tobegit3hub/simple_tensorflow_serving)
  • Access the serving REST endpoint at http://serving.serving-job-001.<domain-name>:port
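A client call against such an endpoint might look like the sketch below, which uses only the Python standard library. The payload shape ({"data": {...}}) follows our reading of the simple_tensorflow_serving README. The input name "keys", the example.com domain, and port 8500 are assumptions that depend on your model signature and deployment.

```python
import json
import urllib.request


def build_predict_request(endpoint, inputs):
    """Build (but do not send) a JSON POST request for a prediction."""
    body = json.dumps({"data": inputs}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(endpoint, inputs, timeout=10):
    """Send the request and decode the JSON response."""
    request = build_predict_request(endpoint, inputs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical endpoint and input; nothing is sent until predict() is called.
endpoint = "http://serving.serving-job-001.example.com:8500"
request = build_predict_request(endpoint, {"keys": [[11.0], [2.0]]})
print(request.get_method(), request.full_url)
```

Separating request construction from sending keeps the payload easy to inspect or unit test without a running serving job.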

Recent work in YARN to support ML workloads like TensorFlow
• GPU isolation/scheduling support
• Native Service: easy to define and run any custom service
• All of the above is available in Apache Hadoop 3.1.0

GPU support on YARN (Apache Hadoop 3.1.0)
• Why is isolation needed?
  • Multiple processes sharing a single GPU will:
    • Be serialized.
    • Easily run out of memory (OOM).
• GPU isolation on YARN:
  • Granularity is per GPU device.
  • Uses cgroups / Docker to enforce the isolation.
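The per-device granularity can be illustrated with a toy allocator. This is a simplified model for illustration only, not YARN's implementation: each GPU is handed to at most one container, and every other device lands on that container's deny list, which is the kind of information a cgroups or Docker layer would then enforce. The class and method names are hypothetical.

```python
class GpuAllocator:
    """Toy per-device GPU allocator: one GPU belongs to at most one container."""

    def __init__(self, device_ids):
        self._all = sorted(device_ids)
        self._free = list(self._all)
        self._assigned = {}  # container id -> list of device ids

    def allocate(self, container_id, count):
        """Assign `count` whole GPUs to a container, or fail without side effects."""
        if container_id in self._assigned:
            raise ValueError(f"{container_id} already holds GPUs")
        if count > len(self._free):
            raise RuntimeError(
                f"requested {count} GPUs, only {len(self._free)} free")
        devices = [self._free.pop(0) for _ in range(count)]
        self._assigned[container_id] = devices
        return devices

    def release(self, container_id):
        """Return a container's GPUs to the free pool."""
        self._free.extend(self._assigned.pop(container_id, []))
        self._free.sort()

    def denied_devices(self, container_id):
        """Devices this container must not access (its isolation deny list)."""
        allowed = set(self._assigned.get(container_id, []))
        return [d for d in self._all if d not in allowed]


allocator = GpuAllocator([0, 1, 2, 3])
first = allocator.allocate("container_1", 2)
second = allocator.allocate("container_2", 1)
print(first, second, allocator.denied_devices("container_1"))
```

Because whole devices are the unit of allocation, two containers never share a GPU, which avoids the serialization and out-of-memory problems listed above.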

Docker + GPU support on YARN (Apache Hadoop 3.1.0)
• Most machine learning platforms have Python/R/cuDNN/CUDA dependencies.
• Docker solves messy dependency issues.
• But it may introduce problems for GPU base libraries.
• nvidia-docker-plugin mounts the Nvidia driver, etc. when the container is launched.
• YARN supports Docker as well as nvidia-docker-plugin.
[Figure: two Docker containers (Ubuntu 14.04, TensorFlow 1.2, CUDA Library 5.0) on a host OS with GPU Base Lib v1. With a volume mount, the container uses the host's GPU Base Lib v1. Without it, the container's own GPU Base Lib v2 does not match the host's v1 and the launch fails.]

Finally, make our life easier – sharing
[Figure: cluster nodes (128 G each) shared between LLAP and GPU workloads.]

Advanced features
• Kerberized HDFS access.
  • Inside the YARN service spec file, add your Kerberos principal:
"kerberos_principal" : {
  "principal_name" : "test-user@EXAMPLE.COM",
  "keytab" : "file:///etc/security/keytabs/test-user.headless.keytab"
}
• DNS support.
  • YARN service includes DNS support since Hadoop 3.1.0.
  • Required if you want to use distributed TensorFlow or the TensorBoard quick link.
  • Web addresses look like http://tensorboard-0.distributed-tf.wtan.hwxgpu.site:6006/

Questions?
