1 © Hortonworks Inc. 2011–2018. All rights reserved
Running TensorFlow in Production: Challenges and
Solutions on YARN 3.x
Yanbo Liang
Wangda Tan
Intro
• Yanbo Liang
• Staff software engineer at Hortonworks
• Apache Spark PMC member and committer
• TensorFlow and XGBoost contributor
• Works at the intersection of systems and algorithms
for machine learning and deep learning
• Wangda Tan
• Engineering Manager at Hortonworks
• Apache Hadoop PMC member and committer
• Works on the scheduler, deep learning on YARN, GPUs on YARN, etc.
Agenda
• With the machine learning hat on: what is required?
• With the infra hat on: how can we help?
With Machine Learning Hat
What are the problems?
• Ads CTR Prediction
• Recommendation
• Fraud Detection
• Natural Language Processing
• Object Detection
• Automatic Speech Recognition
• Image & Video Recognition
• Personal Assistant
• Financial Forecasting
Who are we?
• Machine learning engineers or data scientists
• What are we familiar with?
• Linear algebra, statistics, machine learning algorithms and models, deep neural networks (DNN/CNN/RNN), basic programming skills, etc.
• What are we not familiar with?
• System environments and systems programming, resource management and scheduling, networking and storage, etc.
What we use
• Liblinear
• LibFM
• Scikit-learn
• XGBoost/LightGBM
• Spark MLlib
• TensorFlow/PyTorch/MXNet
How do we work?
• Where are the training and test datasets?
• HDFS / S3
• Sharing between team members
• Distributed preprocessing with MapReduce/Spark
• How to do experiments?
• Sample from full dataset
• Choose state-of-the-art models and tune hyperparameters with cross-validation
• Single node with CPUs
• Single node with GPUs
• Train with best parameters on full dataset
• Multi-node with CPUs and GPUs
• Push model into serving
Services on YARN
With Infra Hat
Things to do to support an easy-to-use machine learning platform
What the machine learning engineer sees
What the infra engineer sees
Important things we need to make the ML engineer's life easier
• Packaging / Dependency.
• GPU Isolation.
• Easy shared FS (e.g. HDFS) access.
• Job Tracking.
• Easy to deploy.
Do ML experiments on YARN (1)
• Sample / preprocess data from HDFS / S3?
• Many YARN apps are designed for this!
• No high-end GPUs on your laptop for experiments?
• No worries! Pack your dependencies into a Docker image and launch it as a YARN service.
Do ML experiments on YARN (2)
• What you need:
• Pack your dependencies (Docker image).
• Specify resources, including GPUs, in the YARN service spec.
• Tip: Need HDFS access? Just install the Hadoop libraries.
• Launch the notebook service on YARN.
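The checklist above can be sketched as a small script that generates a service spec for a single notebook container. This is an illustration under assumptions, not the spec used in the talk: the service name, image name, launch command and resource sizes are hypothetical, and the `additional` / `yarn.io/gpu` layout follows the YARN service API as we understand it for Hadoop 3.1, so check it against your cluster's documentation.

```python
import json


def notebook_service_spec(name, docker_image, gpus, memory_mb=8192, cpus=4):
    """Build a YARN service spec (as a dict) for a single notebook container."""
    return {
        "name": name,
        "version": "1.0.0",
        "components": [
            {
                "name": "notebook",
                "number_of_containers": 1,
                # The Docker image carries Python, CUDA and the Hadoop libraries.
                "artifact": {"id": docker_image, "type": "DOCKER"},
                # Hypothetical launch command; use whatever starts your notebook.
                "launch_command": "jupyter notebook --ip=0.0.0.0 --port=8888",
                "resource": {
                    "cpus": cpus,
                    "memory": str(memory_mb),
                    # GPUs are requested as an additional resource type (assumed layout).
                    "additional": {"yarn.io/gpu": {"value": gpus}},
                },
            }
        ],
    }


# Hypothetical service and image names, for illustration only.
spec = notebook_service_spec("notebook-demo", "example/tf-notebook:latest", gpus=1)
print(json.dumps(spec, indent=2))
```

Saved to a file, a spec like this would typically be launched with the `yarn app -launch <service-name> <spec-file>` command.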
From experiments to production? YARN can help!
• Once you have your training code ready:
• Deploy the training task to YARN.
• Run it daily, weekly, etc.
• And here's how to submit it with the spec file:
submit-tf-job.py --input-spec <input-spec> --docker_image <docker-image>
Running TensorFlow on YARN: service spec
Want distributed TensorFlow?
Just add "ps" and "master" components.
Let YARN handle the wire-up, TF_CONFIG generation, etc.
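To make "TF_CONFIG generation" concrete, here is a minimal sketch of the JSON that each task of a distributed TensorFlow job expects in its `TF_CONFIG` environment variable. It is not the code YARN runs. The host-name pattern `<component>-<index>.<service>.<user>.<domain>` is assumed from the TensorBoard address shown in this deck, and the port 2222 and the "worker" component are illustrative choices.

```python
import json


def build_tf_config(service, user, domain, counts, task_type, task_index, port=2222):
    """Return the TF_CONFIG JSON string for one task of a distributed job.

    Host names follow the <component>-<index>.<service>.<user>.<domain>
    pattern, which is an assumption based on the DNS names used in this deck.
    """
    cluster = {
        role: [f"{role}-{i}.{service}.{user}.{domain}:{port}" for i in range(n)]
        for role, n in counts.items()
    }
    return json.dumps(
        {"cluster": cluster, "task": {"type": task_type, "index": task_index}},
        sort_keys=True,
    )


tf_config = build_tf_config(
    service="distributed-tf",
    user="wtan",
    domain="hwxgpu.site",
    counts={"master": 1, "ps": 1, "worker": 2},
    task_type="worker",
    task_index=1,
)
print(tf_config)
```

Each container gets the same `cluster` section and a different `task` section, which is why generating it centrally saves the ML engineer from hand-editing host lists.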
After model training? Deploy online model serving on YARN
• Note:
• Uses simple_tensorflow_serving (github.com/tobegit3hub/simple_tensorflow_serving)
• Use http://serving.serving-job-001.<domain-name>:port to access the serving REST endpoint
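A client call against that endpoint might look like the sketch below, which only prepares the request and does not send it. The payload keys (`model_name`, `data`) are an assumption modeled on the simple_tensorflow_serving README, and the domain, port and feature values are placeholders; verify the request format against the version you deploy.

```python
import json
import urllib.request


def build_predict_request(domain, port, model_name, features):
    """Prepare (but do not send) a JSON POST to the serving endpoint."""
    url = f"http://serving.serving-job-001.{domain}:{port}"
    # Assumed payload shape; adjust to the serving framework you actually run.
    payload = {"model_name": model_name, "data": {"features": features}}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder domain, port and input data.
request = build_predict_request("example.com", 8500, "default", [[1.0, 2.0]])
print(request.full_url)
# To actually call the service: urllib.request.urlopen(request).read()
```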
Recent work in YARN to support ML workloads like TensorFlow
• GPU isolation/scheduling support
• Native Service - Easy to define and run any custom service
→ All of the above is available in Apache Hadoop 3.1.0
GPU support on YARN (Apache Hadoop 3.1.0)
• Why do we need isolation?
• Multiple processes sharing a single GPU will:
• Be serialized.
• Easily cause out-of-memory (OOM) errors.
• GPU isolation on YARN:
• Granularity is per GPU device.
• Uses cgroups / Docker to enforce the isolation.
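Per-device granularity means a container is granted whole GPUs and never a fraction of one. The toy allocator below illustrates that bookkeeping only; it is not YARN's scheduler code, and the class name and container IDs are hypothetical. Real enforcement happens through cgroups or Docker, as the slide says.

```python
class GpuAllocator:
    """Toy allocator that hands out whole GPU devices, one owner per device."""

    def __init__(self, device_ids):
        self.free = list(device_ids)
        self.assigned = {}  # container id -> list of device ids

    def allocate(self, container_id, count):
        # Refuse the request rather than let two containers share a device.
        if count > len(self.free):
            raise RuntimeError(f"only {len(self.free)} GPU(s) free, {count} requested")
        devices, self.free = self.free[:count], self.free[count:]
        self.assigned[container_id] = devices
        return devices

    def release(self, container_id):
        # Return the container's devices to the free pool.
        self.free.extend(self.assigned.pop(container_id, []))


allocator = GpuAllocator([0, 1, 2, 3])
print(allocator.allocate("container_01", 2))  # [0, 1]
print(allocator.allocate("container_02", 1))  # [2]
allocator.release("container_01")
print(sorted(allocator.free))  # [0, 1, 3]
```

Because a device has at most one owner at a time, the serialization and OOM problems from sharing cannot arise between containers.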
Docker + GPU support on YARN (Apache Hadoop 3.1.0)
• Most machine learning platforms have Python/R/cuDNN/CUDA dependencies.
• Docker solves messy dependency issues.
• But it may introduce problems for GPU base libraries.
• nvidia-docker-plugin mounts the Nvidia driver, etc. when the container is launched.
• YARN supports Docker as well as nvidia-docker-plugin.
[Figure: two container stacks, each running TensorFlow 1.2 and CUDA Library 5.0 on Ubuntu 14.04 over a host OS with GPU Base Lib v1. In one, GPU Base Lib v1 is volume-mounted from the host. In the other, the image carries GPU Base Lib v2 while the host has v1, and it fails.]
Finally, make our life easier – sharing
[Figure: cluster nodes labeled 128 G, with LLAP and GPUs shown sharing them.]
Advanced features
• Kerberized HDFS access.
• Inside the YARN service spec file, add your Kerberos principal:
"kerberos_principal" : {
"principal_name" : "test-user@EXAMPLE.COM",
"keytab" : "file:///etc/security/keytabs/test-user.headless.keytab"
}
• DNS support.
• The YARN service framework has included DNS support since Hadoop 3.1.0.
• Required if you want to use distributed TensorFlow or the TensorBoard quick link:
• Web address like http://tensorboard-0.distributed-tf.wtan.hwxgpu.site:6006/
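Both features can be sketched in a few lines: one helper attaches the `kerberos_principal` section to a service spec, and another assembles the TensorBoard address from the service's DNS name. The helper names are hypothetical and the example spec has its components elided; only the principal, keytab path and URL come from the slide.

```python
import json


def with_kerberos(spec, principal_name, keytab_uri):
    """Return a copy of a service spec dict with a kerberos_principal section."""
    secured = dict(spec)
    secured["kerberos_principal"] = {
        "principal_name": principal_name,
        "keytab": keytab_uri,
    }
    return secured


def tensorboard_url(service, user, domain, port=6006):
    """Build the TensorBoard address from the service's DNS name."""
    return f"http://tensorboard-0.{service}.{user}.{domain}:{port}/"


spec = with_kerberos(
    {"name": "distributed-tf", "components": []},  # components elided for brevity
    "test-user@EXAMPLE.COM",
    "file:///etc/security/keytabs/test-user.headless.keytab",
)
print(json.dumps(spec, indent=2))
print(tensorboard_url("distributed-tf", "wtan", "hwxgpu.site"))
```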
Questions?

Editor's Notes

  • #19: Even though TF provides options to use less GPU memory than the whole device offers, we cannot enforce this from outside the process.
  • #20: Even though TF provides options to use less GPU memory than the whole device offers, we cannot enforce this from outside the process.