Running TensorFlow on Apache YARN –
A sneak peek into GPU Scheduling
Sunil Govindan
Apache Hadoop PMC member
YARN Team @ Hortonworks
Slide 2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
 Overview of Machine Learning on the Big Data Platform
 GPU support in Apache Hadoop YARN
 TensorFlow on YARN – example and demo
Slide 3
Overview:
Machine Learning on Big Data Platform
Slide 4
Machine learning workflow
(Workflow diagram, grouped into four stages:)
– Data Preprocessing / Feature Engineering: Data, Feature Transform, Feature Encoding, Feature Selection, Feature Evaluation
– Model Training: Model Training, Model Evaluation, Model Validation, Model Staging
– Online Service: Feature/Model Database, Experiment (offline/online), Model as Service, Real-time Feature, Calibration
Slide 5
Machine learning (BigData) – Data Preprocessing
(Workflow diagram: Data Preprocessing stage highlighted.)
 Import data
– HDFS
– AWS
– RDBMS
 Join data
 Data exploration
 Data sample
 Training/Test random split
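On a real cluster the random split would typically be done with Spark's randomSplit; the idea in miniature, as a plain-Python sketch (the helper name is ours):

```python
import random

def train_test_split(rows, test_ratio=0.2, seed=42):
    """Randomly split rows into train/test sets (hypothetical helper;
    on a cluster this would be Spark's DataFrame.randomSplit)."""
    rng = random.Random(seed)          # fixed seed for a reproducible split
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

train, test = train_test_split(list(range(100)))
print(len(train), len(test))  # 80 20
```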
Slide 6
Machine learning (BigData) – Feature Engineering
(Workflow diagram: Feature Engineering stage highlighted.)
 Feature transform/selection
 Feature embedding
Slide 7
Machine learning (BigData) – Model Training
(Workflow diagram: Model Training stage highlighted.)
 Traditional machine learning models
– Logistic Regression
– Gradient boosting tree
– Recommendation/ALS
– LDA
 Libraries
– Apache Spark MLlib
– XGBoost
 Deep learning models
– DNN
– CNN
– RNN
– LSTM
 Libraries
– TensorFlow
– Apache MXNet
– BigDL
Slide 8
Machine learning (BigData) – Model Serving
(Workflow diagram: Online Service / Model Serving stage highlighted.)
 Model deploy
 Model serving
– Batch
– Streaming
 Experiment
– offline
– online (A/B test)
Slide 9
GPU support in Apache Hadoop YARN
Slide 10
Machine learning platform on YARN
(Stack diagram: Spark MLlib, XGBoost, Hive/LLAP, Spark SQL, TensorFlow and Zeppelin run on YARN, the Data Operating System for cluster resource management, over CPU / GPU / SSD hardware, with HDFS, AWS S3 and RDBMS as data sources.)
Slide 11
Why GPU?
 GPUs can speed up the following computation-intensive applications by 10x to 300x:
– Gene analysis
– Deep learning
– Self-driving cars
– Scientific computation
Without GPU speed-up, these computations are practically infeasible: a single job could run for weeks.
Slide 12
Why GPU?
 GPU: many cores handle massive (but simple) computation tasks simultaneously. (Diagram: GPU cores handle the computation-intensive work, CPU cores handle the rest.)
– Nvidia Tesla K40: 2880 CUDA cores, $2200.00 => $0.76 / core
– Intel Xeon E5-2697: 14 cores, $2295.00 => $163 / core
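The per-core arithmetic behind these list prices checks out (the slide's $163 truncates $163.93):

```python
# Per-core cost from the slide's list prices (approximate figures).
gpu_cost_per_core = 2200.00 / 2880  # Nvidia Tesla K40: 2880 CUDA cores
cpu_cost_per_core = 2295.00 / 14    # Intel Xeon E5-2697: 14 cores

print(round(gpu_cost_per_core, 2))  # 0.76
print(round(cpu_cost_per_core, 2))  # 163.93
```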
Slide 13
Why all under YARN?
A normal YARN user asks for SLAs, monitoring, quotas and isolation:
– SLA: capacity planning, preemption, Reservation System
– Monitoring: Timeline Service, Grafana, etc.
– Quotas / Isolation: CPU / memory, (WIP) GPU, FPGA, network
Slide 14
All running on the same YARN platform
(Diagram: LLAP instances of 128 G each and GPU nodes sharing the same YARN cluster.)
Slide 15
Current status of GPU support on YARN
 Using node labels (YARN-796), available since Apache Hadoop 2.6.0
– Node labels partition one big cluster into smaller disjoint clusters, with shares/ACLs assigned to queues.
– Issues: 1) GPUs are not a countable resource in scheduling. 2) No proper isolation for GPUs.
 The rest of GPU support is WIP under umbrella JIRA YARN-6223
Slide 16
GPU support: Challenges
 GPU isolation
– Unlike memory / CPU, computations have affinity to a particular GPU device.
– Multiple processes using a single GPU are serialized (MPS is an exception).
– Multiple processes sharing the same GPU easily cause OOM.
• Even though TensorFlow offers options to use less GPU memory than the whole device provides, this cannot be enforced from outside the process.
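To make the "cannot enforce from outside" point concrete: the only cooperative knob an external launcher has is which devices a process may see, typically via CUDA_VISIBLE_DEVICES. A minimal sketch (the helper name is ours; note this is advisory only, which is exactly why YARN's real enforcement uses cgroups, as the Solutions slides show):

```python
import os

def gpu_env(assigned_gpu_ids, base_env=None):
    """Build a child-process environment exposing only the GPUs the
    scheduler assigned (hypothetical helper; cooperative, not enforced)."""
    env = dict(base_env if base_env is not None else os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in assigned_gpu_ids)
    return env

env = gpu_env([1, 3], base_env={})
print(env["CUDA_VISIBLE_DEVICES"])  # 1,3
```

A well-behaved CUDA process launched with this environment only sees devices 1 and 3; a misbehaving one can simply unset the variable, hence the need for kernel-level isolation.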
Slide 17
GPU support: Challenges
 Hierarchy of GPUs matters:
– The GPU topology strongly affects communication latency (Von Neumann bottleneck).
Picture credit: https://opus.nci.org.au
Slide 18
GPU support: Challenges
 GPU on Docker: build once, run anywhere is not simple.
 A regular app can run on CentOS 6/7 or any other host, as long as the CPU architecture is the same.
 A GPU application, however, needs a driver to talk to the hardware:
(Diagram: a TensorFlow 1.2 image built on Ubuntu 14.04 with GPU Base Lib v2 and CUDA Library 5.0 fails on a host OS that ships GPU Base Lib v1.)
Slide 19
GPU Support : Solutions
 GPU isolation:
– With the general resource types feature:
• GPUs are detected and reported to the YARN scheduler, which makes the central decision.
– For normal processes: use the cgroups devices subsystem (the same mechanism as CPU/memory isolation).
– For Docker processes: pass --device flags when launching the Docker container.
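The two isolation paths can be sketched as follows. The helper names are ours and the real generation lives inside YARN's container-executor, but the shapes match what the slide describes; NVIDIA GPU device nodes use character-device major number 195 on Linux (an assumption worth verifying on your hosts):

```python
def docker_device_flags(allocated_gpu_minors):
    """--device flags exposing only the allocated GPUs to a Docker
    container (hypothetical helper sketching YARN's behavior)."""
    return ["--device=/dev/nvidia{m}:/dev/nvidia{m}".format(m=m)
            for m in allocated_gpu_minors]

def cgroup_deny_entries(all_gpu_minors, allocated_gpu_minors, nvidia_major=195):
    """devices.deny lines for the GPUs this container must NOT access
    ('c major:minor rwm' = deny read/write/mknod on a char device)."""
    denied = sorted(set(all_gpu_minors) - set(allocated_gpu_minors))
    return ["c {}:{} rwm".format(nvidia_major, m) for m in denied]

print(docker_device_flags([0]))
print(cgroup_deny_entries([0, 1, 2, 3], [0]))
```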
Slide 20
GPU Support : Solutions
 GPU on Docker support
– Uses nvidia-docker-plugin.
(Diagram: the TensorFlow 1.2 / Ubuntu 14.04 image runs against the host OS's GPU Base Lib v1 and CUDA Library 5.0 via a volume mount.)
Slide 21
How the rest of YARN helps GPU support
 Node partition
– Without node partitioning we cannot guarantee the best GPU utilization. An example:
– Two hosts in the cluster; only Host1 has GPUs. At the beginning, the cluster is empty.
– At time T1, a user submits a Spark job that needs 10 G of memory and 4 CPUs. Without node partitioning, it could be placed on Host1.
– If another job then needs 15 G of memory, 6 CPUs and 3 GPUs, it cannot be allocated.
(Diagram: Host1 capacity 20 G / 8 CPUs / 4 GPUs, Host2 capacity 20 G / 8 CPUs. After Task1 lands on Host1, only 10 G / 4 CPUs remain there, so the GPU job has nowhere to go.)
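The two-host example can be replayed with simple resource vectors; the code below is an illustrative sketch, not YARN's scheduler:

```python
def fits(free, demand):
    """True if a demand vector (mem GB, vcores, GPUs) fits in a host's free capacity."""
    return all(f >= d for f, d in zip(free, demand))

host1_free = (20, 8, 4)   # Host1: the only host with GPUs
host2_free = (20, 8, 0)

spark_job = (10, 4, 0)    # 10 G mem, 4 CPUs, no GPUs
gpu_job   = (15, 6, 3)    # 15 G mem, 6 CPUs, 3 GPUs

# Without partitioning, the Spark job may land on Host1 ...
host1_after = tuple(f - d for f, d in zip(host1_free, spark_job))
print(host1_after)                  # (10, 4, 4)
print(fits(host1_after, gpu_job))   # False: the GPU job can no longer fit
print(fits(host2_free, spark_job))  # True: the Spark job would have fit on Host2
```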
Slide 22
How the rest of YARN helps GPU support
 Resource Profiles
– A generalized resource vector
– Admins can create custom resource types!
– Profiles simplify the resource-requesting model
(Diagram: NodeManagers report Memory / CPU / GPU to the ResourceManager, which defines Small / Medium / Large profiles; an Application Master simply requests "Small".)

Profile | Memory | CPU      | GPU
Small   | 2 GB   | 4 cores  | 1 core
Medium  | 4 GB   | 8 cores  | 1 core
Large   | 16 GB  | 16 cores | 4 cores
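On the RM side the profile table amounts to a name-to-vector lookup; a sketch with values taken from the table (the dictionary shape is our assumption, though `yarn.io/gpu` is the resource-type name this work introduced):

```python
# Profile name -> resource vector, values from the slide's table.
PROFILES = {
    "small":  {"memory-mb": 2048,  "vcores": 4,  "yarn.io/gpu": 1},
    "medium": {"memory-mb": 4096,  "vcores": 8,  "yarn.io/gpu": 1},
    "large":  {"memory-mb": 16384, "vcores": 16, "yarn.io/gpu": 4},
}

def resolve_profile(name):
    """Expand a profile name into the full resource vector an AM would request."""
    return PROFILES[name.lower()]

print(resolve_profile("Small"))
```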
Slide 23
Current development status (YARN-6223)
 Apache Hadoop 3.1.0 release (Jan 15, 2018)
– GPU auto detection (Merged)
– GPU scheduling in RM (Merged)
– GPU isolation using Cgroups. (Merged)
– GPU on docker isolation & volume. (Merged)
– UI / Metrics (Merged).
– Documentation (Open)
– Ambari changes (Open)
Slide 24
TensorFlow on Apache Hadoop YARN
Slide 25
YARN assembly: Makes everything easier!
 Forget about writing an application master; this is how you can run an app on YARN:
 Write the assembly spec in JSON (we call it a Yarnfile)
 Post the JSON as a REST request to the YARN server
 YARN figures out the rest
 An example:
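The example on the original slide is an image; a minimal hypothetical Yarnfile for a one-container TensorFlow service might look like the sketch below. The field names follow the YARN Services REST API, but the exact schema (especially how GPUs are requested) varies by Hadoop version, so treat the service name, launch command and GPU key here as illustrative assumptions:

```python
import json

# Hypothetical Yarnfile for a single-container TensorFlow service.
yarnfile = {
    "name": "tf-demo",
    "version": "1.0.0",
    "components": [{
        "name": "worker",
        "number_of_containers": 1,
        "launch_command": "python train.py",
        "resource": {
            "cpus": 4,
            "memory": "4096",
            # GPU ask; key/shape assumed, check your Hadoop version's docs:
            "additional": {"yarn.io/gpu": {"value": 1}},
        },
    }],
}

body = json.dumps(yarnfile, indent=2)
# The JSON is then POSTed to the ResourceManager, e.g. (endpoint per the
# YARN Services API docs):
#   curl -X POST -H "Content-Type: application/json" -d @Yarnfile \
#        http://<rm-host>:8088/app/v1/services
print(json.loads(body)["components"][0]["name"])  # worker
```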
Slide 26
Demo….
Slide 27
Questions?

[Hadoop Meetup] Tensorflow on Apache Hadoop YARN - Sunil Govindan


Editor's Notes

  • #5: This is a typical machine learning workflow, which involves three stages: feature engineering, model training and online service. Not surprisingly, the most important thing is to have the right features: those capturing historical information dominate other types of features. Once we have the right features and the right model, other factors play small roles. We first get a feature representation from the raw data, then feed these features into a machine learning model, and then evaluate the models and choose the best one to push into the online service. The machine learning workflow is complicated and usually involves several steps supported by several infrastructure components.
  • #6: The machine learning workflow starts with loading data from different data sources, such as HDFS, AWS S3 or a database system. After that, we usually join data from the different sources to generate a wide table; Apache Hive and Apache Spark are the most appropriate tools for this workload. Then data scientists start data exploration via Zeppelin. The most common issue is unbalanced labels in the dataset, for example when the number of positive labels far exceeds the number of negative labels. To get a more accurate model, we subsample the group with more instances to balance the dataset. After that, we randomly split the dataset into training and test sets with the help of Spark. Once we have the training data, feature engineering can start.
  • #7: Feature engineering technology has made great progress over the past decade, from hand-designed features to automated feature discovery with deep learning. In many cases hand-designed features can leverage domain knowledge and lead to optimal results; Spark MLlib provides lots of feature transform/selection operators to make this simple and easy, but it involves heavy manual work and requires experienced engineers. In recent years DNNs have been successfully applied in computer vision, speech recognition and natural language processing, with good results. DNNs can learn features automatically via embeddings; the most famous embedding trick is word2vec, which produces a vector space in which each unique word in the corpus is assigned a corresponding vector.
  • #8: Model training is the most important step of the whole pipeline.
  • #9: Deploy the model in a distributed fashion for parallel model serving in batch or streaming mode. Evaluate the model offline or online with different metrics.