Helge Gose, NVIDIA Solution Architect, June 7, 2018
HPE AND NVIDIA
EMPOWERING AI AND IOT
AGENDA
What is Deep Learning?
Volta and NVLINK
Training to Inference – HPE Solutions
IoT Use Case
AI AND DEEP LEARNING
NIPS (2012)
ImageNet Classification with Deep
Convolutional Neural Networks
Alex Krizhevsky
University of Toronto
Ilya Sutskever
University of Toronto
Geoffrey E. Hinton
University of Toronto
Deep Learning
DEEP LEARNING
IS SWEEPING ACROSS INDUSTRIES
MEDICINE: cancer cell detection, diabetic grading, drug discovery
AUTONOMOUS MACHINES: pedestrian detection, lane tracking, traffic sign recognition
SECURITY & DEFENSE: face recognition, video surveillance, cyber security
MEDIA & ENTERTAINMENT: video captioning, content-based search, real-time translation
INTERNET SERVICES: image/video classification, speech recognition, natural language processing
DEEP LEARNING APPLICATION DEVELOPMENT
Untrained Neural Network Model + Deep Learning Framework
TRAINING: learning a new capability from existing data → Trained Model (New Capability)
Trained Model Optimized for Performance
INFERENCE: applying this capability to new data → App or Service Featuring Capability
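The development flow on this slide (train, optimize, then infer) can be sketched as a pipeline of stages. A minimal, framework-agnostic sketch; the `Model`, `train`, `optimize`, and `infer` names are illustrative, not any specific framework's API:

```python
from dataclasses import dataclass, field

@dataclass
class Model:
    weights: list = field(default_factory=list)
    trained: bool = False
    optimized: bool = False

def train(model, dataset):
    """TRAINING: learn a new capability from existing labeled data."""
    for example, label in dataset:
        pass  # a real framework would update model.weights here
    model.trained = True
    return model

def optimize(model):
    """Optimize the trained model for runtime performance
    (layer fusion, pruning; TensorRT plays this role in NVIDIA's stack)."""
    model.optimized = True
    return model

def infer(model, new_data):
    """INFERENCE: apply the learned capability to data not seen in training."""
    assert model.trained and model.optimized
    return [f"prediction for {x}" for x in new_data]

predictions = infer(optimize(train(Model(), dataset=[])),
                    new_data=["new_image.jpg"])
print(predictions)
```

The point of the sketch is the ordering: optimization happens once, after training and before deployment, so the inference stage only ever sees the optimized model.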
VOLTA AND NVLINK
TESLA V100
WORLD’S MOST ADVANCED DATA CENTER GPU
5,120 CUDA cores
640 NEW Tensor cores
7.8 FP64 TFLOPS | 15.7 FP32 TFLOPS | 125 Tensor TFLOPS
20MB SM RF | 16MB Cache
16GB/32GB HBM2 @ 900GB/s | 300GB/s NVLink
NVLINK FABRIC
V100: 6 NVLinks @ 50 GB/s bidirectional
Total bandwidth of 300 GB/sec
Maximizing system throughput
Advancing Multi-GPU Processing
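The 300 GB/s total on this slide follows directly from the per-link figure; Python used here only for the arithmetic:

```python
# V100 NVLink fabric: 6 links, 50 GB/s bidirectional each.
num_links = 6
gb_per_s_per_link = 50
total_bandwidth = num_links * gb_per_s_per_link
print(total_bandwidth)  # 300, matching the slide's aggregate figure
```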
REVOLUTIONARY AI PERFORMANCE
3X faster DL training performance: a 3X reduction in time to train over P100.
Relative time to train (LSTM) – neural machine translation, 13 epochs, German -> English, WMT15 subset; CPU = 2x Xeon E5-2699 v4:
− CPU: 15 days
− P100: 18 hours
− V100: 6 hours
Over 80X DL training performance in 3 years – GoogLeNet training performance on versions of cuDNN, speedup vs. 1x K80 with cuDNN2:
− Q1 2015: 1x K80, cuDNN2 (baseline)
− Q3 2015: 4x M40, cuDNN3
− Q2 2016: 8x P100, cuDNN6
− Q2 2017: 8x V100, cuDNN7
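The "3X over P100" headline follows from the training times quoted for the NMT benchmark; quick arithmetic on the slide's own numbers:

```python
# Time to train, 13 epochs NMT (German -> English, WMT15 subset), per the slide.
cpu_hours = 15 * 24   # 2x Xeon E5-2699 v4 baseline: 15 days
p100_hours = 18
v100_hours = 6

print(p100_hours / v100_hours)  # 3.0  -> "3X reduction in time to train over P100"
print(cpu_hours / v100_hours)   # 60.0 -> V100 vs. the dual-CPU baseline
```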
END-TO-END PRODUCT FAMILY
TRAINING
− Data center: HPE Apollo 6500 | Tesla V100
− Desktop: DGX Station, TITAN V
INFERENCE
− Data center: Tesla V100, Tesla P4
− Embedded: Jetson (JetPack SDK)
− Automotive: Drive PX (DriveWorks SDK)
POWERING THE DEEP LEARNING ECOSYSTEM
NVIDIA SDK Accelerates Every Major Framework
COMPUTER VISION: object detection, image classification
SPEECH & AUDIO: voice recognition, language translation
NATURAL LANGUAGE PROCESSING: recommendation engines, sentiment analysis
DEEP LEARNING FRAMEWORKS (including Mocha.jl)
NVIDIA DEEP LEARNING SDK
developer.nvidia.com/deep-learning-software
TRAINING TO INFERENCE – HPE SOLUTIONS
HPE COMPREHENSIVE, PURPOSE-BUILT PORTFOLIO FOR DEEP LEARNING
− HPE Apollo 6500: enterprise platform for accelerated computing; compute ideal for training models in the data center
− HPE SGI 8600: petaflop scale for deep learning and HPC
− HPE Apollo 2000: the bridge to enterprise scale-out architecture; compute for both training models and inference at the edge
− HPE Edgeline EL4000: edge analytics and inference engine; unprecedented deep edge compute and high-capacity storage; open standards
− HPE Apollo sx40: maximize GPU capacity and performance with lower TCO
AI / HPC storage: HPE Data Management Framework Software for storage elasticity through tiered data management
Choice of fabrics: Intel® Omni-Path Architecture, Mellanox InfiniBand, HPE FlexFabric Network
AI software frameworks
Target industries: government, academia and industries; financial services; life sciences and health; autonomous vehicles and manufacturing
Services: Advisory and Transformation Services | GreenLake Flexible Capacity | Datacenter Care Services
ACCELERATED PERFORMANCE FOR MACHINE LEARNING
Enterprise AI deep learning platform with industry-leading GPUs and interconnects
Dependable system performance
− Power and cooling headroom for
top bin accelerators
− Robust signal integrity provides
higher bandwidth, lower latency
HPE’s highest GPUs per
server
− Up to 125 Tflops1 single precision
performance
− Eight GPUs per server
Powerful host
− 2x the high speed fabric adapters2
− NVMe drives
− High speed DDR4 SmartMemory
Leading accelerator
technology
− NVLink 2.0 enables dedicated
GPU-to-GPU communication
1 NVIDIA Tesla V100 provides 15.7 TFLOPS single precision performance with NVLink 2.0 per GPU x 8 GPUs ≈ 125 TFLOPS http://www.nvidia.com/content/PDF/Volta-Datasheet.pdf
2 Apollo 6500 Gen10 with 4 OPA or IB connectors vs Apollo 6500 Gen9 with 2 connectors
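Footnote 1's per-server figure is just the per-GPU FP32 rate times eight GPUs, with the slide rounding down:

```python
fp32_tflops_per_gpu = 15.7   # Tesla V100 FP32 peak, per the footnote
gpus_per_server = 8          # Apollo 6500
print(fp32_tflops_per_gpu * gpus_per_server)  # 125.6 -> quoted as "up to 125 TFLOPS"
```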
IOT USE CASE
High Performing Edge Compute Brings Action to the Edge
Shift to the Edge for Control!
HPE enables both Edge AI “Inference” & Core “Training”
FOR MORE INFO
NVIDIA:
https://www.nvidia.de/data-center/tesla/
HPE:
https://www.hpe.com/us/en/solutions/hpc-high-performance-computing.html
Editor's Notes
  • #7: Let’s take a look at how deep learning actually works in practice… Say you want to automatically identify various types of animals, which is a classification task. The first step is to assemble a collection of representative examples to be used as a training dataset which will serve as the experience from which a deep neural network will learn. If you only need to classify cats vs. dogs, then you only need to have cat images and dog images in the training dataset. In this case you’ll need several thousand images, each with a label indicating whether it is a cat image or a dog image. To ensure the training dataset is representative of all the pictures of cats and dogs that exist in the world, it must include a wide range of species, poses, and environments in which dogs and cats may be observed. [next]
  • #8: The next thing you need is a deep neural network model. Typically, this will be an untrained neural network designed to perform a general task like detection, classification, or segmentation on a specific type of input data like images, text, audio, or video. Here you can see a very simple model of an untrained neural network. At the top of the model, there is a row or “layer” that has 4 input nodes, and at the bottom there is a layer that has 2 output nodes. Between the input layer and the output layer are a few so-called “hidden” layers with several nodes each. The white lines show which nodes in the input layer share their results with nodes in the first hidden layer, and so on all the way down to the output layer. You may hear the nodes referred to as “artificial neurons” or “perceptrons” since their simple behavior is inspired by the neurons in the human brain. A real deep neural network model would have many hidden layers between the input layer and the output layer - which is why it’s called “deep” - but a simplified representation works fine for this example. Just keep in mind that the design of the neural network model is what makes it suitable for a particular task. For example, the best models for image classification are very different from the best models for speech recognition. And the differences can include the number layers, the number of nodes in each layer, the algorithms performed in each node, and the connections between nodes. It’s worth noting that there are readily available deep neural network models for image classification, object recognition, image segmentation and several other tasks – but it is often necessary to modify these models to achieve high levels of accuracy for a particular dataset. So, the computation required to train one version of the model is multiplied by the number of model variations that need to be evaluated. 
All of this processing requires a tremendous amount of computation, much of which can be performed in parallel, which makes deep learning an ideal workload for GPUs to accelerate. ---------- simple version ---------- So, if the goal is to train a deep neural network to distinguish cats vs. dogs, you would use a neural network model that is designed to be good at image classification. ---------- detailed version -------- For the image classification task of distinguishing images of cats vs. dogs, we would probably use a convolutional neural network, such as AlexNet, which is comprised of nodes that implement simple generalized algorithms. 1. Convolution layers (linear, element-wise matrix multiplication and addition) 2. Non-linearity function layers (ReLU - rectified linear unit, which replaces negative values with zero) 3. Pooling layers (subsampling/downsampling, reduces dimensionality while retaining the most important information, max/avg/sum) Using these simple, generalized algorithms is a key difference and advantage for deep learning vs. earlier approaches to machine learning, which required many custom data-specific feature extraction algorithms to be developed by specialists for each dataset and task. ------------------------------------ [next]
  • #9: Once you have assembled a training dataset and selected a neural network model, a deep learning framework (such as Caffe2, MXNET, or TensorFlow) is used to feed the training dataset through the neural network. For each image that is processed through the neural network, each node in the output layer reports a number that indicates how confident it is that the image is a dog or a cat. In this case there are only two options, so the model needs just two nodes in the output layer – one for dogs and one for cats. When these final outputs are sorted most-confident to least-confident, the result is called a confidence vector. The deep learning framework then looks at the label for the image to determine whether the neural network guessed (or “inferred”) the correct answer. If it inferred correctly, the framework strengthens the weights of the connections that contributed to getting the correct answer. And vice-versa, if the neural network inferred the incorrect result, the framework reduces the weights of the connections that contributed to getting the wrong answer. After processing the entire training dataset once, the neural network will generally have enough experience to infer the correct answer a little more than half of the time (slightly better than a random coin toss) but it will require several additional “epochs” (meaning rounds of processing the entire training dataset) to achieve higher levels of accuracy. There are many deep learning frameworks to choose from, each with its own set of strengths, programming language interfaces, etc. but they’re all GPU-accelerated using the CUDA Deep Neural Networks library (cuDNN) and other libraries in the NVIDIA Deep Learning SDK. You can learn more about all the deep learning frameworks and the NVIDIA Deep Learning SDK on our developer web site at developer.nvidia.com. [next]
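The feedback loop this note describes (compute a confidence vector, compare against the label, strengthen or weaken the contributing connections, repeat for several epochs) can be sketched numerically. A minimal NumPy stand-in for what a framework does; the toy data and two-output "cat vs. dog" setup are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a labeled training set: 4 features per "image",
# label 0 = cat, 1 = dog (illustrative data, not real images).
X = rng.normal(size=(200, 4))
true_w = np.array([1.5, -2.0, 0.5, 1.0])
y = (X @ true_w > 0).astype(int)

W = np.zeros((4, 2))  # connection weights into the two output nodes

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

for epoch in range(50):              # one epoch = one pass over the dataset
    conf = softmax(X @ W)            # "confidence vector" per image
    onehot = np.eye(2)[y]            # labels the framework compares against
    grad = X.T @ (conf - onehot) / len(X)
    W -= 0.5 * grad                  # strengthen/weaken contributing connections

accuracy = (softmax(X @ W).argmax(axis=1) == y).mean()
print(accuracy)  # well above the 0.5 "coin toss" the note describes
```

A real deep network has many hidden layers and cuDNN-accelerated kernels between input and output, but the strengthen-on-correct / weaken-on-incorrect update shown here is the same idea the note walks through.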
  • #10: So, now that the model has been trained on a large representative dataset, it’s very good at distinguishing between cats vs. dogs. But if you showed it a picture of a racoon it would be very confused since it has no previous experience with racoons, and would probably report a low confidence number for both dog and cat. If you needed the ability to classify racoons as well as dogs and cats, you would simply modify the design topology of the model to add a third node to the output layer, expand the training dataset to include thousands of representative images of racoons, and use the deep learning framework to re-train the model. There’s no need to manually write a racoon feature extraction algorithm and figure out how to integrate it into your deep neural network model, just modify the network topology a bit and re-train the model with your new dataset. [next]
  • #11: Once the model has been trained, much of the generalized flexibility that was necessary during the training process is no longer needed, so it’s possible to optimize the model for significantly faster runtime performance. [Story: How Cooper learned to classify cars, “tucks”, busses, motorcycles, etc.] Common optimizations include fusing layers to reduce memory and communication overhead, pruning nodes that do not contribute significantly to the results, and other techniques supported in the NVIDIA TensorRT runtime. [next]
  • #12: Your fully trained and optimized model is then ready to be integrated into an application that will feed it new data, in this case images of cats and dogs that it hasn’t seen before, and it will be able to quickly and accurately infer the correct answer based on its training. And your application can be deployed on a GPU-accelerated platform in your datacenter, in the cloud, on a local workstation, in a robot, a smart camera, or even a self-driving car.
  • #17: There is a wide range of GPU-accelerated platforms you can use to accelerate deep learning training and inference workloads. If you want a fully integrated solution, we recommend the DGX-1 supercomputer-in-a-box, which delivers the performance equivalent of hundreds of CPU-only servers using 8 world-class Tesla GPUs, or its little brother the DGX Station, which is powered by 4 Tesla GPUs and runs whisper-quiet next to your desk. If you just want to get started on a prototype using your existing workstation, the TITAN V includes new Tensor Cores designed specifically for deep learning that deliver up to 12x higher peak TFLOPS for training. In the datacenter, the Tesla P100 and V100 with NVLink technology deliver strong scaling for mixed workloads across both HPC and deep learning training and inference applications, while for scale-out inference workloads the Tesla P4 (GP104) delivers high-efficiency (perf/watt), low-latency performance. And if you need to deploy deep learning applications in automotive or embedded settings, NVIDIA offers the Drive PX and Jetson platforms. What happens after development? NGC is tuned for all of these platforms, so once you productize your research or solution there is a seamless deployment path: to cloud-based microservices in the form of the NGC TensorRT container, or to embedded devices via JetPack and DriveWorks for robotics, drones, autonomous vehicles, etc.