Perspective on HPC-enabled AI
Tim Barr
September 7, 2017
AI is Everywhere
Copyright© 2017 Cray Inc. 2
Deep Learning Component of AI
The punchline: Deep Learning is a High Performance Computing problem
• Delivers benefits similar to HPC in other disciplines
  • The value is in the decisions that are enabled
• Characterized by the same underlying factors
  • Large amount of computation
  • Large amount of data motion (I/O and network)
• The same methods work
  • HPC Technology and HPC Best Practice apply directly to DL
Deep Learning Training: Behind the Scenes
For each mini-batch, across processes P1, P2, … Pn:
• Process samples
• Compute gradients locally
• Global average of gradients
Repeat for every mini-batch.
Deploying lots of computational power requires lots of communication.
Computationally-intensive training phase
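The pattern on this slide (each process handles its own samples, computes gradients locally, then all processes take a global average) can be sketched in a few lines. This is a minimal single-machine illustration, not Cray's implementation: the workers are simulated in a loop, `allreduce_mean` is a hypothetical stand-in for a collective such as an MPI allreduce, and the one-weight linear model, learning rate and data are assumptions chosen only to keep the example small.

```python
import random


def local_gradient(w, samples):
    """Mean gradient of 0.5 * (w*x - y)**2 over one worker's samples."""
    return sum((w * x - y) * x for x, y in samples) / len(samples)


def allreduce_mean(values):
    """Hypothetical stand-in for a global average across workers.

    On a cluster this would be a collective operation over the network.
    """
    return sum(values) / len(values)


def train(data, n_workers=4, per_worker=8, lr=0.5, steps=100, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        # One mini-batch: every simulated worker draws its own samples.
        shards = [rng.sample(data, per_worker) for _ in range(n_workers)]
        # Compute gradients locally on each worker.
        grads = [local_gradient(w, shard) for shard in shards]
        # Global average of gradients, then the same update everywhere.
        w -= lr * allreduce_mean(grads)
    return w


# Assumed toy data: y = 3x on a small grid of x values.
data = [(i / 10.0, 3.0 * i / 10.0) for i in range(-10, 11)]
w_final = train(data)
print(w_final)
```

On a real system the per-worker gradient computations run concurrently on separate nodes, and the averaging step is where the communication cost noted on the slide shows up.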
Why Are We Here?
Shared by High Performance Simulation and High Performance Machine and Deep Learning:
• Faster is better
• More accurate is better
• Computationally intensive
• Communication intensive
Let’s Use Weather As An Example
• More Accurate is Better
  • Simulations at 100 km (top) and 25 km (bottom) resolution
  • The coarser simulation missed tropical cyclones and big waves up to 30 meters high
• Faster is Better
  • Higher resolution simulation requires 64X more computation
http://www.nersc.gov/news-publications/nersc-news/science-news/2017-2/researchers-catch-extreme-waves-with-high-resolution-modeling
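The slide does not say how the 64X figure is derived. One plausible accounting, stated here as an assumption, is that going from 100 km to 25 km refines the grid 4x in each of two horizontal directions and also needs a proportionally shorter time step, giving 4 x 4 x 4. The helper below (a hypothetical name) just makes that arithmetic explicit.

```python
def compute_scaling(coarse_km, fine_km, refined_dims=3):
    """Cost multiplier when grid spacing shrinks from coarse_km to fine_km.

    Assumption: cost grows with the refinement factor once per refined
    dimension; refined_dims=3 models two horizontal dimensions plus time.
    """
    refinement = coarse_km / fine_km
    return refinement ** refined_dims


factor = compute_scaling(100, 25)
print(factor)
```

Under the same assumption, refining the vertical direction as well would raise the exponent to 4, which is why resolution increases get expensive so quickly.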
HPC and AI Will Converge
Big Data, HPC, Machine Learning and Deep Learning
• 40% reduction in error rates when 10x more data is being used in coordination with AI in speech recognition [1]
• 28% believe HPC will allow them to scale computationally to build deep learning algorithms that can take advantage of high volumes of data [1]
• 2x: digital data is doubling in size every two years, and by 2020 the digital universe will reach 44 zettabytes [2]
1. “Are AI/Machine Learning/Deep Learning in Your Company’s Future?”, insideBigData + NVIDIA
2. EMC Digital Universe with Research & Analysis by IDC
What is Deep Learning?
ARTIFICIAL INTELLIGENCE
Design of intelligent systems that augment human productivity: systems that help decision makers do what they do best by leveraging computers doing what they do best.
Sense, Comprehend, Predict, Act and Adapt
ANALYTICS: Search for the what, when, where and why
Leverage domain and data science to query datasets for insights:
• Descriptive: What happened?
• Diagnostic: Why did it happen?
• Predictive: What will happen?
• Prescriptive: How to make it happen?
MACHINE LEARNING: Learn patterns from the past to predict the future
• Unsupervised: Group, cluster and organize content with domain-specific heuristic models
• Supervised: Train mathematical predictive models with labelled data
DEEP LEARNING: Train and use neural networks as a predictive model
• Vision, Speech, Language
Performance will be an AI Innovation and Adoption Driver
“AI and machine learning have reached a critical tipping point and will increasingly augment and extend virtually every technology enabled service, thing or application.”
“The combination of extensive parallel processing power, advanced algorithms and massive data sets to feed the algorithms has unleashed this new era.”
Gartner’s Top 10 Strategic Technology Trends for 2017
“Fast data is just as important as big data. In 2016, we’ll witness the emergence of a new class of real-time applications in e-commerce and financial technology services powered by super-speedy data analytics. ‘Fast data’ is the second iteration of big data, and it will create a lot of value.”
Fortune Magazine, December 2015
In a competitive international economy, advanced AI and supercomputing are essential ingredients for:
▪ Solution of strategically important problems
▪ Maintaining global leadership in industry, government and academia
▪ Creating next generation technologies, products and services
Deep Learning Will Require Supercomputing
• An AI Revolution Started For Courageous Enterprises
• Yes, Deep Learning Warrants All The Fuss
• Expect To Need Thousands Of Cores
Deep Learning with Supercomputers
NERSC – Deep Learning in Science
Opportunities to apply DL widely in support of classic HPC simulation and modelling
Deep Learning in Automotive
Noise, Vibration and Harshness at Daimler
• Noise, Vibration and Harshness is a traditional HPC application used in automotive and aerospace
• Deep Learning has the potential to do an automatic evaluation of results in complex, multi-component, non-linear applications
Deep Learning Examples in Manufacturing
• Aerospace Drones: 10-fold increase in the commercial drone fleet by 2021 (FAA, 2017)
• Digital Twin: “Top 10 technologies for 2017” (Gartner)
• Autonomous Vehicle: OEMs will invest $7 billion in development (Frost & Sullivan, 2016)
Leveraging data analytics and deep learning between engineering disciplines and across the enterprise has great potential for product quality and innovation
When Should You Start?
A Sample from the Financial Services Sector
[Chart: ROI Timeline. Response categories: Will not see ROI imminently; Will not see ROI for some time; Beginning to see ROI; See significant ROI. Percentages shown: 17%, 46%, 25%, 10%. Time axis: <1 year, 1 year, 1 to 2 years, 3 to 4 years, 5 to 7 years]
▪ ROI payoff will be 1 to 2 years
▪ Time to begin experimentation is now
Source: Innovita Partners, 7/2017, exclusively for Cray
Why Deep Learning Now?
[Figure: timeline of neural network milestones across the Golden Age and the AI Winter. Milestones shown: Electronic brain, Perceptron, ADALINE, XOR, Backpropagation, Deep Learning, SVM. Annotations: adjustable weights; weights are not learned; learnable weights and threshold; XOR problem; solution to nonlinearly separable problems; big computation, local optima/overfitting; limitations of learning prior; kernel function: human intervention; hierarchical feature learning]
Deep Learning Now:
• "Large Enough" Data to Train
• Compute Power
• Advanced Algorithms and Software Frameworks
• Data Science Expertise
Image Source: Andrew L. Beam. (2017, February 13). Deep Learning 101 – Part 1: History and Background [Blog post]. Retrieved from https://beamandrew.github.io/deeplearning/2017/02/23/deep_learning_101_part1.html
Deep Learning Challenges
“AI systems still demand considered design, knowledge engineering and model building”, Forrester AI TechRadar, Q1 2017
▪ A lot to learn for practitioners and end-users:
  ▪ Large, complex workflows
  ▪ Different Toolkits + Data Movement + Network
  ▪ Defining the value returned to the business
▪ Training times grow with data sizes and complexity:
  ▪ Days to Weeks
  ▪ Compounded with hyperparameter optimization (O(1000) is not unrealistic)
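To see how hyperparameter optimization compounds training time, the rough arithmetic below multiplies the number of training runs by the time per run and divides by how many runs can proceed at once. The function name and the specific inputs (2 days per run, 100 concurrent runs) are illustrative assumptions, not figures from the slide; only the order of magnitude of 1000 runs comes from the text above.

```python
import math


def search_wall_clock_days(trials, days_per_trial, concurrent_trials):
    """Wall-clock days to finish all trials when some run side by side."""
    waves = math.ceil(trials / concurrent_trials)
    return waves * days_per_trial


# Assumed inputs: 1000 trials at 2 days each.
serial_days = search_wall_clock_days(1000, 2, 1)
parallel_days = search_wall_clock_days(1000, 2, 100)
print(serial_days, parallel_days)
```

The sketch treats every trial as independent and equally long, which ignores early stopping and searches that adapt to earlier results.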
HPC and AI
Enabling resource intensive training by delivering performance efficiencies and scalability
Architectures, Software, Platforms, Expertise
▪ Deep Learning Platforms: dense GPU to scalable platforms with optimized software stacks
▪ Apply HPC best practices and expertise to improve deep learning frameworks and core algorithms
Reduce Total Workflow Time
Why? The Deep Neural Net Training Problem
• DNN model with weights on all connections
  • Largest models now hundreds of layers, and millions (to billions) of nodes
• Large set of labeled training data
• Idealized training algorithm:
  • For every minibatch of training samples:
    • run samples forward through the model
    • compute the error vs. the training data
    • back-propagate error through the NN to update the weights (gradient descent)
  • After all data processed, iteratively optimize hyperparameters until required accuracy is achieved
A (not particularly deep) neural net
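The idealized training algorithm above can be sketched as a plain minibatch loop. To stay short, this illustration swaps the deep network for a single linear unit with one weight and one bias, so "back-propagation" reduces to one gradient expression per parameter; the data, batch size and learning rate are assumptions, and the outer hyperparameter optimization step is left out.

```python
import random


def forward(w, b, x):
    """Run one sample forward through the (one-unit) model."""
    return w * x + b


def train(samples, batch_size=4, lr=0.5, epochs=200, seed=0):
    rng = random.Random(seed)
    samples = list(samples)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(samples)
        # For every minibatch of training samples:
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]
            grad_w = grad_b = 0.0
            for x, y in batch:
                err = forward(w, b, x) - y  # error vs. the training data
                grad_w += err * x           # gradient of 0.5*err**2 w.r.t. w
                grad_b += err               # gradient of 0.5*err**2 w.r.t. b
            # Gradient descent update with the batch-averaged gradient.
            w -= lr * grad_w / len(batch)
            b -= lr * grad_b / len(batch)
    return w, b


# Assumed toy data: y = 2x + 1 on 20 points in [0, 1).
samples = [(i / 20.0, 2.0 * i / 20.0 + 1.0) for i in range(20)]
w_fit, b_fit = train(samples)
print(w_fit, b_fit)
```

In a real network the two gradient lines are replaced by a backward pass through every layer, which is where most of the computation in the training phase goes.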
Reduce Total Workflow Time
▪ Minutes, Hours:
  ▪ Interactive research! Instant gratification!
▪ 1-4 days:
  ▪ Tolerable
  ▪ Interactivity replaced by running many experiments in parallel
▪ 1-4 weeks:
  ▪ High value experiments only
  ▪ Progress stalls
▪ >1 month:
  ▪ Don’t even try
Source: Large-Scale Deep Learning for Intelligent Computer Systems, Jeff Dean, Google
Data Acquisition → Data Preparation → Model Training → Model Testing
Apply HPC best practices and expertise to improve deep learning frameworks and core algorithms
Cray Focus: Deep Learning Training at Scale
[Chart: CNTK: Distributed Version vs Cray MPI Parallel Implementation. Epoch elapsed time (seconds, axis from 0 to 700) at 64, 128, 256, 512, 1024 and 2048 nodes]
“Applying a supercomputing approach to optimize deep learning workloads represents a powerful breakthrough for training and evaluating deep learning algorithms at scale. Our collaboration with Cray and CSCS has demonstrated how the Microsoft Cognitive Toolkit can be used to push the boundaries of deep learning.”
- Dr. Xuedong Huang, distinguished engineer, Microsoft AI and Research
▪ Apply HPC Best Practices and Cray Expertise to improve DL systems and core algorithms with real-world use cases
▪ Collaborations across Cray customers and other stakeholders
▪ Currently optimizing different toolkits:
  ▪ CNTK (Microsoft Cognitive Toolkit)
  ▪ TensorFlow
  ▪ MXNet
HPC Focus: Comprehensive Systems
[Figure: a small "ML Code" box in the middle, surrounded by Configuration, Data Collection, Feature Extraction, Data Verification, Machine Resource Management, Analysis Tools, Process Management Tools, Serving Infrastructure and Monitoring]
“Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small black box in the middle. The required surrounding infrastructure is vast and complex.”
- Adapted from Hidden Technical Debt in Machine Learning Systems, Sculley et al., NIPS ‘15
HPC Supports the Entire AI Workflow
Deep Learning workflows are not limited to training.
● Similar to other HPC and analytics workloads, significant portions of DL jobs are devoted to data collection, preparation and management.
Data Acquisition → Data Preparation → Model Training → Model Testing
• Data preparation: Cleansing, Shaping, Enrichment; Data Annotation (Ground Truth)
• Data sets: Training Set, Validation Set, Test Set
• Train Model, with iterative Cross-Validation
• Evaluate Performance and optimize model
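The training, validation and test sets named on this slide, along with cross-validation, can be illustrated with a small standard-library sketch. The 70/15/15 split, the fold count and both function names are assumptions made for the example; the slide does not prescribe any particular proportions.

```python
import random


def split_dataset(items, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle once, then cut into training, validation and test sets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train_frac)
    n_val = int(len(items) * val_frac)
    train_set = items[:n_train]
    val_set = items[n_train:n_train + n_val]
    test_set = items[n_train + n_val:]
    return train_set, val_set, test_set


def k_folds(items, k=5):
    """Yield (train, held_out) pairs; each item is held out exactly once."""
    items = list(items)
    for i in range(k):
        held_out = items[i::k]
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, held_out


train_set, val_set, test_set = split_dataset(range(100))
folds = list(k_folds(train_set, k=5))
print(len(train_set), len(val_set), len(test_set), len(folds))
```

The test set is split off before any cross-validation so that the final evaluation uses data the model selection loop never saw.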
AI is everywhere… Even the grocery store
Thank You
