High Performance
Computing
strategies for machine learning
Why?
https://www.pomona.edu/administration/hpc/access-guidelines/primer
Focus for today
http://cloudscaling.com/blog/cloud-computing/grid-cloud-hpc-whats-the-diff/
Hardware for
HPC
HPC starts when you
feel that you need a
bigger laptop!
http://www.advancedclustering.com/hpc-compute-blocks-built-to-order-for-intensive-workloads/
Good to know:
Scaling up with
Hardware
Understand
what speed is
possible.
http://www.moorinsightsstrategy.com/wp-content/uploads/2015/04/unnamed.png
Scaling up with
Hardware
You want to optimize how your
data flows!
https://www.microway.com/product/octoputer-4u-tesla-8-gpu-server-nvlink/blockdiagram-sys-4028gr-tvxrt-teslav100/
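A back-of-the-envelope sketch of why data flow matters: the time to stream a dataset is roughly its size divided by the bandwidth of the slowest link on its path. The link names and bandwidth figures below are illustrative assumptions (round numbers, not measurements and not read off the block diagram); replace them with values for your own hardware.

```python
# Assumed, approximate bandwidths in GB/s. Illustrative only.
ASSUMED_BANDWIDTH_GB_S = {
    "network_10gbe": 1.25,  # 10 Gbit/s divided by 8 bits per byte
    "local_nvme": 3.0,      # assumed round figure for a fast flash drive
    "pcie3_x16": 16.0,      # assumed round figure for a host-to-GPU link
}


def bottleneck_link(path):
    """Return the name of the slowest link on the given path."""
    return min(path, key=lambda link: ASSUMED_BANDWIDTH_GB_S[link])


def transfer_seconds(size_gb, path):
    """Estimate seconds needed to stream size_gb through every link in path.

    The estimate ignores latency and protocol overhead; only the
    slowest link limits throughput.
    """
    slowest = ASSUMED_BANDWIDTH_GB_S[bottleneck_link(path)]
    return size_gb / slowest


if __name__ == "__main__":
    path = ["network_10gbe", "pcie3_x16"]
    print(bottleneck_link(path), transfer_seconds(100, path))
```

Under these assumptions, a dataset read over the network is limited by the network link, no matter how fast the GPU interconnect is.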
Python tools that help
Recommended talk:
https://youtu.be/HKjM3peINtwe
tbb4py: Python C extension that is activated by monkey-patching
Python's thread pools → enables the Intel TBB (Threading Building
Blocks) threading layer for Intel MKL (Math Kernel Library).
→ I saw a ~20% speedup on some tests.
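A minimal sketch of how such a speedup could be measured on your own workload. The `work` function is a hypothetical placeholder: it is pure Python and will not itself benefit from TBB, so swap in your real NumPy/MKL-based task. The script name `bench.py` is also an assumption.

```python
import time
from multiprocessing.pool import ThreadPool


def work(n):
    """Hypothetical placeholder task; replace with your own workload."""
    return sum(i * i for i in range(n))


def time_pool(tasks, workers=4):
    """Run the tasks on a thread pool; return (elapsed seconds, results)."""
    start = time.perf_counter()
    with ThreadPool(workers) as pool:
        results = pool.map(work, tasks)
    return time.perf_counter() - start, results


def speedup_percent(baseline_s, optimized_s):
    """Speedup of the optimized run relative to the baseline, in percent."""
    return (baseline_s / optimized_s - 1.0) * 100.0


if __name__ == "__main__":
    elapsed, _ = time_pool([200_000] * 16)
    print(f"elapsed: {elapsed:.3f} s")
```

Assuming tbb4py is installed and provides the `tbb` module runner, the same script could be timed twice, once as `python bench.py` and once as `python -m tbb bench.py`, and the two timings compared with `speedup_percent`.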
Goal: Fast R&D
turnover
Time is costly
Hardware is costly
Speed matters
Scaling is important
Data Science has a 1:n
compute requirement
Cloud: large differences in offers.
Costs per compute unit ...
… the story of a model that could not
be trained in sufficient time for
production ...
Product Strategy → ML Strategy →
Data Strategy + Compute Strategy →
HPC Case
https://news.developer.nvidia.com/new-translator-provides-more-human-like-translations/
Example:
Product: Online Translator
HPC Strategy: HPC cluster in Iceland
Current trends in hardware:
Trend: heterogenization and diversification; HPC is increasingly entering
the consumer market
Towards an HPC strategy
HPC accelerates research
Idea
Result
Experiment
Reasoning by Andrew Ng: https://youtu.be/c_55gZfUK1E
Example ML Platform
Define your
compute strategy
Why HPC?
Why Cloud?
When Hybrid?
Answer this as early as possible →
large buy-in risks.
Cloud:
Good if you have no infrastructure
or manpower of your own and
want to get started fast. Good for
scale and resilience.
Hybrid:
Best option if you have the
manpower and the use case.
Gives you the option to pick the
best of both worlds.
On Premise:
Good if you can manage
the hardware. Good if you
want to be highly
optimised and know your
case.
What is your
compute strategy?
Why HPC
Why Cloud
When Hybrid
Be careful with case studies!!
What is your priority? Fast results,
scalability, resilience, cost
efficiency?
Data locality?
Utilization will drive your cost
structure
R&D turnover
Talent available?
Business case? E.g. IoT, embedded,
special hardware, consumer
electronics.
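The point that utilization drives the cost structure can be made concrete with a small sketch. It assumes a deliberately simple model: on-premise hardware has a fixed monthly cost (amortized purchase, power, staff) regardless of use, while cloud capacity is billed per hour used. All prices are hypothetical inputs, not quotes from any provider.

```python
HOURS_PER_MONTH = 730  # average month: 8760 hours per year / 12


def on_prem_cost_per_used_hour(monthly_fixed_cost, utilization):
    """Effective cost of one used hour on hardware you own.

    utilization is the fraction of the month the machine does useful
    work, in the range (0, 1].
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return monthly_fixed_cost / (HOURS_PER_MONTH * utilization)


def break_even_utilization(monthly_fixed_cost, cloud_price_per_hour):
    """Utilization above which owning is cheaper than renting.

    A result greater than 1 means renting is cheaper at any utilization
    under this simple model.
    """
    return monthly_fixed_cost / (HOURS_PER_MONTH * cloud_price_per_hour)


if __name__ == "__main__":
    # Hypothetical numbers: 730 per month fixed versus 2.0 per cloud hour.
    print(break_even_utilization(730, 2.0))
    print(on_prem_cost_per_used_hour(730, 0.25))
```

With these hypothetical numbers, owning only pays off once the machine is busy more than half of the time; at 25% utilization each used hour costs twice the cloud price.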
Example setup:
Data
Research
Deploy
! Good option for HPC,
on premise or in the cloud
Cloud / on premise
Cloud
A possible HPC setup for research
Slurm Network
NFS (nearline)
Master Node (weak)
Nodes
Send Jobs
User Login
SLURM Compute Node:
N CPU (specialized)
X RAM (Main memory)
Generic resources:
- Enhanced network
- Fast internal flash storage
- GPU / Phi coprocessors
Same user id and
permissions on all
system components!!
Mounted
network file
system
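To illustrate the "send jobs" step in the setup above, here is a minimal sketch that renders a Slurm batch script requesting CPUs, memory and a GPU as a generic resource. The helper name `render_sbatch`, the default values and the example command are hypothetical; the available `--gres` names depend on how your cluster is configured.

```python
def render_sbatch(job_name, command, cpus=4, mem_gb=16, gpus=1,
                  time_limit="01:00:00"):
    """Return the text of a Slurm batch script for one job.

    All defaults are illustrative assumptions; adjust them to the
    resources your compute nodes actually offer.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --mem={mem_gb}G",
        f"#SBATCH --time={time_limit}",
        # %j is replaced by the job ID when Slurm writes the log file.
        f"#SBATCH --output={job_name}-%j.out",
    ]
    if gpus:
        # Request GPUs through Slurm's generic resource mechanism.
        lines.append(f"#SBATCH --gres=gpu:{gpus}")
    lines += ["", f"srun {command}"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical training command.
    print(render_sbatch("train-model", "python train.py"))
```

The rendered text could be saved, for example as `train.sh`, on the shared file system and submitted from the login node with `sbatch train.sh`.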
Focus:
- Connect: Central model repo
- Connect: Central data repo
- Use templates that can be
ported to your cloud platform
Best Practices
Educate your researchers on how to
best use the system
Develop standards and best
practices for the ML dev cycle (e.g.
model versioning and testing)
Develop standards for transitions,
e.g. between on premise
and cloud.
Check your product use case:
- e.g. requirements for training
- many products vs. single (see
deepl.com example)
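One possible convention for the model versioning mentioned above, sketched under assumptions: derive a version ID from the serialized weights plus the training metadata, so that the same model always gets the same ID in the central model repo, on premise or in the cloud. The function name and the metadata fields are hypothetical; this is one option, not a prescribed standard.

```python
import hashlib
import json


def model_version_id(weights: bytes, metadata: dict) -> str:
    """Return a short, deterministic version ID for a trained model.

    weights is the serialized model; metadata holds JSON-serializable
    training details (hypothetical fields such as dataset or seed).
    """
    digest = hashlib.sha256()
    digest.update(weights)
    # sort_keys makes the ID independent of dictionary key order.
    digest.update(json.dumps(metadata, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()[:12]


if __name__ == "__main__":
    vid = model_version_id(b"fake-weights", {"dataset": "v3", "seed": 42})
    print(vid)
```

A test in the ML dev cycle could then assert that retraining with unchanged data and settings reproduces the same ID, assuming the training itself is deterministic.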

2019 HighPerformance Computing - Strategies for Machine Learning.pdf