GIST AI-X Computing Cluster
Jargalsaikhan Narantuya (자르갈)
2021-07-21
AI Graduate School
Gwangju Institute of Science and Technology (GIST)
Introduction
• As the complexity of machine learning (ML) models and the size of training data grow enormously, methods that
scale with computation are becoming the future of Artificial Intelligence (AI) development.
Source: NVIDIA, SC2020
2
• Powerful accelerated computing is required for big data analysis and machine learning.
3
Hardware Acceleration
• Using special hardware to perform some functions more efficiently than running them on a CPU.
• Started with GPUs; now also includes FPGAs (SmartNICs) and ASICs.
GPU-accelerated applications (speedup compared with a CPU-only implementation; source: NVIDIA)
Graphical Processing Unit (GPU)
Mythbusters demo
• Historically, the GPU was intended only for graphics applications, drawing polygons and producing the monitor output of each PC.
• Now, it is broadly used in machine learning as a co-processor that accelerates CPUs for general-purpose computing.
(Figure: CPU vs. GPU)
4
AI computing is not just about having multiple GPUs!
GIST AI-X Cluster Center
SINGULARITY
Invested more than $1 million
5
5,000,000,000,000,000 floating-point operations per second (5 petaFLOPS)
6
Computing
DGX A100:
- 8x NVIDIA A100 GPUs
- 320 GB GPU memory
- 1 TB system memory
- Dual AMD Rome 2.25 GHz (64-core) CPUs
- 9x 200 Gb/s NICs
DGX-1V:
- 8x NVIDIA V100 GPUs
- 256 GB GPU memory
- 512 GB system memory
- Dual Intel Xeon E5-2698 v4 2.2 GHz (20-core) CPUs
- 4x 100 Gb/s NICs
7
AI-X computing cluster layout (diagram):
- GPU Node 1 (DGXA100-1) and GPU Node 2 (DGXA100-2): single-node multi-GPU servers (8x A100 each), AI Graduate School.
- GPU Node 3 (DGX1v-1) and GPU Node 4 (DGX1v-2): single-node multi-GPU servers (8x Tesla V100 each), AI Graduate School.
- Cloud Login Node (Controller + Master) and Box Login Node (Slurm + K8S Master).
- Storage: Ceph Storage 140 TB (each user ~3 TB), AI-X Data Pond 170 TB (each user ~5 TB), Local Storage 3 TB (each user ~100 GB).
- Storage Node (AI-X Data Pond): 10x FlashBlade 17 TB behind an External Fabric Module (XFM) with 4x 100G uplinks.
- Compute + DataPond Nodes 1-4 (2U4N chassis, 6x 6.4 TB each), 25G per node.
- Switching: Mellanox SN2100 (100G RoCE) and Mellanox QM8700 (200G IB); 8x 200G per DGX A100, 2x 100G per DGX-1V, 2x 100G per DGX node to the data network, 8x 40G to the XFM.
- Networks: Management (10G), Data (100G RoCE and 25G RoCE), Internal (100G IB and 200G IB), Ceph (25G RoCE), Campus (1G).
8
AI-X Front Cluster and AI-X Back Cluster (diagram):
- Same GPU nodes as above: GPU Nodes 1-2 (DGX A100, 8x A100 each) and GPU Nodes 3-4 (DGX-1V, 8x Tesla V100 each), AI Graduate School.
- Core Cloud (Controller + Master / Slurm + K8S Master), AI-X Data Pond 140 TB, Local Storage 3 TB, AI-X Data Lake 170 TB.
- Storage Node (AI-X Data Lake): 10x FlashBlade 17 TB behind the External Fabric Module (XFM) with 4x 100G uplinks; Compute + DataPond Nodes 1-4 (2U4N, 6x 6.4 TB each).
- Switching: Mellanox SN2100 (100G RoCE) and Mellanox QM8700 (200G IB); 8x 200G per DGX A100, 2x 100G per DGX-1V, 100G per node, 8x 40G to the XFM.
- Networks: Management (10G), Data (100G RoCE), Internal (100G IB and 200G IB).
- The front and back clusters are roughly 1 km apart, linked between the IDF and MDF through an additional Mellanox SN2100 (100G RoCE).
9
AI-X Back Cluster (diagram):
- GPU Nodes 1-6 of the AI Graduate School: DGX A100 systems (8x A100 each, labeled DGXA100-1/2) and DGX-1V systems (8x Tesla V100 each, labeled DGX1v-1/2), each a single-node multi-GPU server.
- Core Cloud (Slurm + K8S Master), shared Local Storage 3 TB, AI-X Data Lake 170 TB.
- Storage Node (AI-X Data Lake): 10x FlashBlade 17 TB behind an External Fabric Module (XFM) with 4x 100G uplinks.
- Switching: Mellanox SN2100 (100G RoCE) and Mellanox QM8700 (200G IB); 8x 200G per DGX A100, 2x 100G per DGX-1V, 100G per node, 8x 40G to the XFM.
- Networks: Management (10G), Data (100G RoCE), Internal (100G IB and 200G IB).
10
Storage
• Network File System (NFS v3): NFS server and clients.
- The login node and the DGX nodes act as NFS clients; Pure Storage acts as the NFS server (Storage Node, AI-X Data Pond, 10x FlashBlade 17 TB) and exports an individual /mnt/user_id directory to each user (diagram).
• Ceph: software-defined storage.
- Object storage
- Block storage
- File storage
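As a small illustration, a job can locate its personal NFS directory by path; the sketch below assumes the per-user export is mounted at /mnt/<user_id> on both the login node and the DGX nodes, as in the diagram above.

import getpass
import os
import shutil

# Individual directory on the AI-X Data Pond, assumed to be mounted at /mnt/<user_id>.
user_dir = os.path.join("/mnt", getpass.getuser())

print("Visible on this node:", os.path.isdir(user_dir))
if os.path.isdir(user_dir):
    total, used, free = shutil.disk_usage(user_dir)  # rough view of remaining space
    print(f"Used {used / 1e12:.2f} TB of {total / 1e12:.2f} TB")

Because the same directory is exported to the login node and the DGX nodes, data copied there once is visible to every job without further transfers.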
Networking (Management, Internal, Data)
• AI-X computing cluster:
- Data network: Mellanox SN2100 (100G RoCE)
- Internal network: Mellanox QM8700 (200G InfiniBand)
• Commonly used in HPC:
- High throughput and low latency.
- Connects supercomputers and storage systems.
- RoCE: RDMA over Converged Ethernet.
- InfiniBand (IB): low latency and high bandwidth for system area networks (SANs).
• Link speed:
- Enhanced Data Rate (EDR): 25 Gb/s per lane (100 Gb/s for 4x)
- High Data Rate (HDR): 50 Gb/s per lane (200 Gb/s for 4x)
11
DeepOps-based GPU Cloud Deployment
• Open-source project that facilitates deployment of multi-node GPU clusters for deep learning.
• DeepOps is also recognized as the DGX POD management software.
• Deployment options (single node or multi-node):
- Kubernetes (GPU-enabled Kubernetes cluster using DeepOps)
- Slurm (GPU-enabled Slurm cluster using DeepOps)
- DGX POD Hybrid Cluster (a hybrid cluster with both Kubernetes and Slurm)
- Virtual (virtualized version of DeepOps)
DeepOps deployment in the AI-X rack: multi-node GPU cluster with DeepOps
12
(Photo: NVIDIA CEO Jensen Huang)
Background
13
“Methods that scale with computation are the future of Artificial Intelligence”
— Richard S. Sutton, Father of reinforcement learning
• Slurm (Simple Linux Utility for Resource Management):
- Open source, fault-tolerant, and highly scalable cluster management and job scheduling system.
• Deployed at various national and international computing centers.
- Approximately 60% of the TOP500 supercomputers in the world.
• Three key functions:
- Allocates exclusive or non-exclusive access to compute nodes for some duration of time.
- Provides a framework for starting, executing, and monitoring work (normally a parallel job).
- Arbitrates contention for resources by managing a queue of pending work.
14
(TOP500 list, June 2021)
Why share resources?
15
Shower room: Resource (GPU …)
People: Jobs
Color: Lab, Company, User …
16
• Singularity containers are used:
- High-performance container technology.
- Designed specifically for large-scale, cross-node HPC and DL workloads.
- Lightweight, fast to deploy, and easy to migrate.
- Supports conversion from Docker images to Singularity images.
• User permissions: containers can be started by both root and non-root users.
• Performance: more lightweight, smaller kernel namespace, less performance loss.
• HPC-optimized: highly suitable for HPC scenarios (Slurm, OpenMPI, InfiniBand).
Singularity + Docker
Users can use Singularity without extra adaptation for HPC.
• Frameworks such as TensorFlow and PyTorch are essential for implementing DL applications.
• Containerization technology is adopted to provide all user requirements independently.
Why Singularity? I am familiar with Docker
• Security:
• HPC environments are typically multi-user systems where users should only
have access to their own data.
• For all practical purposes, Docker requires super-user privileges.
• It is hard to give someone limited Docker access.
• Scheduling:
• Users submit jobs to Slurm with CPU/memory/GPU/time requirements.
• The docker command is only an API client that talks to the Docker daemon.
• Singularity runs container processes without a daemon (they run as child processes).
• Other concerns:
• Docker is better suited to running applications on VM or cloud infrastructure.
• Singularity is better for command-line applications and for accessing devices such as GPUs or MPI hardware.
17
Distributed Training/Parallel Computing
Message Passing Interface
• Parallelism on HPC systems is obtained by using MPI.
• Uses the high-performance InfiniBand communication network.
• OpenMPI:
- Open-source Message Passing Interface implementation for multi-process programs.
- Provides an interface for processes to deliver results to each other.
- Used by many TOP500 supercomputers.
18
Example (figure): the numbers 1-100 are split across four processes; process 0 sums 1-25 (325), process 1 sums 26-50 (950), process 2 sums 51-75 (1,575), and process 3 sums 76-100 (2,200). The partial sums are exchanged over the high-speed network (InfiniBand) via the Message Passing Interface, and process 0 adds them to obtain the total 5,050.
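As an illustration, the same computation can be written with mpi4py, the Python bindings for MPI. This is a minimal sketch, assuming mpi4py is available inside the container; the file name mpi_sum.py is hypothetical, and the program would be launched with, e.g., mpirun -np 4 python mpi_sum.py.

# Minimal mpi4py sketch of the figure above: four processes each sum 25 of the
# numbers 1..100, and process 0 reduces the partial sums.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()                        # 0, 1, 2, 3
size = comm.Get_size()                        # 4 when launched with -np 4

chunk = 100 // size                           # 25 numbers per process
start = rank * chunk + 1                      # process 0: 1..25, process 1: 26..50, ...
local_sum = sum(range(start, start + chunk))  # 325, 950, 1575, 2200

total = comm.reduce(local_sum, op=MPI.SUM, root=0)  # partial sums travel over the network
if rank == 0:
    print("Sum:", total)                      # Sum: 5050

Only the four partial sums cross the network; each process works on its own slice of the data, which is what lets the pattern scale across nodes.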
Horovod (Distributed deep learning framework)
• Distributed deep learning training framework developed by Uber:
- Makes distributed deep learning fast and easy to use.
- Enables training across multiple hosts, each with multiple GPUs.
- Supports TensorFlow, Keras, PyTorch, and Apache MXNet.
• Distributed training:
- Data parallelism (split the data).
- Model parallelism (split the layers).
19
Reduces training time for deep neural networks by using multiple GPUs.
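For illustration, a minimal data-parallel training loop with Horovod and PyTorch could look like the sketch below, assuming horovod and torch are installed inside the Singularity image; the model, data, and file name train_hvd.py are placeholders, and the job would be launched with something like horovodrun -np 8 python train_hvd.py (one process per GPU).

import torch
import torch.nn as nn
import horovod.torch as hvd

hvd.init()                                    # one process per GPU
torch.cuda.set_device(hvd.local_rank())       # pin this process to its local GPU

model = nn.Linear(1024, 10).cuda()            # toy placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Average gradients across all workers on every step (ring all-reduce).
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())

# Start every worker from identical weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

loss_fn = nn.CrossEntropyLoss()
for step in range(100):
    # In real training, each worker reads its own shard of the dataset (data parallelism).
    x = torch.randn(32, 1024).cuda()
    y = torch.randint(0, 10, (32,)).cuda()
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    if hvd.rank() == 0 and step % 10 == 0:
        print(f"step {step}: loss {loss.item():.4f}")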
Monitoring
20
Using GIST AI-X Computing Cluster
21
How to Use?
1. Create an ID/password (contact the admin or ask your labmate).
2. Connect to the login node (SSH).
3. Copy your data to the login node (mounted directory).
4. Request resources and submit a job (partitions: v100 and a100).
5. Move to the DGX node with the requested resources.
6. Build a writable Singularity image.
7. Run the Singularity container and do your work.
22
Requesting resources and building container image
23
24
Running container
Resource Allocation Policy
25
Flavor Name  | GPUs    | CPU cores | Memory | Storage | Number of jobs | Time
small_v100   | 2x V100 | 20        | 100 GB | 3 TB    | 4              | 3 days
medium_v100  | 4x V100 | 40        | 200 GB | 3 TB    | 2              | 7 days
large_v100   | 8x V100 | 80        | 450 GB | 3 TB    | 2              | 21 days
small_a100   | 2x A100 | 64        | 200 GB | 3 TB    | 4              | 3 days
medium_a100  | 4x A100 | 128       | 450 GB | 3 TB    | 2              | 7 days
large_a100   | 8x A100 | 256       | 950 GB | 3 TB    | 2              | 21 days
Flavor small_v100 in detail (a sample batch script is sketched below):
- Resources: at most 2 GPUs, 20 CPU cores, and 100 GB memory for a single job.
- Number of jobs: a user can submit at most 4 jobs with one user ID.
- Time limit: a single job can run for 3 days (72 hours); after 3 days it is automatically canceled.
- You can restart your job afterwards if enough resources are available.
• All labs can use the GIST AI-X computing cluster.
• By default, flavor “small_v100” is allocated to each lab account.
• If you need more resources, contact the admin.
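As a sketch of how the default small_v100 flavor could be requested, a batch script along the following lines would be submitted with sbatch job_small_v100.py; sbatch reads the #SBATCH comment lines even though the body is Python. The file name, job name, and exact option values are assumptions and should be confirmed with the admin.

#!/usr/bin/env python3
# Hypothetical batch script for the default "small_v100" flavor
# (2 GPUs, 20 CPU cores, 100 GB memory, 3-day limit, v100 partition).
#SBATCH --job-name=small_v100_test
#SBATCH --partition=v100
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=20
#SBATCH --mem=100G
#SBATCH --time=3-00:00:00

import os
import subprocess

# Report what Slurm actually allocated to this job.
print("Job ID:", os.environ.get("SLURM_JOB_ID"))
print("Allocated GPUs:", os.environ.get("CUDA_VISIBLE_DEVICES"))
subprocess.run(["nvidia-smi", "-L"], check=False)

# The real workload would normally run inside a Singularity container, e.g.:
# subprocess.run(["singularity", "exec", "--nv", "my_image.sif", "python", "train.py"])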
Thank You!
Q&A
26
Email: jargalsaikhan.n@gist.ac.kr, Phone: 6356
Office: AI Graduate School Building S7, 1st Floor, Researcher’s Office