GIST AI-X Computing Cluster
Jargalsaikhan Narantuya (자르갈)
2021-07-21
AI Graduate School
Gwangju Institute of Science and Technology (GIST)
Introduction
• As the complexity of machine learning (ML) models and the size of training data grow enormously, methods that
scale with computation are becoming the future of Artificial Intelligence (AI) development.
Source: NVIDIA, SC2020
2
• Powerful accelerated computing is required for big data analysis and machine learning.
3
Hardware Acceleration
• Using special hardware to perform some functions more efficiently than running them on a CPU.
• Started with GPUs; now also includes FPGAs (SmartNICs) and ASICs.
GPU-accelerated applications (speedup compared with a CPU-only implementation; source: NVIDIA)
Graphical Processing Unit (GPU)
Mythbusters demo
• Historically, the GPU was intended only for graphics applications, drawing polygons and producing the monitor output of each PC.
• Now, it is broadly used in machine learning as a co-processor that accelerates CPUs for general-purpose computing.
(Figure: CPU vs. GPU)
4
AI computing is not just about having multiple GPUs!
GIST AI-X Cluster Center
SINGULARITY
Invested more than $1 million
5
5,000,000,000,000,000 floating-point operations per second (5 petaFLOPS)
6
Computing
DGX A100:
- 8x NVIDIA A100 GPUs
- 320 GB GPU memory
- 1 TB system memory
- Dual AMD Rome 2.25 GHz (64-core) CPUs
- 9x 200 Gb/s NICs
DGX-1V:
- 8x NVIDIA V100 GPUs
- 256 GB GPU memory
- 512 GB system memory
- Dual Intel Xeon E5-2698 v4 2.2 GHz (20-core) CPUs
- 4x 100 Gb/s NICs
7
AI-X computing cluster layout (diagram):
- GPU Node 1 (DGXA100-1) and GPU Node 2 (DGXA100-2): single-node multi-GPU servers (8x A100 each), AI Graduate School.
- GPU Node 3 (DGX1v-1) and GPU Node 4 (DGX1v-2): single-node multi-GPU servers (8x Tesla V100 each), AI Graduate School.
- Cloud Login Node (Controller + Master) and Box Login Node (Slurm + K8S Master).
- Storage: Ceph Storage 140 TB (each user ~3 TB), AI-X Data Pond 170 TB (each user ~5 TB), Local Storage 3 TB (each user ~100 GB).
- Storage Node (AI-X Data Pond): 10x FlashBlade 17 TB behind an External Fabric Module (XFM) with 4x 100G uplinks.
- Compute + DataPond Nodes 1-4 (2U4N chassis, 6x 6.4 TB each), 25G per node.
- Switching: Mellanox SN2100 (100G RoCE) and Mellanox QM8700 (200G IB); 8x 200G per DGX A100, 2x 100G per DGX-1V, 2x 100G per DGX node to the data network, 8x 40G to the XFM.
- Networks: Management (10G), Data (100G RoCE and 25G RoCE), Internal (100G IB and 200G IB), Ceph (25G RoCE), Campus (1G).
8
AI-X Front Cluster and AI-X Back Cluster (diagram):
- Same GPU nodes as above: GPU Nodes 1-2 (DGX A100, 8x A100 each) and GPU Nodes 3-4 (DGX-1V, 8x Tesla V100 each), AI Graduate School.
- Core Cloud (Controller + Master / Slurm + K8S Master), AI-X Data Pond 140 TB, Local Storage 3 TB, AI-X Data Lake 170 TB.
- Storage Node (AI-X Data Lake): 10x FlashBlade 17 TB behind the External Fabric Module (XFM) with 4x 100G uplinks; Compute + DataPond Nodes 1-4 (2U4N, 6x 6.4 TB each).
- Switching: Mellanox SN2100 (100G RoCE) and Mellanox QM8700 (200G IB); 8x 200G per DGX A100, 2x 100G per DGX-1V, 100G per node, 8x 40G to the XFM.
- Networks: Management (10G), Data (100G RoCE), Internal (100G IB and 200G IB).
- The front and back clusters are roughly 1 km apart, linked between the IDF and MDF through an additional Mellanox SN2100 (100G RoCE).
9
AI-X Back Cluster (diagram):
- GPU Nodes 1-6 of the AI Graduate School: DGX A100 systems (8x A100 each, labeled DGXA100-1/2) and DGX-1V systems (8x Tesla V100 each, labeled DGX1v-1/2), each a single-node multi-GPU server.
- Core Cloud (Slurm + K8S Master), shared Local Storage 3 TB, AI-X Data Lake 170 TB.
- Storage Node (AI-X Data Lake): 10x FlashBlade 17 TB behind an External Fabric Module (XFM) with 4x 100G uplinks.
- Switching: Mellanox SN2100 (100G RoCE) and Mellanox QM8700 (200G IB); 8x 200G per DGX A100, 2x 100G per DGX-1V, 100G per node, 8x 40G to the XFM.
- Networks: Management (10G), Data (100G RoCE), Internal (100G IB and 200G IB).
10
Storage
• Network File System (NFS v3): NFS server and clients.
- The login node and the DGX nodes act as NFS clients; Pure Storage acts as the NFS server (Storage Node, AI-X Data Pond, 10x FlashBlade 17 TB) and exports an individual /mnt/user_id directory to each user (diagram).
• Ceph: software-defined storage.
- Object storage
- Block storage
- File storage
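As a small illustration, a job can locate its personal NFS directory by path; the sketch below assumes the per-user export is mounted at /mnt/<user_id> on both the login node and the DGX nodes, as in the diagram above.

import getpass
import os
import shutil

# Individual directory on the AI-X Data Pond, assumed to be mounted at /mnt/<user_id>.
user_dir = os.path.join("/mnt", getpass.getuser())

print("Visible on this node:", os.path.isdir(user_dir))
if os.path.isdir(user_dir):
    total, used, free = shutil.disk_usage(user_dir)  # rough view of remaining space
    print(f"Used {used / 1e12:.2f} TB of {total / 1e12:.2f} TB")

Because the same directory is exported to the login node and the DGX nodes, data copied there once is visible to every job without further transfers.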
Networking (Management, Internal, Data)
• AI-X computing cluster:
- Data network: Mellanox SN2100 (100G RoCE)
- Internal network: Mellanox QM8700 (200G InfiniBand)
• Commonly used in HPC:
- High throughput and low latency.
- Connects supercomputers and storage systems.
- RoCE: RDMA over Converged Ethernet.
- InfiniBand (IB): low latency and high bandwidth for system area networks (SANs).
• Link speed:
- Enhanced Data Rate (EDR): 25 Gb/s per lane (100 Gb/s for 4x)
- High Data Rate (HDR): 50 Gb/s per lane (200 Gb/s for 4x)
11
DeepOps-based GPU Cloud Deployment
• Open-source project that facilitates deployment of multi-node GPU clusters for deep learning.
• DeepOps is also recognized as the DGX POD management software.
• Deployment options (single node or multi-node):
- Kubernetes (GPU-enabled Kubernetes cluster using DeepOps)
- Slurm (GPU-enabled Slurm cluster using DeepOps)
- DGX POD Hybrid Cluster (a hybrid cluster with both Kubernetes and Slurm)
- Virtual (virtualized version of DeepOps)
DeepOps deployment in the AI-X rack: multi-node GPU cluster with DeepOps
12
(Photo: NVIDIA CEO Jensen Huang)
Background
13
“Methods that scale with computation are the future of Artificial Intelligence”
— Richard S. Sutton, Father of reinforcement learning
• Slurm (Simple Linux Utility for Resource Management):
- Open source, fault-tolerant, and highly scalable cluster management and job scheduling system.
• Deployed at various national and international computing centers.
- Approximately 60% of the TOP500 supercomputers in the world.
• Three key functions:
- Allocates exclusive or non-exclusive access to compute nodes for some duration of time.
- Provides a framework for starting, executing, and monitoring work (normally a parallel job).
- Arbitrates contention for resources by managing a queue of pending work.
14
(TOP500 list, June 2021)
Why share resources?
15
Shower room: Resource (GPU …)
People: Jobs
Color: Lab, Company, User …
16
• Singularity containers are used:
- High-performance container technology.
- Designed specifically for large-scale, cross-node HPC and DL workloads.
- Lightweight, fast to deploy, and easy to migrate.
- Supports conversion from Docker images to Singularity images.
• User permissions: containers can be started by both root and non-root users.
• Performance: more lightweight, smaller kernel namespace, less performance loss.
• HPC-optimized: highly suitable for HPC scenarios (Slurm, OpenMPI, InfiniBand).
Singularity + Docker
Users can use Singularity without extra adaptation for HPC.
• Frameworks such as TensorFlow and PyTorch are essential for implementing DL applications.
• Containerization technology is adopted to provide all user requirements independently.
Why Singularity? I am familiar with Docker
• Security:
• HPC environments are typically multi-user systems where users should only
have access to their own data.
• For all practical purposes, Docker requires super-user privileges.
• It is hard to give someone limited Docker access.
• Scheduling:
• Users submit jobs to Slurm with CPU/memory/GPU/time requirements.
• The docker command is only an API client that talks to the Docker daemon.
• Singularity runs container processes without a daemon (they run as child processes).
• Other concerns:
• Docker is better suited to running applications on VM or cloud infrastructure.
• Singularity is better for command-line applications and for accessing devices such as GPUs or MPI hardware.
17
Distributed Training/Parallel Computing
Message Passing Interface
• Parallelism on HPC systems is obtained by using MPI.
• Uses the high-performance InfiniBand communication network.
• OpenMPI:
- Open-source Message Passing Interface implementation for multi-process programs.
- Provides an interface for processes to deliver results to each other.
- Used by many TOP500 supercomputers.
18
Example (figure): the numbers 1-100 are split across four processes; process 0 sums 1-25 (325), process 1 sums 26-50 (950), process 2 sums 51-75 (1,575), and process 3 sums 76-100 (2,200). The partial sums are exchanged over the high-speed network (InfiniBand) via the Message Passing Interface, and process 0 adds them to obtain the total 5,050.
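As an illustration, the same computation can be written with mpi4py, the Python bindings for MPI. This is a minimal sketch, assuming mpi4py is available inside the container; the file name mpi_sum.py is hypothetical, and the program would be launched with, e.g., mpirun -np 4 python mpi_sum.py.

# Minimal mpi4py sketch of the figure above: four processes each sum 25 of the
# numbers 1..100, and process 0 reduces the partial sums.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()                        # 0, 1, 2, 3
size = comm.Get_size()                        # 4 when launched with -np 4

chunk = 100 // size                           # 25 numbers per process
start = rank * chunk + 1                      # process 0: 1..25, process 1: 26..50, ...
local_sum = sum(range(start, start + chunk))  # 325, 950, 1575, 2200

total = comm.reduce(local_sum, op=MPI.SUM, root=0)  # partial sums travel over the network
if rank == 0:
    print("Sum:", total)                      # Sum: 5050

Only the four partial sums cross the network; each process works on its own slice of the data, which is what lets the pattern scale across nodes.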
Horovod (Distributed deep learning framework)
• Distributed deep learning training framework developed by Uber:
- Makes distributed deep learning fast and easy to use.
- Enables training across multiple hosts, each with multiple GPUs.
- Supports TensorFlow, Keras, PyTorch, and Apache MXNet.
• Distributed training:
- Data parallelism (split the data).
- Model parallelism (split the layers).
19
Reduces training time for deep neural networks by using multiple GPUs.
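For illustration, a minimal data-parallel training loop with Horovod and PyTorch could look like the sketch below, assuming horovod and torch are installed inside the Singularity image; the model, data, and file name train_hvd.py are placeholders, and the job would be launched with something like horovodrun -np 8 python train_hvd.py (one process per GPU).

import torch
import torch.nn as nn
import horovod.torch as hvd

hvd.init()                                    # one process per GPU
torch.cuda.set_device(hvd.local_rank())       # pin this process to its local GPU

model = nn.Linear(1024, 10).cuda()            # toy placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Average gradients across all workers on every step (ring all-reduce).
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())

# Start every worker from identical weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

loss_fn = nn.CrossEntropyLoss()
for step in range(100):
    # In real training, each worker reads its own shard of the dataset (data parallelism).
    x = torch.randn(32, 1024).cuda()
    y = torch.randint(0, 10, (32,)).cuda()
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    if hvd.rank() == 0 and step % 10 == 0:
        print(f"step {step}: loss {loss.item():.4f}")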
Monitoring
20
Using GIST AI-X Computing Cluster
21
How to Use?
1. Create an ID/password (contact the admin or ask your labmate).
2. Connect to the login node (SSH).
3. Copy your data to the login node (mounted directory).
4. Request resources and submit a job (partitions: v100 and a100).
5. Move to the DGX node with the requested resources.
6. Build a writable Singularity image.
7. Run the Singularity container and do your work.
22
Requesting resources and building container image
23
24
Running container
Resource Allocation Policy
25
Flavor Name  | GPUs    | CPU cores | Memory | Storage | Number of jobs | Time
small_v100   | 2x V100 | 20        | 100 GB | 3 TB    | 4              | 3 days
medium_v100  | 4x V100 | 40        | 200 GB | 3 TB    | 2              | 7 days
large_v100   | 8x V100 | 80        | 450 GB | 3 TB    | 2              | 21 days
small_a100   | 2x A100 | 64        | 200 GB | 3 TB    | 4              | 3 days
medium_a100  | 4x A100 | 128       | 450 GB | 3 TB    | 2              | 7 days
large_a100   | 8x A100 | 256       | 950 GB | 3 TB    | 2              | 21 days
Flavor small_v100 in detail (a sample batch script is sketched below):
- Resources: at most 2 GPUs, 20 CPU cores, and 100 GB memory for a single job.
- Number of jobs: a user can submit at most 4 jobs with one user ID.
- Time limit: a single job can run for 3 days (72 hours); after 3 days it is automatically canceled.
- You can restart your job afterwards if enough resources are available.
• All labs can use the GIST AI-X computing cluster.
• By default, flavor “small_v100” is allocated to each lab account.
• If you need more resources, contact the admin.
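As a sketch of how the default small_v100 flavor could be requested, a batch script along the following lines would be submitted with sbatch job_small_v100.py; sbatch reads the #SBATCH comment lines even though the body is Python. The file name, job name, and exact option values are assumptions and should be confirmed with the admin.

#!/usr/bin/env python3
# Hypothetical batch script for the default "small_v100" flavor
# (2 GPUs, 20 CPU cores, 100 GB memory, 3-day limit, v100 partition).
#SBATCH --job-name=small_v100_test
#SBATCH --partition=v100
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=20
#SBATCH --mem=100G
#SBATCH --time=3-00:00:00

import os
import subprocess

# Report what Slurm actually allocated to this job.
print("Job ID:", os.environ.get("SLURM_JOB_ID"))
print("Allocated GPUs:", os.environ.get("CUDA_VISIBLE_DEVICES"))
subprocess.run(["nvidia-smi", "-L"], check=False)

# The real workload would normally run inside a Singularity container, e.g.:
# subprocess.run(["singularity", "exec", "--nv", "my_image.sif", "python", "train.py"])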
Thank You!
Q&A
26
Email: jargalsaikhan.n@gist.ac.kr, Phone: 6356
Office: AI Graduate School Building S7, 1st Floor, Researcher’s Office