Exploring Large-Scale Model Training with Amazon SageMaker - 김대근, AWS AI/ML Specialist Solutions Architect / 최영준, AWS Solutions Architect :: AWS Summit Seoul 2021

© 2021, Amazon Web Services, Inc. or its affiliates. All rights reserved.
KOREA | MAY 11-12, 2021

Exploring Large-Scale Model Training
with Amazon SageMaker
김대근
AI/ML Specialist Solutions Architect
AWS
최영준
Solutions Architect
AWS

Agenda
Background
Three AWS technologies for improving distributed training performance
Monitoring distributed training
Demo

Background

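Key issues in distributed training
Bottlenecks in network I/O and between storage ↔ CPU ↔ GPU
• Training needs fast access to gigabytes or terabytes of data
• Training on hundreds of GPUs or more requires infrastructure management experts
Monitoring
• No visibility into the model training process
• No integrated solution for monitoring resource utilization and training errors
Existing solutions (EC2 DLAMI / AWS Batch / ECS & EKS, etc.)
• Setting up and synchronizing the infrastructure and environment for distributed training is difficult
• No debugging tools (e.g., gradient checking, attention, Grad-CAM)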

Three AWS technologies for
improving distributed training performance

Architecture for SageMaker Distributed Training
• Amazon S3: training dataset
• Amazon SageMaker: training script
• Amazon FSx for Lustre
• Amazon ECR
• SageMaker Training Job: p3dn/p4d instances in a VPC
• Managed and optimized, with the data/model parallel training toolkits included
• Amazon CloudWatch
• Amazon SageMaker Debugger
• Amazon S3: model artifacts

Amazon FSx for Lustre on Amazon SageMaker
Components: Amazon S3, Amazon FSx for Lustre, Amazon SageMaker
• Can be used as an input data source for SageMaker machine learning models
• Removes the S3 download step, so training can start on the file system right away → faster training
• No need to repeatedly download common objects for repeated jobs on the same dataset
Massively scalable performance
• 100+ GiB/s throughput
• Millions of IOPS
• Consistent low latencies

Elastic Fabric Adapter (EFA)
• Scales tightly coupled HPC and ML workloads
• Network latency under 15 µs
• 100 Gbps network bandwidth
• EFA: scale tightly coupled HPC applications on AWS
• Elastic Fabric Adapter is best suited to large-scale HPC workloads

SageMaker's Model/Data Parallelism Library
Amazon SageMaker Data Parallel
[Figure: GPU 0, GPU 1, …, GPU N each work with the weights W and the weight gradients W_grad on a (batch size / N) share of the batch]
Amazon SageMaker Model Parallel
[Figure: the model is split into subgraph 1, subgraph 2, …, subgraph n across GPU 0, GPU 1, …, GPU N; activations X_act and gradients X_grad pass between subgraphs; the batch is split (batch-split)]
Reference: Fast Distributed Training of Deep Neural Networks: Dynamic Communication Thresholding for Model and Data Parallelism

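How SageMaker Data Parallelism works
Upload the data to S3, then run distributed training on SageMaker with TensorFlow or PyTorch.
• Amazon S3: store the training dataset
• Import the data parallelism library in the training script
• Amazon SageMaker: start a SageMaker training job with data parallelism enabled
• Trained model in Amazon S3, ready to deploy
Amazon SageMaker includes the data parallel training toolkit
• Supports widely used APIs such as Horovod and Distributed Data Parallel
• Fully managed training for the worker replicas
• Automatically synchronizes workers across multiple GPUs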

Traditional Data Parallelism
Two approaches: Parameter Server, Ring All-reduce (Horovod)
• The communication cost is high.
• A bottleneck occurs.
[Figure: gradient averaging across GPU 0, GPU 1, GPU 2, …, GPU N; partial sums accumulate as grad_GPU0, then grad_GPU1 + grad_GPU2, then grad_GPU1 + grad_GPU2 + … + grad_GPUN]
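To make the ring all-reduce idea concrete, here is a minimal single-process simulation. It is a sketch of the textbook algorithm (reduce-scatter followed by all-gather), not Horovod's or SageMaker's implementation, and it assumes each worker's gradient is already split into exactly as many chunks as there are workers. The function name `ring_allreduce` is made up for this example.

```python
from typing import List


def ring_allreduce(grads: List[List[float]]) -> List[List[float]]:
    """Simulate a sum all-reduce over n workers arranged in a ring.

    grads[r][c] is chunk c of the gradient held by worker r.
    Returns the per-worker buffers after the all-reduce.
    """
    n = len(grads)
    bufs = [list(g) for g in grads]  # copy so the input is not modified
    assert all(len(b) == n for b in bufs), "sketch assumes one chunk per worker"

    # Phase 1, reduce-scatter: in each step every worker sends one chunk to
    # its right-hand neighbour, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, bufs[r][(r - step) % n]) for r in range(n)]
        for r, chunk, value in sends:
            bufs[(r + 1) % n][chunk] += value

    # Phase 2, all-gather: worker r now owns the full sum of chunk (r + 1) % n
    # and the completed chunks are passed around the ring.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, bufs[r][(r + 1 - step) % n]) for r in range(n)]
        for r, chunk, value in sends:
            bufs[(r + 1) % n][chunk] = value
    return bufs


if __name__ == "__main__":
    summed = ring_allreduce([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [100.0, 200.0, 300.0]])
    averaged = [[v / 3 for v in buf] for buf in summed]  # gradient averaging
    print(averaged[0])
```

Each worker sends and receives 2 × (n - 1) chunks in total, so the number of communication steps grows with the number of workers.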

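Balanced Fusion Buffers (BFB)
Traditional
• A traditional parameter server treats each variable as an atomic unit.
• Each variable is placed on a single server.
• Each server is active during only part of each backward pass.
• Figure labels: Wh1 500M, Wh2 100M
BFB
• A BFB is a buffer on each GPU that holds the gradients.
• Exactly the same number of bytes is assigned to every server.
• The i-th server receives the i-th partition of the BFB from all workers, sums them, and sends the result to all workers.
• Figure labels: 150M 150M 150M 150M
• Parameter servers : workers = 1:1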
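The server-side step described above (server i sums partition i from every worker and sends it back) can be sketched in plain Python. This is an illustrative simulation, not the library's code. The function name `bfb_allreduce` is hypothetical, and the buffer length is assumed to divide evenly by the number of servers.

```python
from typing import List


def bfb_allreduce(worker_buffers: List[List[float]], num_servers: int) -> List[List[float]]:
    """Sum equally sized partitions of each worker's fused gradient buffer.

    Server i handles only the i-th partition, so every server processes
    exactly the same number of elements.
    """
    size = len(worker_buffers[0])
    assert all(len(b) == size for b in worker_buffers)
    assert size % num_servers == 0, "sketch assumes an evenly divisible buffer"
    part = size // num_servers

    reduced = [0.0] * size
    for server in range(num_servers):
        lo, hi = server * part, (server + 1) * part
        # The server receives partition [lo, hi) from every worker and sums it.
        for buf in worker_buffers:
            for j in range(lo, hi):
                reduced[j] += buf[j]

    # Every worker receives the full reduced buffer back.
    return [list(reduced) for _ in worker_buffers]


if __name__ == "__main__":
    print(bfb_allreduce([[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]], num_servers=2))
```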

Balanced Fusion Buffers (BFB) example
Gradient variables, in buffer order:
Variable   5      3      4       2      1
Size       50MB   50MB   250MB   75MB   100MB
BFB layout on each GPU (GPU 0, GPU 1), split at 100MB boundaries:
Chunk      5      3      4_1     4_2     4_3    2_1    2_2    1_1    1_2
Size       50MB   50MB   100MB   100MB   50MB   50MB   25MB   75MB   25MB
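The slide does not spell out how the fused buffer is cut. As a hedged illustration, a simple greedy split at fixed 100MB boundaries reproduces the chunk sizes in the example above. The function name `partition_fused_buffer` and the greedy rule are assumptions for this sketch.

```python
from typing import List, Tuple


def partition_fused_buffer(
    sizes_mb: List[Tuple[str, int]], partition_mb: int
) -> List[List[Tuple[str, int, int]]]:
    """Lay variables end to end and cut the result into equal partitions.

    Returns a list of partitions; each entry is (variable, piece_number, size_mb).
    A variable that crosses a partition boundary is split into several pieces.
    """
    partitions: List[List[Tuple[str, int, int]]] = [[]]
    room = partition_mb  # space left in the current partition
    for name, size in sizes_mb:
        remaining, piece = size, 0
        while remaining > 0:
            if room == 0:  # current partition is full, open a new one
                partitions.append([])
                room = partition_mb
            take = min(remaining, room)
            piece += 1
            partitions[-1].append((name, piece, take))
            remaining -= take
            room -= take
    return partitions


if __name__ == "__main__":
    variables = [("5", 50), ("3", 50), ("4", 250), ("2", 75), ("1", 100)]
    for i, part in enumerate(partition_fused_buffer(variables, 100)):
        print(i, part)
```

With these inputs every partition except the last holds exactly 100MB, regardless of how large the individual variables are.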

Training Script (PyTorch)
import torch
from torch.utils.data.distributed import DistributedSampler
import smdistributed.dataparallel.torch.distributed as dist
from smdistributed.dataparallel.torch.parallel.distributed import DistributedDataParallel as DDP

# 01. Initialize
dist.init_process_group()

# 02. Scale parameters (assumes 8 GPUs per instance)
batch_size //= dist.get_world_size() // 8
batch_size = max(batch_size, 1)

# 03. Set rank in DistributedSampler
train_sampler = DistributedSampler(
    train_dataset, num_replicas=dist.get_world_size(),
    rank=dist.get_rank())
train_loader = torch.utils.data.DataLoader(..)

# 04. Pin each GPU
model = DDP(model.to(device))
torch.cuda.set_device(dist.get_local_rank())
model.cuda(dist.get_local_rank())

# 05. Checkpoint on master node
if dist.get_rank() == 0:
    torch.save(model.state_dict(), checkpoint_path)  # checkpoint_path: placeholder file path
https://sagemaker.readthedocs.io/en/stable/api/training/smd_data_parallel.html

SageMaker launch code
Add distribution to the code that launches training.
from sagemaker.pytorch import PyTorch

# Declare distributed training (set up to run on CUDA 11)
distribution = {"smdistributed": {"dataparallel": {"enabled": True}}}

estimator = PyTorch(
    instance_type='ml.p4d.24xlarge',  # ml.p3.16xlarge and ml.p3dn.24xlarge are also supported
    instance_count=8,
    framework_version='1.6.0',
    py_version='py36',
    ...
    distribution=distribution,
)

SageMaker Data Parallelism vs. TF2 Horovod (p3dn.24xl nodes)
                      Throughput                    Scaling efficiency
Model          Nodes  TF2-HVD  SageMaker  Speed-up  TF2-HVD  SageMaker  Improvement
BERT Large     2      1300     1600       23%       66%      82%        16%
(seqs/sec)     4      2316     2440       5%        59%      62%        3%
               8      5000     5200       4%        63%      66%        3%
MaskRCNN       2      130      140        8%        84%      91%        7%
(samples/sec)  4      255      259        2%        83%      84%        1%
               8      495      525        6%        80%      85%        5%
• Throughput: number of data items processed per second
• Scaling efficiency: cluster throughput relative to (single-node throughput × number of nodes)
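The two derived columns follow directly from the definitions above. A small sketch, with hypothetical function names, shows the arithmetic. The single-node throughput used in the demo is an assumed value for illustration, since the slide does not list it.

```python
def speedup_pct(baseline: float, improved: float) -> float:
    """Relative throughput gain of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline


def scaling_efficiency_pct(
    cluster_throughput: float, single_node_throughput: float, num_nodes: int
) -> float:
    """Cluster throughput relative to (single-node throughput x number of nodes), in percent."""
    return 100.0 * cluster_throughput / (single_node_throughput * num_nodes)


if __name__ == "__main__":
    # BERT Large on 2 nodes: 1300 seqs/sec (TF2-HVD) vs. 1600 seqs/sec (SageMaker).
    print(round(speedup_pct(1300, 1600)))
    # Assumed single-node throughput of 1000 seqs/sec, for illustration only.
    print(scaling_efficiency_pct(1600, 1000, 2))
```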

SageMaker Data Parallelism vs. PyTorch DDP (p3dn.24xl nodes)
                      Throughput                    Scaling efficiency
Model          Nodes  PT-DDP   SageMaker  Speed-up  PT-DDP   SageMaker  Improvement
BERT Large     2      1752     2479       41%       64%      90%        26%
(seqs/sec)     4      3017     4603       52%       55%      84%        29%
               8      7409     8551       15%       67%      78%        11%
MaskRCNN       2      152      158        4%        82%      85%        3%
(samples/sec)  4      258      307        19%       70%      83%        13%
               8      545      617        13%       74%      84%        10%
• Throughput: number of data items processed per second
• Scaling efficiency: cluster throughput relative to (single-node throughput × number of nodes)

• Automatically partitions the model and sends the subgraphs to devices
• Analyzes the variables and graph structure
• Managed distributed training over pipelined microbatches

Naïve Model Parallelism
• Layer-wise partition (pipeline)
• Forward and backward passes are computed sequentially
[Figure: timeline of GPU 0, GPU 1, and GPU 2 over time; each GPU is idle while the others compute]
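The cost of that idle time can be estimated with the usual pipeline "bubble" formula. The sketch below assumes every stage takes the same time per microbatch and that the batch may be split into microbatches that flow through the stages back to back. The function name is made up for illustration.

```python
def pipeline_idle_fraction(num_stages: int, num_microbatches: int) -> float:
    """Fraction of time each GPU is idle in a simple, evenly balanced pipeline.

    With K stages and M microbatches, one pass takes M + K - 1 time slots,
    and each stage is busy for only M of them.
    """
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("num_stages and num_microbatches must be positive")
    total_slots = num_microbatches + num_stages - 1
    return (num_stages - 1) / total_slots


if __name__ == "__main__":
    # Naive model parallelism: the whole batch moves as one unit (M = 1).
    print(pipeline_idle_fraction(num_stages=3, num_microbatches=1))
    # Splitting the batch into 4 microbatches shrinks the idle share.
    print(pipeline_idle_fraction(num_stages=3, num_microbatches=4))
```

With three GPUs and no microbatching, each GPU is idle two thirds of the time; four microbatches cut that to one third under these assumptions.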

Training Script (PyTorch)
import torch.nn.functional as F
import smdistributed.modelparallel.torch as smp

# 01. Initialize
smp.init()
...
model = smp.DistributedModel(model)
scaler = smp.amp.GradScaler()
optimizer = smp.DistributedOptimizer(optimizer)

# 02. Define @smp.step
@smp.step
def train_step(model, scaler, data, target):
    output = model(data)
    loss = F.nll_loss(output, target, reduction="mean")
    model.backward(loss)
    return output, loss

for batch_idx, (data, target) in enumerate(train_loader):
    _, loss_mb = train_step(model, scaler, data, target)
    # 03. SageMaker model parallel: average the loss across microbatches
    loss = loss_mb.reduce_mean()
    # 04. Print the loss only at rank 0
    if smp.rank() == 0:
        print(f"Loss: {loss}")
https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel.html

Model Parallelism Parameters
mpi_options = {
    "enabled": True,
    "processes_per_host": 8,
    "custom_mpi_options": "--mca btl_vader_single_copy_mechanism none ",
}
smp_options = {
    "enabled": True,
    "parameters": {
        "microbatches": 4,
        "placement_strategy": "spread",
        "pipeline": "interleaved",
        "optimize": "speed",
        "partitions": 2,
        "ddp": True,
    }
}
estimator = PyTorch(
    ...
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    }
)
https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel.html
[Figure: Simple vs. Interleaved pipeline schedules. F0: forward pass of microbatch 0; B0: backward pass of microbatch 0]
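To see how these parameters interact, the following sketch works through the arithmetic under two stated assumptions: with "ddp" enabled, the number of data-parallel model replicas is the world size divided by "partitions", and each batch is divided evenly into "microbatches". The helper name `parallel_layout` and the `instance_count` and `batch_size` values are hypothetical, since the slide elides them.

```python
from typing import Dict


def parallel_layout(
    instance_count: int,
    processes_per_host: int,
    partitions: int,
    batch_size: int,
    microbatches: int,
) -> Dict[str, int]:
    """Derive the process layout implied by a model parallel configuration."""
    world_size = instance_count * processes_per_host  # one process per GPU
    if world_size % partitions != 0:
        raise ValueError("world size must be divisible by partitions")
    if batch_size % microbatches != 0:
        raise ValueError("batch size must be divisible by microbatches")
    return {
        "world_size": world_size,
        "model_replicas": world_size // partitions,  # data-parallel copies of the model
        "microbatch_size": batch_size // microbatches,
    }


if __name__ == "__main__":
    # One 8-GPU instance, 2 partitions, 4 microbatches; a batch size of 64 is assumed.
    print(parallel_layout(1, 8, 2, 64, 4))
```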

Pipeline Execution (TensorFlow)

SageMaker Model Parallelism performance (samples/sec)
Num GPUs   Without modelparallel   With modelparallel
2          OOM                     15.3
4          OOM                     24.0
8          208.8                   264.0
8          OOM                     18.8
Benchmark configurations:
• GPT-2 Large, TensorFlow, p3.16xlarge, 2 partitions
• BERT Large, TensorFlow, p3.16xlarge, 2 partitions
• T5-3B, PyTorch, p3dn.24xlarge, 8 partitions
• partitions: the number of subgraphs the model's graph is split into after analysis
• OOM: Out Of Memory

Monitoring distributed training

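SageMaker model monitoring
• SageMaker metric definitions: monitor the performance of the model being trained
• SageMaker Debugger: collect tensors during model training and detect anomalies
• SageMaker Debugger Profiling: monitor training cluster utilization
[Figure: a GPU cluster of P4d instances training the model and emitting metrics]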

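Monitoring the performance of the model being trained
• The model writes a training log
• SageMaker settings: graphs are produced from the log information
• Example metrics: train:Prec@1, train:Loss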
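Metric definitions pair a metric name with a regular expression that is applied to the training log. The sketch below mimics that matching locally with Python's `re` module. The log line format and the regular expressions are assumptions chosen for illustration, and `extract_metrics` is a hypothetical helper, not a SageMaker function.

```python
import re
from typing import Dict

# Same shape as the metric_definitions list passed to a SageMaker estimator;
# the regular expressions assume a log format like the sample line below.
metric_definitions = [
    {"Name": "train:Loss", "Regex": r"Loss ([0-9]+\.?[0-9]*)"},
    {"Name": "train:Prec@1", "Regex": r"Prec@1 ([0-9]+\.?[0-9]*)"},
]


def extract_metrics(log_line: str) -> Dict[str, float]:
    """Apply every metric regex to one log line and collect the captured values."""
    found: Dict[str, float] = {}
    for definition in metric_definitions:
        match = re.search(definition["Regex"], log_line)
        if match:
            found[definition["Name"]] = float(match.group(1))
    return found


if __name__ == "__main__":
    print(extract_metrics("Epoch: [3][10/50] Loss 0.6931 Prec@1 71.250"))
```

Testing the expressions locally like this before launching a job helps catch patterns that never match the script's actual log output.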

SageMaker Debugger
[Figure: Debugger workflow, steps 1 to 4]
• Training in progress: the Debugger Hook captures tensors (loss, weights, biases, gradients, etc.) and system/framework metrics
• Write/read requests go to Amazon Simple Storage Service (S3)
• Amazon SageMaker Debugger evaluates rules on the saved loss, gradients, …
  Rule 1: Vanishing Gradient
  Rule 2: Loss Not Decreasing
  Rule N: Custom Rule
• Amazon CloudWatch Event triggers an action: stop training, email
• Analyze/visualize in Amazon SageMaker Studio/Notebook
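The rule idea can be illustrated with two toy checks over values that a hook might have saved. These are simplified stand-ins written for this example, not the built-in Debugger rules; the function names, the window size, and the threshold are all assumptions.

```python
from typing import Sequence


def loss_not_decreasing(losses: Sequence[float], window: int = 3) -> bool:
    """True if the best loss in the last `window` steps is no better than before."""
    if len(losses) <= window:
        return False  # not enough history to judge
    best_before = min(losses[:-window])
    best_recent = min(losses[-window:])
    return best_recent >= best_before


def vanishing_gradient(grad_abs_means: Sequence[float], threshold: float = 1e-7) -> bool:
    """True if the mean absolute gradient of every layer is below `threshold`."""
    return len(grad_abs_means) > 0 and all(g < threshold for g in grad_abs_means)


def choose_action(losses: Sequence[float], grad_abs_means: Sequence[float]) -> str:
    """Map rule results to an action name, mirroring the stop-training action."""
    if vanishing_gradient(grad_abs_means) or loss_not_decreasing(losses):
        return "stoptraining"
    return "continue"


if __name__ == "__main__":
    print(choose_action([1.0, 0.8, 0.81, 0.82, 0.83], [0.02, 0.01]))
    print(choose_action([1.0, 0.8, 0.6, 0.5, 0.4], [0.02, 0.01]))
```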

Collecting tensors during training and detecting anomalies
• Plot of the collected loss
• Plot of the gradient distribution
• Applied example using image gradients and weights: implementing Class Activation Maps

Monitoring training cluster utilization: Profiling
Debugger Profiling automatically monitors system resource utilization and delivers the results as a comprehensive report.

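Monitoring training cluster utilization: Profiling
The report lists the issues raised by the predefined rules, together with recommendations, the number of occurrences, and the parameter values set for each rule.
Report columns: Rule, Issue, Recommendation, Number of occurrences, Parameter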

Distributed training monitoring example
• Detect Performance Bottlenecks
  [Figure: GPU and CPU utilization over time (minutes), with the point where a bottleneck occurs marked]
• Detect Training Issues: Low Model Performance
[Figure: a GPU cluster of P4d instances training the model and emitting metrics]
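A bottleneck of the kind shown in the utilization plot can be flagged with a simple scan over sampled utilization values. The sketch assumes that low GPU utilization combined with high CPU utilization signals a bottleneck; the thresholds and the function name are illustrative choices, not Debugger defaults.

```python
from typing import List, Sequence, Tuple


def find_bottleneck_intervals(
    gpu_util: Sequence[float],
    cpu_util: Sequence[float],
    gpu_low: float = 50.0,
    cpu_high: float = 90.0,
) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) sample ranges where the GPU is underused
    while the CPU is saturated."""
    n = min(len(gpu_util), len(cpu_util))
    intervals: List[Tuple[int, int]] = []
    start = None
    for i in range(n):
        flagged = gpu_util[i] < gpu_low and cpu_util[i] > cpu_high
        if flagged and start is None:
            start = i  # a flagged run begins
        elif not flagged and start is not None:
            intervals.append((start, i - 1))  # the run ended at the previous sample
            start = None
    if start is not None:
        intervals.append((start, n - 1))
    return intervals


if __name__ == "__main__":
    gpu = [95, 40, 30, 90, 20]
    cpu = [50, 95, 97, 60, 99]
    print(find_bottleneck_intervals(gpu, cpu))
```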

Demo

Demo Code
https://github.com/Napkin-DL/ml-ws.git
https://github.com/aws-samples/sagemaker-distributed-training-pytorch-kr.git

We look forward to your valuable feedback.
Please take part in the session evaluation after the talk!

Thank you
