© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Amazon Elastic Inference
Reduce deep learning inference costs by 75%
Hagay Lupesko, Amazon AI
Agenda
• GPU inference in production – characteristics and challenges
• Amazon Elastic Inference – a cost-effective and flexible approach
• Coding inference with Elastic Inference
• Demo
You will learn how to start using Elastic Inference and reduce your deep learning inference cost by ~75%.
Deep learning workload cost
[Chart: inference ~90% of cost, training ~10%]
GPU inference in production
A closer look at GPU utilization for inference
[Chart: images per second vs. batch size (1 to 64) for an Inception model on a single GPU. Throughput reaches nearly 1,000 images/sec at batch size 64 but only about 100 images/sec at batch size 1; the GPU is roughly 90% underutilized for single-batch-size inference.]
A closer look at GPU cost for inference
P2 instances are more cost-effective for online inference with small batch sizes.
How can we get cost-effective acceleration?
Introducing
• 75% reduced cost
• 1 to 32 TFLOPS
• Integrated with Amazon EC2
• Integrated with Amazon SageMaker
• Support for TensorFlow
• Support for MXNet & ONNX
Acceleration sizes tailored for inference
Accelerator Type | FP32 TOPS | FP16 TOPS | Accelerator Memory (GB) | Price ($US/hr)
eia1.medium      |     1     |     8     |            1            |     $0.13
eia1.large       |     2     |    16     |            2            |     $0.26
eia1.xlarge      |     4     |    32     |            4            |     $0.52
Available in N. Virginia, Ohio, Oregon, Dublin, Tokyo, and Seoul
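To put the table in perspective, here is a quick back-of-the-envelope cost comparison against standalone GPU instances, using the accelerator prices above and the GPU prices quoted in the speaker notes; the c5.large and p3.2xlarge on-demand prices are assumptions for illustration.

# Rough hourly cost comparison: small CPU host + EI accelerator vs. GPU instances.
# The c5.large and p3.2xlarge prices are assumed placeholders; the EI and p2.xlarge
# prices come from the table above and the speaker notes.
C5_LARGE = 0.085          # assumed on-demand $/hr for the CPU host
EIA1_MEDIUM = 0.13        # eia1.medium $/hr (from the table above)
P2_XLARGE = 0.90          # p2.xlarge $/hr (from the speaker notes)
P3_2XLARGE = 3.06         # roughly the $3/hr quoted in the speaker notes

combo = C5_LARGE + EIA1_MEDIUM
print('c5.large + eia1.medium: $%.2f/hr' % combo)
print('vs p2.xlarge:  %.0f%% cheaper' % (100 * (1 - combo / P2_XLARGE)))
print('vs p3.2xlarge: %.0f%% cheaper' % (100 * (1 - combo / P3_2XLARGE)))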
Inference Performance with EI vs GPU
[Charts: inferences per second for several computer vision models (image classification and object detection), comparing EI accelerator sizes paired with a small EC2 instance against p2.xlarge and p3.2xlarge GPU instances.]
How does Elastic Inference work with Amazon EC2?
[Diagram: an EC2 instance inside a VPC accessing a dedicated EI accelerator through an AWS PrivateLink endpoint in the same Availability Zone of the Region.]
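Launching an accelerator alongside an instance is just an extra parameter on the EC2 RunInstances call. Below is a minimal boto3 sketch of that launch; the AMI ID, key pair, subnet, and security group are placeholders, and it assumes the subnet already has the AWS PrivateLink endpoint for Elastic Inference and an instance role with the required EI permissions.

# Minimal sketch: launch a small CPU instance with an attached eia1.medium accelerator.
# All resource IDs below are placeholders; the subnet is assumed to have an AWS
# PrivateLink VPC endpoint for Elastic Inference configured.
import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

response = ec2.run_instances(
    ImageId='ami-0123456789abcdef0',            # e.g. a Deep Learning AMI (placeholder ID)
    InstanceType='c5.large',                    # small CPU host paired with the accelerator
    MinCount=1,
    MaxCount=1,
    KeyName='my-key-pair',                      # placeholder key pair
    SubnetId='subnet-0123456789abcdef0',        # placeholder subnet
    SecurityGroupIds=['sg-0123456789abcdef0'],  # placeholder security group
    ElasticInferenceAccelerators=[{'Type': 'eia1.medium'}],
)
print(response['Instances'][0]['InstanceId'])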
How does Elastic Inference work with SageMaker?
SageMaker Notebooks
SageMaker Hosted Endpoints
Framework Support
• Amazon EI-enabled TensorFlow Serving
• Amazon EI-enabled Apache MXNet
• ONNX (applied using Apache MXNet; see the sketch below)
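To make the ONNX path concrete, here is a hedged sketch of running an ONNX model through the EI-enabled Apache MXNet package; the model file name, input name ('data'), and input shape are assumptions for illustration, and the actual input name comes from the ONNX graph itself.

# Hedged sketch: import an ONNX model with MXNet's ONNX importer and run it on the
# attached EI accelerator. File name, input name, and shape are assumed placeholders.
import mxnet as mx
from mxnet.contrib import onnx as onnx_mxnet

# Convert the ONNX graph into an MXNet symbol plus parameters
sym, arg_params, aux_params = onnx_mxnet.import_model('resnet50.onnx')

# Bind against the EI accelerator context provided by the EI-enabled MXNet build
mod = mx.mod.Module(symbol=sym, context=mx.eia(), label_names=None)
mod.bind(for_training=False, data_shapes=[('data', (1, 3, 224, 224))])
mod.set_params(arg_params, aux_params, allow_missing=True)

# Inference looks the same as on CPU or GPU
out = mod.predict(mx.nd.random.uniform(shape=(1, 3, 224, 224)))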
Let’s look at some code!
Inference with Apache MXNet on CPU
# Loading a resnet-152 model from the local filesystem
sym, arg_params, aux_params = mx.model.load_checkpoint('resnet-152', 0)
# Initializing the module with a CPU context
mod = mx.mod.Module(symbol=sym, context=mx.cpu(), label_names=None)
# Binding the model for inference
mod.bind(for_training=False, data_shapes=[('data', (1,3,224,224))],
label_shapes=mod._label_shapes)
# Loading the weights
mod.set_params(arg_params, aux_params, allow_missing=True)
# Loading and pre-processing the image
img = get_image(...)
# And finally calling inference
mod.predict(mx.nd.array(img))
Inference with Apache MXNet on Elastic Inference
# Loading a resnet-152 model from the local filesystem
sym, arg_params, aux_params = mx.model.load_checkpoint('resnet-152', 0)
# Initializing the module with an EIA context
mod = mx.mod.Module(symbol=sym, context=mx.eia(), label_names=None)
# Binding the model for inference
mod.bind(for_training=False, data_shapes=[('data', (1,3,224,224))],
label_shapes=mod._label_shapes)
# Loading the weights
mod.set_params(arg_params, aux_params, allow_missing=True)
# Loading and pre-processing the image
img = get_image(...)
# And finally calling inference
mod.predict(mx.nd.array(img))
Inference with SageMaker and Apache MXNet
from sagemaker.mxnet.model import MXNetModel
# Load a pre-trained model from s3
sagemaker_model = MXNetModel(model_data = 's3://.../model.tar.gz', ...)
# Initialize a predictor
predictor = sagemaker_model.deploy(
initial_instance_count=1,
instance_type='ml.m4.xlarge')
Inference with SageMaker and Apache MXNet
Accelerated by Elastic Inference
from sagemaker.mxnet.model import MXNetModel
# Load a pre-trained model from s3
sagemaker_model = MXNetModel(model_data = 's3://.../model.tar.gz', ...)
# Initialize a predictor
predictor = sagemaker_model.deploy(
initial_instance_count=1,
instance_type='ml.m4.xlarge',
accelerator_type='ml.eia1.medium')
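Once the endpoint is up, invoking it looks the same with or without the accelerator attached. A hypothetical usage sketch follows; the random array is only a stand-in for a real preprocessed image.

# Hypothetical usage of the predictor created above; the random payload stands in
# for a preprocessed image batch of shape (1, 3, 224, 224).
import numpy as np

payload = np.random.rand(1, 3, 224, 224).tolist()
result = predictor.predict(payload)
print(result)

# Delete the endpoint when you are done experimenting to stop incurring cost
predictor.delete_endpoint()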
Demo (https://bit.ly/2CDTtV9)
Managed model hosting with:
- Amazon SageMaker
- Elastic Inference
- Apache MXNet
Wifi: GBDC Zettabytes 2019 / gdbc2019
How to choose the EI size you need?
Considerations for choosing instance type and accelerator:
• Latency requirements -> EI TFLOPs
• Model size -> EI Memory
• Input/output data payload has an impact on latency
• Convert to FP16 for lower latency and higher throughput
• Experiment: try it out and measure! (See the latency-measurement sketch below.)
Start small and size up as needed
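As a starting point for "try it out and measure", the sketch below times repeated predictions against the MXNet Module from the earlier code slides; mod and img are assumed to already be defined, the model is assumed to have a single output, and a warm-up loop absorbs one-off setup cost before latencies are recorded.

# Hedged latency-measurement sketch; assumes `mod` and `img` from the earlier
# MXNet code slides are already in scope and the model has a single output.
import time
import mxnet as mx

# Warm up: the first calls can include one-off model loading/initialization cost
for _ in range(10):
    mod.predict(mx.nd.array(img)).wait_to_read()

latencies_ms = []
for _ in range(100):
    start = time.perf_counter()
    out = mod.predict(mx.nd.array(img))
    out.wait_to_read()  # block until the asynchronous result is actually ready
    latencies_ms.append((time.perf_counter() - start) * 1000)

latencies_ms.sort()
print('p50: %.1f ms  p90: %.1f ms' % (latencies_ms[49], latencies_ms[89]))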
Summary
• Use EI to reduce your deep learning inference costs by up to 75%
• Launch with any EC2 instance type
• Host for inference with Amazon SageMaker for a fully managed experience
• Deploy TensorFlow, MXNet, and ONNX models with minimal code changes
To get started with Amazon Elastic Inference, visit:
https://aws.amazon.com/machine-learning/elastic-inference/
Thank you!
To get started with Amazon Elastic Inference, visit:
https://aws.amazon.com/machine-learning/elastic-inference/
Editor's Notes
  • #2: <introduction> Before we start, with a show of hands: - Who is an AWS customer? - Who runs deep learning workloads for training or inference? - Who has heard of, or uses, Amazon Elastic Inference?
  • #4: When we look at the costs associated with running deep learning, or more generally machine learning, we can divide them into two large categories: training models and making predictions with them, also known as inference. Training cost typically scales with two key factors. The first is model complexity: the more complex and expressive your model is, the more time and resources it usually takes to train. The second is the amount of training data used: the more data you use for training, the more compute and storage capacity you will need. For inference, cost scales with other factors, first and foremost your user base. As your user base grows, so will your inference calls in production! Other factors are all related to maintaining a service in production: availability and redundancy and global distribution all require setting up hosts that can serve inference requests. It turns out that for most customers, inference workloads account for approximately 90% of the overall cost of deep learning and machine learning, and are the main driver of deep learning workload cost!
  • #5: Now let's take a closer look at GPU inference in production. (CLICK) First, GPUs today offer the best performance in terms of latency and throughput, primarily due to their phenomenal parallel execution: the Nvidia Volta V100 is equipped with 640 tensor cores, each performing 64 floating-point fused multiply-add (FMA) operations per clock cycle, delivering up to 125 TFLOPS per GPU. (CLICK) However, this capacity comes at a price: GPU instances are expensive when compared to CPUs. (CLICK) What is interesting is that for most models used for inference in production, GPU utilization is in fact pretty low. Inference workloads usually handle a batch of one input, since they need to serve each request as it comes in, which is usually called online inference. This is different from training jobs, where the entire dataset is available and the batch size is in the tens or hundreds. Online inference usually means that unless you do fancy stuff in your serving system, you will leave your GPU greatly underutilized. (CLICK) And lastly, different models need different amounts of GPU, CPU, and memory resources. Selecting a GPU instance type that is big enough to satisfy the requirements of the most demanding resource often results in under-utilization of the other resources and high costs.
  • #6: When you look at what various inference workloads can drive out of GPUs, you'll see that the maximum throughput varies significantly with batch size. For smaller batch sizes, which cover the large majority of online inference use cases, you are using less than 10% of the GPU's potential. For this Inception model, at a batch size of 64 you are able to drive almost 1,000 images per second out of the GPU, compared to a maximum of 100 images per second at batch size 1. (CLICK) So for many online inference use cases, up to 90% of your infrastructure fleet can remain underutilized. This is a huge waste.
  • #7: So how do we choose between the GPU options that are available? (CLICK) If we look at how cost per inference varies for an inference workload across P2 and P3 instances, (CLICK) for most online inference use cases, with a batch size of 1, the p2.xlarge instance is much more cost effective than the p3.2xlarge instance. (CLICK) For a batch size of 64 this equation flips, where the large amount of parallel processing capacity within the p3.2xlarge makes it over 2x more cost effective. So you have choices depending on your workload and your application requirements, but customers typically opt for p2.xlarge instances for smaller batch sizes.
  • #8: We are happy to say that we have a solution for you, one that gives you a more efficient choice. It's called Amazon Elastic Inference!
  • #9: TODO: go over the text. Amazon Elastic Inference, or Amazon EI as we call it, (CLICK) helps lower the cost of running deep learning applications by up to 75%, (CLICK) by giving you amounts of GPU-powered acceleration that are right-sized for inference applications. EI helps you size the CPU, memory, and GPU acceleration for your application independently. So for your computer vision application, where you need smaller amounts of x86 CPU and memory compared to GPU acceleration, (CLICK) you can now choose a small CPU instance and attach an accelerator from a range of sizes that provides the latency your application requires, while helping you fully utilize the capacity and reduce costs. (CLICK) EI is integrated with EC2 in a way that lets you apply acceleration flexibly without additional infrastructure management, and EI is integrated with SageMaker, which lets you reduce costs and have a fully managed experience.
  • #10: EI accelerators are available today in three sizes, each with an amount of single-precision FP32 and mixed-precision FP16 compute capacity and an amount of accelerator memory. These are available at prices that are significantly cheaper than GPU instances. The smallest accelerator costs just 13 cents an hour, compared to a 90-cent/hr p2.xlarge (4.4 FP32 TOPS) instance, or worse, a $3/hr p3.2xlarge (15 FP32 TOPS) instance. EI is now available in Northern Virginia, Ohio, and Oregon in the US, Dublin in the EU, and Tokyo and Seoul in the Asia Pacific regions.
  • #11: So the big question is: with smaller amounts of acceleration than a full GPU, what does this mean for performance? Well, I'm glad to tell you that in many cases EI accelerators perform better than the p2.xlarge GPU instance. (CLICK) Let's look at inferences per second for a few computer vision models for image classification and object detection. The grey bars are the number of inferences per second using EI accelerator sizes paired with a small EC2 instance. Compare this to the pink bar, which is the p2.xlarge instance. The p3.2xlarge in orange provides a higher throughput for applications that need it, (CLICK) but at a much higher cost. If you look at the EI accelerator and instance combination prices, ranging from 22 cents to 61 cents an hour compared to the p3 instance at $3/hr, the difference is significant, and it is certainly not the 5x throughput you would expect at 5x the cost.
  • #12: Let's look at how EI accelerators work. Accelerators are made available behind an AWS PrivateLink VPC endpoint in the same Availability Zone as your EC2 instance, and they provide dedicated capacity for your instance. Launching EI accelerators with instances is easy: it is simply a new configuration flag in the RunInstances API or CLI command, or you can use the console to add acceleration to your instance configuration when you launch your instance.
  • #13: SageMaker fully manages this experience for you, so you can prototype deployments with notebooks (CLICK) and deploy models to SageMaker endpoints with cost-effective EI acceleration instead of standalone GPU instances.
  • #14: TODO: update the slide to show that ONNX can take in various formats. Let's talk about the types of models we support. EI supports TensorFlow, Apache MXNet, and ONNX models. (CLICK) The service provides EI-enabled TensorFlow Serving and Apache MXNet software packages that let you deploy these model types with ZERO code changes. (CLICK) The EI-enabled packages automatically discover the presence of an accelerator and efficiently offload operations within your model to run on the attached accelerator. You can find these packages within the Deep Learning AMIs, or download them via S3.
  • #21: Demands on CPU compute resources, CPU memory, GPU-based acceleration, and GPU memory vary significantly between different types of deep learning models. The latency and throughput requirements of the application also determine the amount of instance compute and Amazon EI acceleration you need. Consider the following when you choose an instance and accelerator type combination for your model: Before you evaluate the right combination of resources for your model or application, you should determine the target latency and throughput needs for your overall application stack, as well as any constraints you may have. For example, if your application needs to respond within 300 milliseconds (ms), and data retrieval (including any authentication) and pre-processing takes 200ms, you have a 100ms window to work with for the inference request. Using this analysis, you can determine the lowest-cost infrastructure combination that meets these targets. Start with a reasonably small combination of resources, for example a c5.xlarge instance type along with an eia1.medium accelerator type. This combination has been tested to work well for various computer vision workloads (including a large version of ResNet: ResNet-200) and gives comparable or better performance than a p2.xlarge instance. You can then size up on the instance or accelerator type depending on your latency targets. Since Amazon EI accelerators are attached over the network, input/output data transfer between instance and accelerator also adds to inferencing latency. Using a larger size for either or both instance and accelerator may reduce data transfer time, and therefore reduce overall inferencing latency. If you load multiple models to your accelerator (or the same model from multiple application processes on the instance), you may need a larger accelerator size for both the compute and memory needs on the accelerator. You can convert your model to mixed precision, which utilizes the higher FP16 TFLOPS of Amazon EI (for a given size), to provide lower latency and higher performance.
  • #22: TODO: simplify this slide