© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Amazon Elastic Inference
Reduce deep learning inference costs by 75%
Hagay Lupesko, Amazon AI
Agenda
• GPU inference in production – characteristics and challenges
• Amazon Elastic Inference – a cost-effective and flexible approach
• Coding inference with Elastic Inference
• Demo
You will learn how to start using Elastic Inference and reduce your deep learning inference cost by ~75%.
Deep learning workload cost
[Chart: inference ~90% of cost, training ~10%]
GPU inference in production
A closer look at GPU utilization for inference
[Chart: images per second vs. batch size (1 to 64) for an Inception model on a single GPU. Throughput reaches nearly 1,000 images/sec at batch size 64 but only about 100 images/sec at batch size 1; the GPU is roughly 90% underutilized for single-batch-size inference.]
A closer look at GPU cost for inference
P2 instances are more cost-effective for online inference with small batch sizes.
How can we get cost-effective acceleration?
Introducing
• 75% reduced cost
• 1 to 32 TFLOPS
• Integrated with Amazon EC2
• Integrated with Amazon SageMaker
• Support for TensorFlow
• Support for MXNet & ONNX
Acceleration sizes tailored for inference
Accelerator Type | FP32 TOPS | FP16 TOPS | Accelerator Memory (GB) | Price ($US/hr)
eia1.medium      |     1     |     8     |            1            |     $0.13
eia1.large       |     2     |    16     |            2            |     $0.26
eia1.xlarge      |     4     |    32     |            4            |     $0.52
Available in N. Virginia, Ohio, Oregon, Dublin, Tokyo, and Seoul
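To put the table in perspective, here is a quick back-of-the-envelope cost comparison against standalone GPU instances, using the accelerator prices above and the GPU prices quoted in the speaker notes; the c5.large and p3.2xlarge on-demand prices are assumptions for illustration.

# Rough hourly cost comparison: small CPU host + EI accelerator vs. GPU instances.
# The c5.large and p3.2xlarge prices are assumed placeholders; the EI and p2.xlarge
# prices come from the table above and the speaker notes.
C5_LARGE = 0.085          # assumed on-demand $/hr for the CPU host
EIA1_MEDIUM = 0.13        # eia1.medium $/hr (from the table above)
P2_XLARGE = 0.90          # p2.xlarge $/hr (from the speaker notes)
P3_2XLARGE = 3.06         # roughly the $3/hr quoted in the speaker notes

combo = C5_LARGE + EIA1_MEDIUM
print('c5.large + eia1.medium: $%.2f/hr' % combo)
print('vs p2.xlarge:  %.0f%% cheaper' % (100 * (1 - combo / P2_XLARGE)))
print('vs p3.2xlarge: %.0f%% cheaper' % (100 * (1 - combo / P3_2XLARGE)))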
Inference Performance with EI vs GPU
[Charts: inferences per second for several computer vision models (image classification and object detection), comparing EI accelerator sizes paired with a small EC2 instance against p2.xlarge and p3.2xlarge GPU instances.]
How does Elastic Inference work with Amazon EC2?
[Diagram: an EC2 instance inside a VPC accessing a dedicated EI accelerator through an AWS PrivateLink endpoint in the same Availability Zone of the Region.]
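Launching an accelerator alongside an instance is just an extra parameter on the EC2 RunInstances call. Below is a minimal boto3 sketch of that launch; the AMI ID, key pair, subnet, and security group are placeholders, and it assumes the subnet already has the AWS PrivateLink endpoint for Elastic Inference and an instance role with the required EI permissions.

# Minimal sketch: launch a small CPU instance with an attached eia1.medium accelerator.
# All resource IDs below are placeholders; the subnet is assumed to have an AWS
# PrivateLink VPC endpoint for Elastic Inference configured.
import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

response = ec2.run_instances(
    ImageId='ami-0123456789abcdef0',            # e.g. a Deep Learning AMI (placeholder ID)
    InstanceType='c5.large',                    # small CPU host paired with the accelerator
    MinCount=1,
    MaxCount=1,
    KeyName='my-key-pair',                      # placeholder key pair
    SubnetId='subnet-0123456789abcdef0',        # placeholder subnet
    SecurityGroupIds=['sg-0123456789abcdef0'],  # placeholder security group
    ElasticInferenceAccelerators=[{'Type': 'eia1.medium'}],
)
print(response['Instances'][0]['InstanceId'])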
How does Elastic Inference work with SageMaker?
SageMaker Notebooks
SageMaker Hosted Endpoints
Framework Support
• Amazon EI-enabled TensorFlow Serving
• Amazon EI-enabled Apache MXNet
• ONNX (applied using Apache MXNet; see the sketch below)
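To make the ONNX path concrete, here is a hedged sketch of running an ONNX model through the EI-enabled Apache MXNet package; the model file name, input name ('data'), and input shape are assumptions for illustration, and the actual input name comes from the ONNX graph itself.

# Hedged sketch: import an ONNX model with MXNet's ONNX importer and run it on the
# attached EI accelerator. File name, input name, and shape are assumed placeholders.
import mxnet as mx
from mxnet.contrib import onnx as onnx_mxnet

# Convert the ONNX graph into an MXNet symbol plus parameters
sym, arg_params, aux_params = onnx_mxnet.import_model('resnet50.onnx')

# Bind against the EI accelerator context provided by the EI-enabled MXNet build
mod = mx.mod.Module(symbol=sym, context=mx.eia(), label_names=None)
mod.bind(for_training=False, data_shapes=[('data', (1, 3, 224, 224))])
mod.set_params(arg_params, aux_params, allow_missing=True)

# Inference looks the same as on CPU or GPU
out = mod.predict(mx.nd.random.uniform(shape=(1, 3, 224, 224)))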
Let’s look at some code!
Inference with Apache MXNet on CPU
# Loading a resnet-152 model from the local filesystem
sym, arg_params, aux_params = mx.model.load_checkpoint('resnet-152', 0)
# Initializing the module with a CPU context
mod = mx.mod.Module(symbol=sym, context=mx.cpu(), label_names=None)
# Binding the model for inference
mod.bind(for_training=False, data_shapes=[('data', (1,3,224,224))],
label_shapes=mod._label_shapes)
# Loading the weights
mod.set_params(arg_params, aux_params, allow_missing=True)
# Loading and pre-processing the image
img = get_image(...)
# And finally calling inference
mod.predict(mx.nd.array(img))
Inference with Apache MXNet on Elastic Inference
# Loading a resnet-152 model from the local filesystem
sym, arg_params, aux_params = mx.model.load_checkpoint('resnet-152', 0)
# Initializing the module with an EIA context
mod = mx.mod.Module(symbol=sym, context=mx.eia(), label_names=None)
# Binding the model for inference
mod.bind(for_training=False, data_shapes=[('data', (1,3,224,224))],
label_shapes=mod._label_shapes)
# Loading the weights
mod.set_params(arg_params, aux_params, allow_missing=True)
# Loading and pre-processing the image
img = get_image(...)
# And finally calling inference
mod.predict(mx.nd.array(img))
Inference with SageMaker and Apache MXNet
from sagemaker.mxnet.model import MXNetModel
# Load a pre-trained model from s3
sagemaker_model = MXNetModel(model_data = 's3://.../model.tar.gz', ...)
# Initialize a predictor
predictor = sagemaker_model.deploy(
initial_instance_count=1,
instance_type='ml.m4.xlarge')
Inference with SageMaker and Apache MXNet
Accelerated by Elastic Inference
from sagemaker.mxnet.model import MXNetModel
# Load a pre-trained model from s3
sagemaker_model = MXNetModel(model_data = 's3://.../model.tar.gz', ...)
# Initialize a predictor
predictor = sagemaker_model.deploy(
initial_instance_count=1,
instance_type='ml.m4.xlarge',
accelerator_type='ml.eia1.medium')
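Once the endpoint is up, invoking it looks the same with or without the accelerator attached. A hypothetical usage sketch follows; the random array is only a stand-in for a real preprocessed image.

# Hypothetical usage of the predictor created above; the random payload stands in
# for a preprocessed image batch of shape (1, 3, 224, 224).
import numpy as np

payload = np.random.rand(1, 3, 224, 224).tolist()
result = predictor.predict(payload)
print(result)

# Delete the endpoint when you are done experimenting to stop incurring cost
predictor.delete_endpoint()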
Demo (https://bit.ly/2CDTtV9)
Managed model hosting with:
- Amazon SageMaker
- Elastic Inference
- Apache MXNet
Wifi: GBDC Zettabytes 2019 / gdbc2019
How to choose the EI size you need?
Considerations for choosing instance type and accelerator:
• Latency requirements -> EI TFLOPs
• Model size -> EI Memory
• Input/output data payload has an impact on latency
• Convert to FP16 for lower latency and higher throughput
• Experiment: try it out and measure! (See the latency-measurement sketch below.)
Start small and size up as needed
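As a starting point for "try it out and measure", the sketch below times repeated predictions against the MXNet Module from the earlier code slides; mod and img are assumed to already be defined, the model is assumed to have a single output, and a warm-up loop absorbs one-off setup cost before latencies are recorded.

# Hedged latency-measurement sketch; assumes `mod` and `img` from the earlier
# MXNet code slides are already in scope and the model has a single output.
import time
import mxnet as mx

# Warm up: the first calls can include one-off model loading/initialization cost
for _ in range(10):
    mod.predict(mx.nd.array(img)).wait_to_read()

latencies_ms = []
for _ in range(100):
    start = time.perf_counter()
    out = mod.predict(mx.nd.array(img))
    out.wait_to_read()  # block until the asynchronous result is actually ready
    latencies_ms.append((time.perf_counter() - start) * 1000)

latencies_ms.sort()
print('p50: %.1f ms  p90: %.1f ms' % (latencies_ms[49], latencies_ms[89]))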
Summary
• Use EI to reduce your deep learning inference costs by up to 75%
• Launch with any EC2 instance type
• Host for inference with Amazon SageMaker for a fully managed experience
• Deploy TensorFlow, MXNet, and ONNX models with minimal code changes
To get started with Amazon Elastic Inference, visit:
https://aws.amazon.com/machine-learning/elastic-inference/
Thank you!
To get started with Amazon Elastic Inference, visit:
https://aws.amazon.com/machine-learning/elastic-inference/
Editor's Notes
  • #2: <introduction> Before we start, with a show of hands: - Who is an AWS customer? - Who runs deep learning workloads for training or inference? - Who has heard of, or uses, Amazon Elastic Inference?
  • #4: When we look at the costs associated with running deep learning, or more generally machine learning, we can divide them into two large categories: training models and making predictions with them, also known as inference. Training cost typically scales with two key factors. The first is model complexity: the more complex and expressive your model is, the more time and resources it usually takes to train. The second is the amount of training data used: the more data you use for training, the more compute and storage capacity you will need. For inference, cost scales with other factors, first and foremost your user base. As your user base grows, so will your inference calls in production! Other factors are all related to maintaining a service in production: availability and redundancy and global distribution all require setting up hosts that can serve inference requests. It turns out that for most customers, inference workloads account for approximately 90% of the overall cost of deep learning and machine learning, and are the main driver of deep learning workload cost!
  • #5: Now let's take a closer look at GPU inference in production. (CLICK) First, GPUs today offer the best performance in terms of latency and throughput, primarily due to their phenomenal parallel execution: the Nvidia Volta V100 is equipped with 640 tensor cores, each performing 64 floating-point fused multiply-add (FMA) operations per clock cycle, delivering up to 125 TFLOPS per GPU. (CLICK) However, this capacity comes at a price: GPU instances are expensive when compared to CPUs. (CLICK) What is interesting is that for most models used for inference in production, GPU utilization is in fact pretty low. Inference workloads usually handle a batch of one input, since they need to serve each request as it comes in, which is usually called online inference. This is different from training jobs, where the entire dataset is available and the batch size is in the tens or hundreds. Online inference usually means that unless you do fancy stuff in your serving system, you will leave your GPU greatly underutilized. (CLICK) And lastly, different models need different amounts of GPU, CPU, and memory resources. Selecting a GPU instance type that is big enough to satisfy the requirements of the most demanding resource often results in under-utilization of the other resources and high costs.
  • #6: When you look at what various inference workloads can drive out of GPUs, you'll see that the maximum throughput varies significantly with batch size. For smaller batch sizes, which cover the large majority of online inference use cases, you are using less than 10% of the GPU's potential. For this Inception model, at a batch size of 64 you are able to drive almost 1,000 images per second out of the GPU, compared to a maximum of 100 images per second at batch size 1. (CLICK) So for many online inference use cases, up to 90% of your infrastructure fleet can remain underutilized. This is a huge waste.
  • #7: So how do we choose between the GPU options that are available? (CLICK) If we look at how cost per inference varies for an inference workload across P2 and P3 instances, (CLICK) for most online inference use cases, with a batch size of 1, the p2.xlarge instance is much more cost effective than the p3.2xlarge instance. (CLICK) For a batch size of 64 this equation flips, where the large amount of parallel processing capacity within the p3.2xlarge makes it over 2x more cost effective. So you have choices depending on your workload and your application requirements, but customers typically opt for p2.xlarge instances for smaller batch sizes.
  • #8: We are happy to say that we have a solution for you, one that gives you a more efficient choice. It's called Amazon Elastic Inference!
  • #9: TODO: go over the text. Amazon Elastic Inference, or Amazon EI as we call it, (CLICK) helps lower the cost of running deep learning applications by up to 75%, (CLICK) by giving you amounts of GPU-powered acceleration that are right-sized for inference applications. EI helps you size the CPU, memory, and GPU acceleration for your application independently. So for your computer vision application, where you need smaller amounts of x86 CPU and memory compared to GPU acceleration, (CLICK) you can now choose a small CPU instance and attach an accelerator from a range of sizes that provides the latency your application requires, while helping you fully utilize the capacity and reduce costs. (CLICK) EI is integrated with EC2 in a way that lets you apply acceleration flexibly without additional infrastructure management, and EI is integrated with SageMaker, which lets you reduce costs and have a fully managed experience.
  • #10: EI accelerators are available today in three sizes, each with an amount of single-precision FP32 and mixed-precision FP16 compute capacity and an amount of accelerator memory. These are available at prices that are significantly cheaper than GPU instances. The smallest accelerator costs just 13 cents an hour, compared to a 90-cent/hr p2.xlarge (4.4 FP32 TOPS) instance, or worse, a $3/hr p3.2xlarge (15 FP32 TOPS) instance. EI is now available in Northern Virginia, Ohio, and Oregon in the US, Dublin in the EU, and Tokyo and Seoul in the Asia Pacific regions.
  • #11: So the big question is: with smaller amounts of acceleration than a full GPU, what does this mean for performance? Well, I'm glad to tell you that in many cases EI accelerators perform better than the p2.xlarge GPU instance. (CLICK) Let's look at inferences per second for a few computer vision models for image classification and object detection. The grey bars are the number of inferences per second using EI accelerator sizes paired with a small EC2 instance. Compare this to the pink bar, which is the p2.xlarge instance. The p3.2xlarge in orange provides a higher throughput for applications that need it, (CLICK) but at a much higher cost. If you look at the EI accelerator and instance combination prices, ranging from 22 cents to 61 cents an hour compared to the p3 instance at $3/hr, the difference is significant, and it is certainly not the 5x throughput you would expect at 5x the cost.
  • #12: Let's look at how EI accelerators work. Accelerators are made available behind an AWS PrivateLink VPC endpoint in the same Availability Zone as your EC2 instance, and they provide dedicated capacity for your instance. Launching EI accelerators with instances is easy: it is simply a new configuration flag in the RunInstances API or CLI command, or you can use the console to add acceleration to your instance configuration when you launch your instance.
  • #13: SageMaker fully manages this experience for you, so you can prototype deployments with notebooks (CLICK) and deploy models to SageMaker endpoints with cost-effective EI acceleration instead of standalone GPU instances.
  • #14: TODO: update the slide to show that ONNX can take in various formats. Let's talk about the types of models we support. EI supports TensorFlow, Apache MXNet, and ONNX models. (CLICK) The service provides EI-enabled TensorFlow Serving and Apache MXNet software packages that let you deploy these model types with ZERO code changes. (CLICK) The EI-enabled packages automatically discover the presence of an accelerator and efficiently offload operations within your model to run on the attached accelerator. You can find these packages within the Deep Learning AMIs, or download them via S3.
  • #21: Demands on CPU compute resources, CPU memory, GPU-based acceleration, and GPU memory vary significantly between different types of deep learning models. The latency and throughput requirements of the application also determine the amount of instance compute and Amazon EI acceleration you need. Consider the following when you choose an instance and accelerator type combination for your model: Before you evaluate the right combination of resources for your model or application, you should determine the target latency and throughput needs for your overall application stack, as well as any constraints you may have. For example, if your application needs to respond within 300 milliseconds (ms), and data retrieval (including any authentication) and pre-processing takes 200ms, you have a 100ms window to work with for the inference request. Using this analysis, you can determine the lowest-cost infrastructure combination that meets these targets. Start with a reasonably small combination of resources, for example a c5.xlarge instance type along with an eia1.medium accelerator type. This combination has been tested to work well for various computer vision workloads (including a large version of ResNet: ResNet-200) and gives comparable or better performance than a p2.xlarge instance. You can then size up on the instance or accelerator type depending on your latency targets. Since Amazon EI accelerators are attached over the network, input/output data transfer between instance and accelerator also adds to inferencing latency. Using a larger size for either or both instance and accelerator may reduce data transfer time, and therefore reduce overall inferencing latency. If you load multiple models to your accelerator (or the same model from multiple application processes on the instance), you may need a larger accelerator size for both the compute and memory needs on the accelerator. You can convert your model to mixed precision, which utilizes the higher FP16 TFLOPS of Amazon EI (for a given size), to provide lower latency and higher performance.
  • #22: TODO: simplify this slide