AUTOMATED DEEP LEARNING TRAINING
WITH
AWS STEP FUNCTIONS / AWS LAMBDA
@mizti
PROBLEM WITH
DEEP LEARNING TRAINING
1.
SERVER WITH GPU
REQUIRED
SERVERS WITH GPU ARE
EXPENSIVE
• Several thousand dollars / month with an on-demand instance
• Spot instances (bidding system): much cheaper, but still not a
negligible price for me
NOT A NEGLIGIBLE PRICE?
• It costs about as much as 1 or 2 “Tirol choco”* per server per
hour
• Not much, but I still worry about it…
* WELL KNOWN IN
JAPAN AS A BYWORD
FOR CHEAP CONFECTIONERY
AND
IT TAKES A VERY LONG TIME
Half a day, a day,
occasionally several days
SO
I WANT TO TERMINATE
SERVERS
ONCE TRAINING
IS COMPLETED
PROBLEM WITH
DEEP LEARNING TRAINING
2.
ANNOYING COLLECTION OF
TRAINED DATA
WITH ONE SERVER,
IT TAKES ONLY A FEW
MINUTES
WITH SCP
WITH MANY SERVERS,
IT TAKES A LONG TIME
WHAT IS WORSE, WE DON’T KNOW WHEN
EACH TASK COMPLETES
ON EACH SERVER
AND I GET CONFUSED
“WHAT WAS THE SETTING FOR THIS
SERVER?”
IN THE END, I TERMINATE
A SERVER
WITHOUT EXTRACTING
ITS DATA
SO
I WANT TO GATHER DATA INTO
ONE PLACE AUTOMATICALLY
AND I WANT TO LABEL IT WITH THE TRAINING
CONDITIONS…
SERVER-LESS
ARCHITECTURE
• Serverless computing (as I understand it) means:
• Create servers when I need them, terminate them once the task
is completed
• No server is used to control the above
• Thus, I normally don’t need to keep any server running,
and can create as many servers as I need, whenever I
need them
• (Becoming a buzzword these days?)
SERVER-LESS SERVICES IN AWS
• AWS Lambda
• Users can register code written in Node.js / Python /
Java / C#
• Registered code can be hooked to events
from inside AWS (and can be invoked by hand,
of course)
• Users can automate control of AWS with the AWS SDK
for each language (like boto3 for Python)
• No special libraries are needed for AWS Lambda;
in other words, AWS Lambda is just a mechanism for
registering and starting code
• One Lambda invocation can stay alive only for a limited
time (the maximum timeout is 15 minutes), so AWS Lambda
is not suitable for
long-running / many-state jobs.
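A minimal sketch of what "registering code" means for Python, assuming the standard handler signature of a function that receives an event and a context. The key `exec_name` is taken from the usage slides; the `message` field and the sample value are hypothetical and only for illustration.

```python
import json


def lambda_handler(event, context):
    """Entry point that AWS Lambda calls with the triggering event.

    `event` is the JSON input already parsed into a dict; `context`
    carries runtime information and is not used in this sketch.
    """
    # Name of this training execution (hypothetical default if absent).
    exec_name = event.get("exec_name", "unnamed")

    # A real function would call the AWS SDK (e.g. boto3) here.
    # This sketch only echoes the input with one extra field.
    result = dict(event)
    result["message"] = "received execution '{}'".format(exec_name)
    return result


if __name__ == "__main__":
    # Invocation by hand; Lambda itself would supply both arguments.
    print(json.dumps(lambda_handler({"exec_name": "mnist-test-01"}, None)))
```

The returned dict is what a following step would receive as its input, which is why the sketch passes the original event through unchanged.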
SERVER-LESS SERVICES IN AWS
• AWS Step Functions
• Users can define multi-state machines (somewhat like a
“cellular automaton”)
• Branching / parallel processes can also be
defined
• Each state passes input data to, and receives output
data from, AWS Lambda functions.
• You can check the status of each state (process)
visually in the web UI.
• Users can control long-running / multi-path
processes
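As a rough illustration of the fork / parallel bullet, a state machine is described in Amazon States Language, which is plain JSON. The sketch below builds such a definition as a Python dict. The state names, function names and the ARN pattern are placeholders (assumptions), not taken from the author's repository.

```python
import json


def lambda_arn(function_name):
    # Placeholder ARN pattern; REGION and ACCOUNT_ID must be filled in.
    return "arn:aws:lambda:REGION:ACCOUNT_ID:function:" + function_name


def single_task_branch(state_name, function_name):
    """One branch of a Parallel state, consisting of a single Task."""
    return {
        "StartAt": state_name,
        "States": {
            state_name: {
                "Type": "Task",
                "Resource": lambda_arn(function_name),
                "End": True,
            }
        },
    }


definition = {
    "Comment": "Two branches run in parallel, then one task follows",
    "StartAt": "RunBoth",
    "States": {
        "RunBoth": {
            "Type": "Parallel",
            "Branches": [
                single_task_branch("TrainA", "train_a"),
                single_task_branch("TrainB", "train_b"),
            ],
            "Next": "Notify",
        },
        "Notify": {
            "Type": "Task",
            "Resource": lambda_arn("notify"),
            "End": True,
        },
    },
}

if __name__ == "__main__":
    # This JSON text is what would be registered as the definition.
    print(json.dumps(definition, indent=2))
```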
WHAT I WANTED TO MAKE:
1. Create an S3 bucket for each execution
2. Bid for a spot instance
3. If the bid succeeds and a spot instance is launched,
• Notify with AWS SNS (email or SMS)
• Prepare for training (downloading training data, etc.)
• Start training
• Periodically upload model dumps / output data / logs to the S3 bucket
4. Once training is completed
• Notify with AWS SNS (email or SMS)
• Terminate the instance after a certain period of time
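Steps 2 and 3 could look roughly like the sketch below. It assumes a client object that behaves like `boto3.client("ec2")` and uses its `request_spot_instances` and `describe_spot_instance_requests` calls. The bid price, image ID and instance type are hypothetical, and `FakeEC2` is only an offline stand-in so the sketch runs without an AWS account; this is not necessarily how the author's code does it.

```python
import base64


def request_spot_instance(ec2, bid_price, image_id, instance_type, user_data):
    """Place a one-time spot request and return its request ID."""
    response = ec2.request_spot_instances(
        SpotPrice=str(bid_price),
        InstanceCount=1,
        Type="one-time",
        LaunchSpecification={
            "ImageId": image_id,
            "InstanceType": instance_type,
            # The launch specification expects base64-encoded user data.
            "UserData": base64.b64encode(
                user_data.encode("utf-8")
            ).decode("ascii"),
        },
    )
    return response["SpotInstanceRequests"][0]["SpotInstanceRequestId"]


def get_instance_id(ec2, request_id):
    """Return the instance ID once the request is fulfilled, else None."""
    response = ec2.describe_spot_instance_requests(
        SpotInstanceRequestIds=[request_id]
    )
    request = response["SpotInstanceRequests"][0]
    if request["State"] == "active":
        return request.get("InstanceId")
    return None


class FakeEC2:
    """Offline stand-in for an EC2 client (an assumption of this sketch)."""

    def __init__(self):
        self.launch_specification = None

    def request_spot_instances(self, **kwargs):
        self.launch_specification = kwargs["LaunchSpecification"]
        return {
            "SpotInstanceRequests": [{"SpotInstanceRequestId": "sir-example"}]
        }

    def describe_spot_instance_requests(self, SpotInstanceRequestIds):
        return {
            "SpotInstanceRequests": [
                {
                    "SpotInstanceRequestId": SpotInstanceRequestIds[0],
                    "State": "active",
                    "InstanceId": "i-example",
                }
            ]
        }


if __name__ == "__main__":
    ec2 = FakeEC2()  # swap in boto3.client("ec2") for real use
    request_id = request_spot_instance(
        ec2, 0.5, "ami-example", "g2.2xlarge", "#!/bin/bash\necho start\n"
    )
    print(request_id, get_instance_id(ec2, request_id))
```

Passing the client in as an argument keeps each function short enough to live in its own Lambda function and makes it testable without AWS.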
I MADE:
Create S3 bucket
Request Spot Instance
Check if the bidding succeeded
Notify bidding success
Check if the task has completed
Wait for the task to complete
Notify task completed
Terminate Spot instance
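The eight states above could be wired together roughly as follows, again as an Amazon States Language definition built from a Python dict. This is a sketch under assumptions: the Choice states, the extra wait after an unfulfilled bid, the variable names `$.bid_succeeded` / `$.task_completed`, the wait durations and the function names are all made up for illustration and may differ from the actual definition in the repository.

```python
import json


def lambda_arn(function_name):
    # Placeholder ARN pattern; REGION and ACCOUNT_ID must be filled in.
    return "arn:aws:lambda:REGION:ACCOUNT_ID:function:" + function_name


def task(function_name, next_state=None):
    """A Task state that calls one Lambda function."""
    state = {"Type": "Task", "Resource": lambda_arn(function_name)}
    if next_state is None:
        state["End"] = True
    else:
        state["Next"] = next_state
    return state


def choice(variable, if_true, otherwise):
    """A Choice state that branches on a boolean field of the input."""
    return {
        "Type": "Choice",
        "Choices": [
            {"Variable": variable, "BooleanEquals": True, "Next": if_true}
        ],
        "Default": otherwise,
    }


def wait(seconds, next_state):
    return {"Type": "Wait", "Seconds": seconds, "Next": next_state}


definition = {
    "Comment": "Sketch of the training workflow",
    "StartAt": "CreateS3Bucket",
    "States": {
        "CreateS3Bucket": task("create_s3_bucket", "RequestSpotInstance"),
        "RequestSpotInstance": task("request_spot_instance", "CheckBid"),
        "CheckBid": task("check_bid", "BidSucceeded"),
        "BidSucceeded": choice(
            "$.bid_succeeded", "NotifyBidSuccess", "WaitForBid"
        ),
        "WaitForBid": wait(60, "CheckBid"),
        "NotifyBidSuccess": task("notify", "CheckTask"),
        "CheckTask": task("check_task", "TaskCompleted"),
        "TaskCompleted": choice(
            "$.task_completed", "NotifyTaskCompleted", "WaitForTask"
        ),
        "WaitForTask": wait(300, "CheckTask"),
        "NotifyTaskCompleted": task("notify", "TerminateSpotInstance"),
        "TerminateSpotInstance": task("terminate_spot_instance"),
    },
}


def dangling_targets(defn):
    """Return transition targets that do not name an existing state."""
    states = defn["States"]
    targets = {defn["StartAt"]}
    for state in states.values():
        if "Next" in state:
            targets.add(state["Next"])
        if "Default" in state:
            targets.add(state["Default"])
        for rule in state.get("Choices", []):
            targets.add(rule["Next"])
    return sorted(targets - set(states))


if __name__ == "__main__":
    assert dangling_targets(definition) == []
    print(json.dumps(definition, indent=2))
```

The check / wait loop is what lets a job that runs for days be supervised by Lambda functions that each finish within seconds.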
USAGE
• Input a JSON document like the one below to start the Step Functions execution
• exec_name: name of this execution (also becomes the name of the S3 bucket)
• repository url: git repository of the code to execute (used like git clone {repository url})
• data_dir / output_dir: directories for the training data and the output data
• data_get_command: command executed before training (typically, fetching the training data for
machine learning)
• exec_command: command executed for training
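To make the fields concrete, here is a hypothetical input and a sketch of how it might be turned into the commands run on the instance. The key spelling `repository_url`, all sample values and the `work` directory are assumptions; the real input format is defined by the author's repository. For brevity the sketch uploads the output once at the end with `aws s3 sync` rather than periodically.

```python
import json
import shlex

REQUIRED_KEYS = (
    "exec_name",
    "repository_url",
    "data_dir",
    "output_dir",
    "data_get_command",
    "exec_command",
)

# Hypothetical example values, not taken from the slides.
example_input = {
    "exec_name": "mnist-test-01",
    "repository_url": "https://example.com/someone/training-code.git",
    "data_dir": "data",
    "output_dir": "result",
    "data_get_command": "python get_data.py",
    "exec_command": "python train.py --epoch 20",
}


def missing_keys(params):
    """List the required keys that are absent from the input."""
    return [key for key in REQUIRED_KEYS if key not in params]


def build_bootstrap_script(params):
    """Build the shell script an instance could run at startup."""
    missing = missing_keys(params)
    if missing:
        raise ValueError("missing keys: " + ", ".join(missing))
    bucket_url = "s3://" + params["exec_name"] + "/"
    lines = [
        "#!/bin/bash",
        "git clone {} work".format(shlex.quote(params["repository_url"])),
        "cd work",
        params["data_get_command"],   # fetch the training data
        params["exec_command"],       # run the training itself
        "aws s3 sync {} {}".format(
            shlex.quote(params["output_dir"]), shlex.quote(bucket_url)
        ),
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(json.dumps(example_input, indent=2))
    print(build_bootstrap_script(example_input))
```

Because the training command is passed in as data, the training code itself does not need to know anything about the workflow around it.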
USAGE
Input the JSON, and…
USAGE
Just push "Start Execution"
USAGE
• Progress can be checked in the web UI
• Output is automatically uploaded to the
S3 bucket.
BENEFIT
• Start and forget. Sleep peacefully.
• Makes it easy to run many hyperparameter
patterns in parallel
• No need to modify the training / model code
• Could probably also be used for many kinds of
batch-like processes
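The parallel hyperparameter runs could be prepared as in the sketch below: one input document per combination, each with its own `exec_name` so that results land in separate buckets. The function name `expand_grid`, the command template and the parameter names `lr` / `batch` are hypothetical.

```python
import itertools
import json


def expand_grid(base_input, command_template, grid):
    """Create one input dict per combination of hyperparameter values."""
    names = sorted(grid)
    combinations = itertools.product(*(grid[name] for name in names))
    inputs = []
    for index, values in enumerate(combinations):
        settings = dict(zip(names, values))
        params = dict(base_input)
        # A distinct name per run keeps the outputs apart.
        params["exec_name"] = "{}-{:03d}".format(
            base_input["exec_name"], index
        )
        params["exec_command"] = command_template.format(**settings)
        inputs.append(params)
    return inputs


if __name__ == "__main__":
    base = {"exec_name": "mnist", "exec_command": ""}
    grid = {"lr": [0.1, 0.01], "batch": [32, 64]}
    template = "python train.py --lr {lr} --batch {batch}"
    for params in expand_grid(base, template, grid):
        # Each line is the JSON input for one execution.
        print(json.dumps(params))
```

Each resulting document could then be passed as the input of a separate execution, for example through the `input` argument of boto3's Step Functions `start_execution` call, so that every pattern gets its own instance and its own bucket.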
MISC
• Author: @mizti
Any comments / questions are welcome
• Details: written up on my blog (but in Japanese ;)
http://mizti.hatenablog.com/entry/deeplearningwithawsstepfunction
• Code repository:
https://github.com/mizti/aws_stepfunc_chainer
• Illustrations in these slides:
http://www.irasutoya.com/
