An Event-Driven Machine Learning Platform Orchestrated with Step Functions and AWS Batch

Serverless Conf 2017
11/03 2017
Yu Yamada
Net Business Division
Data Platform Team
Takayuki Tsutsumi
IT Services & Payments Sector
System Platform Technology Department

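Self-introduction
■Yu Yamada
Recruit Lifestyle Co., Ltd.
Net Business Division
Data Platform Team
Twitter:@nii_yan
GitHub:https://github.com/yu-yamada
・Previously worked on everything from building an email marketing platform to data analysis
・Currently responsible for the overall development and operation of Recruit Lifestyle's shared analytics platform
・Likes big data, Ruby, beer, and cup yakisoba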

Services operated by Recruit Lifestyle

80%
Share of their time that platform engineers spend on operations

Ideal ratio
Development: 70%
Operations: 20%
Other: 10%

Trip AI Concierge system overview diagram

Product overview
• It is something the company sells as a product
• It may remain in use for a long time
• Features may be added in the future

Requirements for the machine learning platform


Machine learning pipelines
[Diagram: on-premises, Data load, Machine learning, State control, CloudTrail, CloudWatch, Monitoring]

Limited interface
[Diagram: the machine learning pipeline]

Fully managed workflow
[Diagram: the machine learning pipeline]

Scalable batch
[Diagram: the machine learning pipeline]

Visualize
[Diagram: the machine learning pipeline]

Infrastructure as code
[Diagram: the machine learning pipeline]

Monitoring
[Diagram: the machine learning pipeline]

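© 2017 NTT DATA Corporation
Self-introduction
Takayuki Tsutsumi
NTT DATA Corporation
IT Services & Payments Sector
System Platform Division
Career
• Web application development
• Data platform development and operation / batch development
• ETL / batch processing frameworks
• Stream processing
Working hard every day to build systems that users, operators, and developers can all use comfortably
Likes chocolate and beer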

Machine Learning Pipeline
[Diagram: on-premises, Data load, Machine learning, State control, CloudTrail, CloudWatch, Monitoring]

Components of Pipelines
Interface

Scheduler or Triggers
Scheduled Task / Polling / Event Trigger

Interface
Interface / Processing / Interface

Batch Processing with Container
Batch
On Demand
Scalable
AWS Batch

AWS Batch
Submit Job: specify the Job's CPU count / memory
Runnable: the “optimal” EC2 Instance starts
Running: the Job Container runs
Succeeded / Failed: finished
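As a minimal sketch of the Submit Job step, the request below passes the CPU count and memory as resource requirements of the boto3 `submit_job` call. The job name, queue name, and job definition name are hypothetical placeholders, not values from the talk.

```python
def build_submit_job_request(job_name, job_queue, job_definition, vcpus, memory_mib):
    """Build keyword arguments for the AWS Batch SubmitJob API."""
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "containerOverrides": {
            # AWS Batch expects both values as strings; memory is in MiB.
            "resourceRequirements": [
                {"type": "VCPU", "value": str(vcpus)},
                {"type": "MEMORY", "value": str(memory_mib)},
            ]
        },
    }


def submit(batch_client, request):
    """Submit the job and return its ID (batch_client is a boto3 Batch client)."""
    return batch_client.submit_job(**request)["jobId"]


# Hypothetical names, for illustration only.
request = build_submit_job_request(
    "ml-train", "ml-queue", "ml-train-def", vcpus=8, memory_mib=24 * 1024
)
```

With AWS credentials configured, `submit(boto3.client("batch"), request)` would send the request; the job then waits in Runnable until a suitable instance is available.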

Specify the Job's CPU count / memory
The “optimal” EC2 Instance starts
Job (CPU count, memory) -> EC2 (type, CPU count, memory)
CPU: 8, memory: 24 GiB -> Type: m4.2xlarge, CPU: 8, memory: 32 GiB
CPU: 8, memory: 500 GiB -> Type: r4.16xlarge?, CPU: 64, memory: 488 GiB
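The two cases above can be illustrated with a small "smallest instance that fits" sketch. This is not AWS Batch's actual selection algorithm, and the catalog holds only the two instance types named on the slide. Reading the "?" as "no instance fits" is an assumption.

```python
from typing import NamedTuple, Optional


class InstanceType(NamedTuple):
    name: str
    vcpus: int
    memory_gib: int


# Only the two instance types mentioned on the slide.
CATALOG = [
    InstanceType("m4.2xlarge", 8, 32),
    InstanceType("r4.16xlarge", 64, 488),
]


def pick_instance(vcpus, memory_gib, catalog=CATALOG) -> Optional[InstanceType]:
    """Return the smallest instance that satisfies the job, or None if none fits."""
    candidates = [
        instance for instance in catalog
        if instance.vcpus >= vcpus and instance.memory_gib >= memory_gib
    ]
    return min(candidates, key=lambda i: (i.vcpus, i.memory_gib), default=None)
```

Under this sketch a job asking for 8 CPUs and 24 GiB lands on m4.2xlarge, while a job asking for 500 GiB fits nothing in the catalog, because r4.16xlarge offers only 488 GiB.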

Step Functions
Workflow
Scalable
Managed
Event Driven
Control AWS Batch

Event Driven
Data -> S3 -> Lambda -> Step Functions -> Batch
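A minimal sketch of the Lambda step in this chain, assuming the function is subscribed to S3 event notifications and starts one Step Functions execution per uploaded object. The input shape (`bucket`, `key`) and the helper names are hypothetical; `start_execution` is the boto3 Step Functions call.

```python
import json
from urllib.parse import unquote_plus


def execution_input(record):
    """Turn one S3 event record into the JSON input for a state machine execution."""
    bucket = record["s3"]["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = unquote_plus(record["s3"]["object"]["key"])
    return json.dumps({"bucket": bucket, "key": key})


def make_handler(sfn_client, state_machine_arn):
    """Create a Lambda handler that starts one execution per uploaded object."""
    def handler(event, context):
        arns = []
        for record in event.get("Records", []):
            response = sfn_client.start_execution(
                stateMachineArn=state_machine_arn,
                input=execution_input(record),
            )
            arns.append(response["executionArn"])
        return arns
    return handler
```

In a deployed function, `sfn_client` would be `boto3.client("stepfunctions")`. Passing it in keeps the handler testable without AWS access.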

AWS Step Functions & Batch
State Machine
Submit
Get Status
Loop

Micro State Machine
Pre-processing (Data Load)
Processing (Machine Learning)

Relay Step Functions
Step Functions + Batch -> Batch Results -> Step Functions + Batch

Event Driven with Lambda
Trigger: an S3 Event invokes Lambda
Execution: start succeeded / start failed / multiple starts

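Event Driven with Lambda
Failures & Solutions
Trigger: an S3 Event invokes Lambda
Failure: start failed -> Solution: re-run
Failure: multiple starts -> Solution: prevent multiple starts, or make multiple starts harmless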

Retry when Execution Failed
Reliable Lambda execution using a DLQ
Polling
DLQ
CloudWatch Events
Event
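One possible reading of this slide is a scheduled function that polls the DLQ and re-invokes the failed Lambda. The sketch below assumes the DLQ is an SQS queue whose message body is the original event payload; the function and parameter names are hypothetical.

```python
def redrive(sqs_client, lambda_client, queue_url, function_name, max_messages=10):
    """Re-invoke the target Lambda once for each message waiting in the DLQ."""
    response = sqs_client.receive_message(
        QueueUrl=queue_url, MaxNumberOfMessages=max_messages
    )
    retried = 0
    for message in response.get("Messages", []):
        lambda_client.invoke(
            FunctionName=function_name,
            InvocationType="Event",  # asynchronous, like the original trigger
            Payload=message["Body"].encode("utf-8"),
        )
        # Delete only after the re-invocation was accepted.
        sqs_client.delete_message(
            QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
        )
        retried += 1
    return retried
```

A retried event can reach the pipeline more than once, so this approach relies on the duplicate-start handling on the following slides.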

Preventing Multiple Starts
Manage state in DynamoDB
Conditional
Put Item
Update Item
Batch Status
State Control DB
Start Execution
DON’T Start
CAN’T Put
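The Conditional Put Item step can be sketched with a boto3 DynamoDB `Table` resource as below. The attribute names `execution_key` and `batch_status` are hypothetical, not the schema used in the talk.

```python
def try_acquire(table, execution_key):
    """Return True if this caller registered the key first, False if it already exists."""
    try:
        table.put_item(
            Item={"execution_key": execution_key, "batch_status": "STARTED"},
            # The put succeeds only when no item with this key exists yet.
            ConditionExpression="attribute_not_exists(execution_key)",
        )
        return True
    except Exception as error:
        # botocore's ClientError carries the service error code in .response.
        code = getattr(error, "response", {}).get("Error", {}).get("Code")
        if code == "ConditionalCheckFailedException":
            return False  # CAN'T Put, so DON'T start
        raise
```

A caller would start the Step Functions execution only when `try_acquire` returns True. Because the condition is evaluated atomically by DynamoDB, two concurrent Lambda invocations cannot both succeed.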

Support Idempotent Batch
Implement idempotent Batch Jobs
Results stay correct even when a Job is started multiple times
Upsert
Unique Object name
Get Latest Object
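One way to read these three techniques is sketched below. The output name is derived deterministically from the input, so a re-run overwrites the same object (an upsert), and readers pick the newest object when several exist. The naming scheme and the in-memory `store` are illustrative assumptions, not the talk's implementation.

```python
import hashlib


def output_key(input_key, prefix="results/"):
    """Derive a stable, unique output name from the input object name."""
    digest = hashlib.sha256(input_key.encode("utf-8")).hexdigest()[:16]
    return f"{prefix}{digest}.json"


def run_job(store, input_key, result):
    """Write the result under its deterministic key; re-runs overwrite (upsert)."""
    store[output_key(input_key)] = result


def latest_object(objects):
    """Pick the most recently modified entry from an S3-style object listing."""
    return max(objects, key=lambda obj: obj["LastModified"], default=None)
```

The `objects` argument mirrors the `Contents` entries returned by an S3 object listing, each with `Key` and `LastModified`.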

Monitoring: Alerts
CloudWatch Logs
Log monitoring
Route logs to Lambda with filters
Detect ERROR logs
Subscription Filter
Info
Alert
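A Lambda function behind a CloudWatch Logs subscription filter receives the log events base64-encoded and gzip-compressed. The sketch below decodes them and splits the messages into alert and info. The slide suggests the routing is done by the filters themselves, so classifying inside one function is a simplification.

```python
import base64
import gzip
import json


def decode_log_events(event):
    """Decode the payload that CloudWatch Logs delivers to a subscribed Lambda."""
    payload = gzip.decompress(base64.b64decode(event["awslogs"]["data"]))
    return json.loads(payload)


def classify(event):
    """Split log messages into alerts (containing ERROR) and info."""
    data = decode_log_events(event)
    result = {"alert": [], "info": []}
    for log_event in data["logEvents"]:
        level = "alert" if "ERROR" in log_event["message"] else "info"
        result[level].append(log_event["message"])
    return result
```

In practice the subscription filter pattern can already restrict delivery to ERROR lines, in which case the function only needs to forward what it receives.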

Monitoring: Alerts
Batch status monitoring
Detect Jobs that stay in Runnable for a long time
Submit
Runnable: the “optimal” EC2 Instance starts
Running: the Job Container runs
Succeeded / Failed
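Detecting Jobs stuck in Runnable could look like the sketch below, which lists RUNNABLE jobs in a queue with boto3's `list_jobs` and flags those created longer ago than a threshold. The threshold and the function name are hypothetical, and pagination via `nextToken` is omitted for brevity.

```python
import time


def stuck_runnable_jobs(batch_client, job_queue, max_wait_seconds, now_ms=None):
    """Return IDs of jobs that have been RUNNABLE longer than max_wait_seconds."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    response = batch_client.list_jobs(jobQueue=job_queue, jobStatus="RUNNABLE")
    return [
        job["jobId"]
        for job in response["jobSummaryList"]
        # createdAt is a Unix timestamp in milliseconds.
        if now_ms - job["createdAt"] > max_wait_seconds * 1000
    ]
```

A job that requests more memory than any instance type in the compute environment can provide would show up here, since it never leaves Runnable.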

Monitoring: Alerts
Step Functions start monitoring
Detect when no execution has started for a set period
Data -> S3 -> Lambda -> Step Functions -> Batch
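A minimal sketch of the start check, assuming it runs on a schedule and uses boto3's `list_executions`, which returns the most recent execution first. The allowed idle period is a hypothetical parameter.

```python
from datetime import datetime, timedelta, timezone


def is_stale(sfn_client, state_machine_arn, max_idle, now=None):
    """Return True when no execution has started within max_idle (a timedelta)."""
    now = now or datetime.now(timezone.utc)
    executions = sfn_client.list_executions(
        stateMachineArn=state_machine_arn, maxResults=1
    )["executions"]
    if not executions:
        return True  # never started at all
    return now - executions[0]["startDate"] > max_idle
```

For example, `is_stale(client, arn, timedelta(hours=6))` would return True when the newest execution started more than six hours ago, and the caller could then raise an alert.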

Monitoring: Visualize Batch Status
DynamoDB -> Streams -> ES

[Diagram: Objects, S3, Lambda, Step Functions, Batch, DynamoDB, CloudTrail, CloudWatch, Monitoring]

Monitoring
[Diagram: on-premises, Data load, Machine learning, State control, CloudTrail, CloudWatch]

We are looking for people to build the platform with us!!!
http://engineer.recruit-lifestyle.co.jp/recruiting/

Happy serverless development!!
