Legion - AI Runtime Platform

CONFIDENTIAL
LEGION:
AI RUNTIME PLATFORM
April, 2019
http://legion-platform.org/

The Ultimate Question of Life, the Universe, and Everything
HOW TO USE ML MODELS IN PRODUCTION SYSTEMS?

THE CHALLENGE
LIFECYCLE MANAGEMENT
• Organizations need a streamlined process to build and deploy ML models from the DS lab to production
• Multiple model versions are needed to support different audience segments, experiments, and hypotheses
• A single golden source for storing model artifacts is required
• Each production model must be traceable back to its source
MACHINE LEARNING RUNTIME
• Rewriting a model for the production environment is an error-prone and lengthy process
• User-facing systems need a response latency of 200 ms or less (20 ms in AdTech)
• Machine learning toolkits are optimized for interactive development and faster training, not latency
• Tracking, deploying, and upgrading multiple ML frameworks and their dependencies is a complicated process
RESILIENCE & SCALING
• Data science focuses on outcomes, not engineering excellence
• Many ML libraries are aggressively extended at the cost of stability
• A failed model can frequently be replaced by a simple baseline algorithm
PERFORMANCE MONITORING
• New models must be verified before promotion to production
• The performance of ML models may degrade in operation
• Rapid prototyping needs real-world data access and fast feedback
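The resilience point about replacing a failed model with a simple baseline can be illustrated with a small wrapper. This is a minimal sketch, not Legion code: `with_fallback`, `broken_model`, and `mean_baseline` are hypothetical names chosen for illustration.

```python
from typing import Callable, Sequence

Predictor = Callable[[Sequence[float]], float]


def with_fallback(primary: Predictor, baseline: Predictor) -> Predictor:
    """Wrap `primary` so that any exception falls back to `baseline`."""
    def predict(features: Sequence[float]) -> float:
        try:
            return primary(features)
        except Exception:
            # Containment: a failing model must not take the caller down.
            return baseline(features)
    return predict


def broken_model(features: Sequence[float]) -> float:
    # Hypothetical model whose backend is unavailable.
    raise RuntimeError("model backend unavailable")


def mean_baseline(features: Sequence[float]) -> float:
    # Trivial baseline: the mean of the input features.
    return sum(features) / len(features)


predict = with_fallback(broken_model, mean_baseline)
print(predict([1.0, 2.0, 3.0]))  # 2.0
```

In a real deployment the fallback would more likely sit in the traffic router or the serving layer, so that callers never see the failure.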

PRODUCT HYPOTHESIS
MOVE FAST, FAIL FAST (AND RECORD YOUR MOVES)
Build a flexible environment for fast and transparent delivery of models and resilience to failures.
• High development velocity enables experimentation and faster delivery of results
• Inevitable human mistakes are contained, identified quickly, and traced back to the source
• Strong reliance on a specific technology limits options and therefore degrades velocity
• Clear and strong feedback facilitates skill improvement and optimal technology choices

HOW-TO
• Unified environment both in research and in production
– Avoid code rewrites
– No framework-imposed requirements on model structure
– Prevent migration and communication issues
– Keep the learning curve under control
• Smooth out-of-the-box CI/CD environment
– Integrated model quality control
– Full traceability (datasets, code, hyper-parameters)
– Scheduled retraining and regression testing
• Integrated feedback loop
– Input and output capturing
– Real-time performance evaluation and monitoring
– A/B testing and active traffic management
• Resilient and scalable open-source ML platform
– Cloud-agnostic and ML-toolchain-agnostic
– Integration with the most popular choices (Spark, Sklearn, TensorFlow, R)
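One way to picture full traceability (datasets, code, hyper-parameters) is a manifest of content hashes stored next to each model artifact. The sketch below is an assumption about how such a record could look, not Legion's actual format; `build_manifest` and its field names are hypothetical.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_manifest(dataset: bytes, code: bytes, hyper_params: dict) -> dict:
    """Record what a model was built from, plus a single fingerprint."""
    # Canonical JSON so that equal parameters always hash the same way.
    params_blob = json.dumps(hyper_params, sort_keys=True).encode("utf-8")
    manifest = {
        "dataset_sha256": sha256_hex(dataset),
        "code_sha256": sha256_hex(code),
        "hyper_params": hyper_params,
    }
    combined = "|".join([
        manifest["dataset_sha256"],
        manifest["code_sha256"],
        sha256_hex(params_blob),
    ])
    manifest["model_fingerprint"] = sha256_hex(combined.encode("utf-8"))
    return manifest


manifest = build_manifest(b"x,y\n1,2\n", b"model.fit(X, Y)", {"depth": 3, "lr": 0.1})
print(manifest["model_fingerprint"])
```

Two builds with the same dataset, code, and hyper-parameters get the same fingerprint, so a production model can be matched back to its inputs.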

WORKFLOW
[Diagram: model workflow across three environments]
Environments:
• Development Environment (Local Machine / Legion Enclave)
• Training Environment (Legion Core)
• Execution Environment (Legion Enclave)
Stages:
• Development: data preparation, fitting, cross-validation, evaluation
• Training: feature & model selection, fitting, hyper-parameter tuning, evaluation
• Compilation: dependencies integration, server code integration, portable artefact
• Exploitation: labeling, model application, traffic split, logging
• Evaluation: monitoring and self-healing, outcome-based metrics, code performance
Artifacts: Dev Dataset, Training Dataset, Training Report, Portable Model, Docker Image, Binary Model, Base Image, Dependencies, Input/Output Logs, Outcome Logs, Performance Report
Source control: GitLab, Git-Flow

DATA FLOW
• A model is a set of sources
– Training script
– Build scenario
– Tests
• The training process produces a Docker image
– Self-contained
– Portable
• Model instances are spawned on demand inside Legion enclaves
– Security isolation
– Failure isolation
– Resource allocation control
[Diagram: data flow. Components shown:
• Data Scientist: .ipynb, Jenkinsfile, Model
• GIT and Jenkins
• Build container running the training script: sparkContext.rdf.count(), model.fit(X,Y), toolchain.export(model)
• Docker Repository: Docker Image with Binary Model
• Release Engineer
• Legion: Test Enclave (v2, v1, Test) behind an HTTP Traffic Router
• Legion: PROD Enclave (K, L, M) behind an HTTP Traffic Router
• Monitoring Plane: Training Log, Metrics, Prometheus, Grafana]

LEGION ARCHITECTURE
[Diagram: Legion architecture. Components shown:
• The Cloud: Kubernetes, Amazon S3, Amazon EFS (NFS v4.1), Amazon EBS, RDBMS, Identity Provider
• Legion Core: Git, Nexus, Jenkins, Airflow, Airflow Worker (ETL Jobs), Airflow Worker (Model Jobs), Grafana / Prometheus / Statsd, Fluentd, Legion Web Console
• Legion Enclave: Kubernetes Ingress, Legion Ingress (Nginx / LuaJIT), Model X Service (Legion Pymodel, Model Code), Model Y, Prometheus
• Inputs: API Traffic, Unstructured Web Traffic
• Legend, component types: Legion, Open-Source, Specific
• Legend, flows: Feedback, Logs, OpLogs, Traffic, Data, Credentials, Control]

MULTITENANCY
• Each tenant is created from a Kubernetes Helm chart
• Each tenant is placed into a separate namespace and exposes a separate HTTP endpoint for access to model APIs
• Network isolation
• AWS IAM role-based authorization
• All tenants are managed through a central dashboard

Monitoring – Grafana
• Open analytics platform
• Metrics collection and visualization
– Training and test errors
– Response time distribution
– Throughput
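The response time distribution and throughput metrics named on this slide can be derived from raw request latencies. The sketch below uses only the Python standard library and is not tied to Grafana or Prometheus; `summarize` and its output keys are hypothetical.

```python
import statistics
from typing import Dict, Sequence


def summarize(latencies_ms: Sequence[float], window_seconds: float) -> Dict[str, float]:
    """Summarize request latencies observed during one time window."""
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        "throughput_rps": len(latencies_ms) / window_seconds,
    }


# 100 synthetic requests (1 ms to 100 ms) observed over a 10 second window.
summary = summarize([float(i) for i in range(1, 101)], window_seconds=10.0)
print(summary["throughput_rps"])  # 10.0
```

Percentiles are usually more informative than the mean for latency targets such as the 200 ms budget mentioned earlier, because a small share of slow requests can hide behind a good average.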

Build Manager - Jenkins
• Runs model training jobs and produces model images
• Keeps training metrics, summaries, and version history
• Git-flow for model input files

TOOLCHAIN INTEGRATION - PYMODEL
• Toolchain support is implemented by adding a Python package
• Each toolchain package provides 3 routines:
– export() to serialize a model to a file
– build() to produce a Docker image from a model file
– serve() to expose a binary model as an HTTP RESTful service
[Diagram: Pymodel toolchain package. Elements shown:
• Training script in the Jenkins build container: model.init(), model.fit(X,Y), toolchain.export(model)
• export() serialises the in-memory model into a file
• legionctl build … builds the microservice Docker image containing the binary model
• Python package: export(), build(), serve()
• HTTP RESTful API]
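The three routines of a toolchain package can be sketched as plain Python functions. This is an illustrative assumption about their shape, not the Legion API: the signatures, the JSON file format, the `LinearModel` toy class, the `load` helper, the base image name, and the `toolchain.py` file name are all hypothetical, and `build()` here only returns Dockerfile text instead of invoking Docker.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class LinearModel:
    """Toy stand-in for a trained model (hypothetical, not a Legion class)."""

    def __init__(self, weights, bias):
        self.weights = list(weights)
        self.bias = float(bias)

    def predict(self, row):
        return sum(w * x for w, x in zip(self.weights, row)) + self.bias


def export(model, path):
    """Serialize the in-memory model to a file (JSON, for simplicity)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"weights": model.weights, "bias": model.bias}, fh)


def load(path):
    """Read a model file written by export()."""
    with open(path, "r", encoding="utf-8") as fh:
        data = json.load(fh)
    return LinearModel(data["weights"], data["bias"])


def build(model_path, base_image="python:3-slim"):
    """Return Dockerfile text that packages the model file with this module."""
    return "\n".join([
        f"FROM {base_image}",
        f"COPY {model_path} /app/model.json",
        "COPY toolchain.py /app/toolchain.py",
        'CMD ["python", "/app/toolchain.py"]',
    ])


def serve(model_path, host="127.0.0.1", port=8080):
    """Create an HTTP server that answers POST requests with predictions."""
    model = load(model_path)

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            payload = json.loads(self.rfile.read(length))
            result = {"prediction": model.predict(payload["features"])}
            body = json.dumps(result).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    # The caller starts the service with .serve_forever().
    return HTTPServer((host, port), Handler)
```

A training script would end with `export(model, "model.json")`, and the container entry point would call `serve("model.json").serve_forever()`. Keeping the three routines behind one small interface is what lets a new framework be supported by adding a package, as the slide describes.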

TOOLCHAIN INTEGRATION – APACHE SPARK
• Apache Spark is not a good match for runtime model execution
– Distributed processing framework
– High latency
– Large number of dependencies
• Combust MLeap Runtime for Spark models
– Apache v2 License
– Spark/PySpark/Sklearn support
– Customizable ML data pipelines
• Legion provides lifecycle management services
– CI/CD & Testing
– Monitoring and performance management
– Self-healing
[Diagram: Spark toolchain package. Elements shown:
• Spark Driver: Spark.MLLib, Combust MLeap JAR, export()
• export() serialises the in-memory model into a Protobuf
• legionctl build… and the Python package build() in the Jenkins build container
• Microservice Docker image: binary model, JAR, Combust MLeap Runtime
• HTTP RESTful API]
NEXT STEPS
• Hardened security
– Identity mapping
– Model API access control
• Additional toolchains:
– TensorFlow
– Spark.ML and Spark.MLLib
• Cluster management web console

THE ROADMAP
AS IS CODE PROMOTION
• Single-line model export
• Automatic capture of dependencies
• Translation to a portable model description
MICROSERVICE ARCHITECTURE
• A model is transformed into a portable service with a RESTful interface
• Easy integration of custom models and frameworks
• Risk-free changes and upgrades of the toolchain
• A single model failure does not impact the whole system
CONTINUOUS PERFORMANCE MEASUREMENT
• Feedback events recording
• A/B testing with audience or traffic share assignment
• Continuous performance measurement
• Historical metrics recording and threshold alerting
ENTERPRISE READY
• Secure perimeter with integrated authentication
• Continuous Integration/Continuous Delivery process for models
• Artifact repository and build history
FLEXIBLE DEPLOYMENT ARCHITECTURE
• AWS cloud or on premises
• Automatic scaling to accommodate workload
• No platform or vendor lock-in
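A/B testing with traffic share assignment is often done by hashing a stable identifier, so that the same user always reaches the same model version. The sketch below shows that general technique; it is not taken from Legion's router, and `assign_variant` and the variant names are hypothetical.

```python
import hashlib
from typing import Dict


def assign_variant(user_id: str, shares: Dict[str, float]) -> str:
    """Deterministically map a user to a variant; shares should sum to 1.0."""
    if not shares:
        raise ValueError("at least one variant is required")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a point in [0, 1).
    point = int.from_bytes(digest[:8], "big") / 2 ** 64
    cumulative = 0.0
    for variant, share in shares.items():
        cumulative += share
        if point < cumulative:
            return variant
    # Guard against floating-point rounding at the top of the range.
    return variant


shares = {"v1": 0.9, "v2": 0.1}
print(assign_variant("user-42", shares))
```

Because the assignment depends only on the identifier and the shares, no per-user state has to be stored, and changing a share moves only the users near the boundary.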
