LEGION: AI RUNTIME PLATFORM
April 2019
http://legion-platform.org/
The Ultimate Question of Life, the Universe, and Everything
HOW TO USE ML MODELS IN PRODUCTION SYSTEMS?
THE CHALLENGE
LIFECYCLE MANAGEMENT
• Organizations need a streamlined process to build and deploy ML models from the DS lab to production
• Multiple model versions are needed to support different audience segments, experiments, and hypotheses
• A single golden source for storing model artifacts is required
• Each production model must be traceable back to its source
MACHINE LEARNING RUNTIME
• Rewriting a model for the production environment is an error-prone and lengthy process
• User-facing systems need response latency of 200 ms or less (20 ms in AdTech)
• Machine learning toolkits are optimized for interactive development and faster training, not for low latency
• Tracking, deploying, and upgrading multiple ML frameworks and their dependencies is a complicated process
RESILIENCE & SCALING
• Data science focuses on outcomes, not engineering excellence
• Many ML libraries are aggressively extended at the cost of stability
• A failed model can frequently be replaced by a simple baseline algorithm
PERFORMANCE MONITORING
• New models must be verified before they are placed into production
• Performance of ML models may degrade in operation
• Rapid prototyping needs real-world data access and fast feedback
PRODUCT HYPOTHESIS
MOVE FAST, FAIL FAST (AND RECORD YOUR MOVES)
Build a flexible environment for fast and transparent delivery of models, with resilience to failures.
• High development velocity enables experimentation and faster result delivery
• Inevitable human mistakes are contained, identified quickly, and traced back to their source
• Strong reliance on a specific technology limits options and therefore degrades velocity
• Clear and strong feedback facilitates skillset improvement and optimal technology choices
HOW-TO
• A unified environment for both research and production
• Avoid code rewrites
• No framework-imposed model structure requirements
• Prevent migration and communication issues
• Keep the learning curve under control
• Smooth out-of-the-box CI/CD environment
• Integrated model quality control
• Full traceability (datasets, code, hyper-parameters)
• Scheduled retraining and regression testing
• Integrated feedback loop
• Input and output capturing
• Real-time performance evaluation and monitoring
• A/B testing and active traffic management
• Resilient and scalable open-source ML platform
• Cloud-agnostic and ML-toolchain-agnostic
• Integration with the most popular choices (Spark, Sklearn, Tensorflow, R)
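To make the "input and output capturing" bullet concrete, here is a minimal, framework-agnostic sketch of a feedback-capture wrapper around a model's predict call. The JSON-lines log file and the predict signature are assumptions for illustration; Legion's actual feedback loop is not defined by this snippet.

```python
import json
import time
import uuid
from functools import wraps


def capture_io(log_path="model_io.jsonl"):
    """Decorator that records each request's input, output, and latency as a JSON line."""
    def decorator(predict_fn):
        @wraps(predict_fn)
        def wrapper(features):
            started = time.time()
            prediction = predict_fn(features)
            record = {
                "request_id": str(uuid.uuid4()),
                "features": features,
                "prediction": prediction,
                "latency_ms": round((time.time() - started) * 1000, 2),
            }
            with open(log_path, "a") as f:
                f.write(json.dumps(record) + "\n")
            return prediction
        return wrapper
    return decorator


@capture_io()
def predict(features):
    # Stand-in for a real model call.
    return sum(features)


print(predict([1.0, 2.5, 3.5]))
```

Captured records of this shape are what a feedback loop can later join with outcome labels for real-time performance evaluation.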
WORKFLOW
[Workflow diagram] Development Environment (Local Machine / Legion Enclave) → Training Environment (Legion Core) → Execution Environment (Legion Enclave). Stages:
• Development: data preparation, fitting, cross-validation, evaluation (against the dev dataset)
• Training: feature & model selection, fitting, hyper-parameter tuning, evaluation (against the training dataset, producing a training report)
• Compilation: dependencies integration, server code integration, portable artefact (portable model packaged as a Docker image from a base image plus dependencies, containing the binary model)
• Exploitation: labeling, model application, traffic split, logging, monitoring and self-healing (producing input/output logs and outcome logs)
• Evaluation: outcome-based metrics, code performance (performance report)
Source control: GitLab with Git-Flow.
DATA FLOW
• A model is a set of sources
  – Training script
  – Build scenario
  – Tests
• The training process produces a Docker image
  – Self-contained
  – Portable
• Model instances are spawned on demand inside Legion enclaves
  – Security isolation
  – Failure isolation
  – Resource allocation control
[Data flow diagram] A Data Scientist commits the model sources (.ipynb, Jenkinsfile) to Git. Jenkins runs the training in a build container (e.g. sparkContext.rdf.count(), model.fit(X,Y), toolchain.export(model)) and pushes a Docker image containing the binary model to the Docker repository; training logs and metrics flow to the monitoring plane (Prometheus, Grafana). A Release Engineer promotes images into the Legion Test enclave (versions v1, v2, and Test behind an HTTP traffic router) and then into the Legion PROD enclave (models K, L, M behind an HTTP traffic router), both serving HTTP traffic.
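For illustration, a client call to a deployed model instance through the enclave's HTTP traffic router might look like the sketch below. The host name, path, and payload schema are hypothetical; the actual endpoint layout is defined by the Legion deployment, not by this snippet.

```python
import requests

# Hypothetical endpoint behind the enclave's HTTP traffic router;
# the real URL scheme comes from the concrete Legion deployment.
ENDPOINT = "https://legion-prod.example.com/api/model/demo-model/invoke"

payload = {"features": {"age": 42, "country": "DE", "visits_last_30d": 7}}

# A 200 ms timeout mirrors the latency budget mentioned on the challenge slide.
response = requests.post(ENDPOINT, json=payload, timeout=0.2)
response.raise_for_status()
print(response.json())
```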
LEGION ARCHITECTURE
[Architecture diagram] Legion Core runs on Kubernetes and combines open-source components (Kubernetes Ingress, Legion Ingress on Nginx/LuaJIT, Git, Nexus, Jenkins, Airflow, Grafana/Prometheus/Statsd, Fluentd) with Legion-specific services (Legion Web Console, Legion Pymodel). Legion Enclaves host the model services (Model X, Model Y: Legion Pymodel plus model code) and Airflow workers for ETL jobs and model jobs. The cloud layer provides Amazon S3, Amazon EFS (NFSv4.1), Amazon EBS, an RDBMS, and an identity provider. The diagram distinguishes flows for API and unstructured web traffic, data, credentials, control, feedback, operational logs, and metrics (Prometheus).
MULTITENANCY
• Each tenant is created by a Kubernetes Helm chart service
• Each tenant is placed into a separate namespace and exposes a separate HTTP endpoint for model API access
• Network isolation
• AWS IAM role-based authorization
• All tenants are managed through a central dashboard
Monitoring – Grafana
• Open analytics platform
• Metrics collection and visualization
  – Training and test errors
  – Response time distribution
  – Throughput
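The deck does not show how metrics reach Grafana; since the architecture slide lists Statsd and Prometheus, a model service could emit metrics roughly as sketched below. The metric names and the use of the generic `statsd` Python client are illustrative assumptions, not Legion's actual instrumentation API.

```python
from statsd import StatsClient  # generic statsd client (pip install statsd), not a Legion API

# The prefix is illustrative; statsd forwards these metrics towards the Grafana dashboards.
metrics = StatsClient(host="localhost", port=8125, prefix="legion.model.demo_model")


def predict_with_metrics(model, features):
    """Record response-time distribution and throughput around a prediction call."""
    with metrics.timer("response_time"):   # source of the latency histogram
        prediction = model.predict([features])
    metrics.incr("requests")               # throughput counter
    return prediction
```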
Build Manager – Jenkins
• Runs model training jobs and produces model images
• Keeps training metrics, summaries, and version history
• Git-flow for model input files
TOOLCHAIN INTEGRATION – PYMODEL
• Toolchain support is implemented by adding a Python package
• Each toolchain package provides 3 routines:
  – export() to serialize a model to a file
  – build() to produce a Docker image from a model file
  – serve() to expose a binary model as an HTTP RESTful service
[Diagram] The training script (model.init(), model.fit(X,Y), toolchain.export(model)) runs in a Jenkins build container. The Pymodel toolchain Python package provides export(), build(), and serve(): export() serialises the in-memory model into a file, `legionctl build …` builds a microservice Docker image containing the binary model, and serve() exposes it as an HTTP RESTful API.
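The deck names the three routines but does not show their implementation, so here is a minimal sketch of what a toolchain package exposing export(), build(), and serve() might look like. The pickle-based serialisation, the Flask endpoint, and the docker invocation are assumptions for illustration only; they are not the actual Legion Pymodel API.

```python
"""Illustrative toolchain package skeleton: export() / build() / serve() (not the real Legion code)."""
import pickle
import subprocess

from flask import Flask, jsonify, request


def export(model, path="model.bin"):
    """Serialise an in-memory model into a file (naive pickle stand-in)."""
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return path


def build(model_path, image_tag="example-model:latest"):
    """Produce a Docker image embedding the binary model.

    In Legion this step is driven by `legionctl build ...`; shelling out to
    docker here is only a placeholder for that mechanism.
    """
    subprocess.run(
        ["docker", "build", "--build-arg", f"MODEL_FILE={model_path}", "-t", image_tag, "."],
        check=True,
    )
    return image_tag


def serve(model_path, host="0.0.0.0", port=5000):
    """Expose the binary model as an HTTP RESTful service."""
    with open(model_path, "rb") as f:
        model = pickle.load(f)

    app = Flask(__name__)

    @app.route("/invoke", methods=["POST"])
    def invoke():
        features = request.get_json()["features"]
        return jsonify({"prediction": model.predict([features]).tolist()})

    app.run(host=host, port=port)
```

A training script then follows the pattern shown in the diagram: model.fit(X, Y) followed by toolchain.export(model), after which the build and serve steps run in CI.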
TOOLCHAIN INTEGRATION – APACHE SPARK
• Apache Spark is not a good match for runtime model execution
  – Distributed processing framework
  – High latency
  – Large number of dependencies
• Combust MLeap Runtime for Spark models
  – Apache v2 License
  – Spark/PySpark/Sklearn support
  – Customizable ML data pipelines
• Legion provides lifecycle management services
  – CI/CD & testing
  – Monitoring and performance management
  – Self-healing
[Diagram] In the Jenkins build container, the Spark toolchain package's export() uses Combust MLeap in the Spark driver to serialise the in-memory Spark.MLLib model into a Protobuf. build() (`legionctl build …`) then packages the binary model together with the Combust MLeap Runtime JAR into a microservice Docker image that exposes an HTTP RESTful API.
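As a hedged sketch of the export step, the Combust MLeap PySpark bindings can serialise a fitted Spark pipeline into a portable bundle roughly as follows. The toy data, column names, and bundle path are illustrative; this is the kind of call the Legion Spark toolchain would wrap, not a documented Legion API.

```python
# Minimal sketch, assuming pyspark and the MLeap PySpark bindings (pip install mleap).
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

import mleap.pyspark                                           # noqa: F401  (registers MLeap serialisation hooks)
from mleap.pyspark.spark_support import SimpleSparkSerializer  # noqa: F401

spark = SparkSession.builder.master("local[*]").appName("legion-spark-export").getOrCreate()

df = spark.createDataFrame([(1.0, 2.0, 5.0), (2.0, 3.0, 8.0)], ["f1", "f2", "label"])

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["f1", "f2"], outputCol="features"),
    LinearRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(df)

# Serialise the fitted pipeline into a portable MLeap bundle (Protobuf/JSON inside a zip);
# the Legion Spark toolchain's export() would wrap a step like this one.
model.serializeToBundle("jar:file:/tmp/spark_model.zip", model.transform(df))
```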
NEXT STEPS
• Hardened security
  – Identity mapping
  – Model API access control
• Additional toolchains:
  – Tensorflow
  – Spark.ML and Spark.MLLib
• Cluster management web console
THE ROADMAP
AS-IS CODE PROMOTION
• Single-line model export
• Automatic capture of dependencies
• Translation to a portable model description
MICROSERVICE ARCHITECTURE
• A model transforms into a portable service with a RESTful interface
• Easy integration of custom models and frameworks
• Risk-free changes and upgrades of the toolchain
• A single model failure does not impact the whole system
CONTINUOUS PERFORMANCE MEASUREMENT
• Feedback events recording
• A/B testing with audience or traffic share assignment (see the sketch after this list)
• Continuous performance measurement
• Historical metrics recording and threshold alerting
ENTERPRISE READY
• Secure perimeter with integrated authentication
• Continuous Integration/Continuous Delivery process for models
• Artifact repository and build history
FLEXIBLE DEPLOYMENT ARCHITECTURE
• AWS cloud or on-premises deployment
• Automatic scaling to accommodate workload
• No platform or vendor lock-in
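To make the traffic share assignment idea concrete, the sketch below (referenced from the roadmap list above) shows one common way a router could split users between two model versions deterministically. The weights, hashing scheme, and variant names are illustrative assumptions, not Legion's routing implementation.

```python
import hashlib

TRAFFIC_SHARES = {"model-v1": 0.9, "model-v2": 0.1}  # illustrative traffic split


def assign_variant(user_id: str, shares: dict = TRAFFIC_SHARES) -> str:
    """Deterministically map a user to a model version according to traffic shares."""
    # Hash the user id into [0, 1) so the same user always sees the same variant.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000 / 10_000
    cumulative = 0.0
    for variant, share in shares.items():
        cumulative += share
        if bucket < cumulative:
            return variant
    return variant  # guard against floating-point rounding


print(assign_variant("user-123"))
```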