The hidden engineering behind machine learning products at Helixa
Gianmario Spacagna
Chief Scientist at Helixa
DATA ORCHESTRATION SUMMIT 2020
In the next 20 minutes you will learn about:
1. A real ML system powering a platform used by thousands of marketers around the globe
2. The tools and engineering practices that enabled us to build fast, cheap, and robust pipelines
About Helixa
Helixa is an audience intelligence platform that uses Machine Learning to provide accurate and timely consumer insights for modern market research
Audience size: 1.5M / 223M represented population
Top Influencers: François Chollet (201x), Ben Hamner (114x), George Hotz (106x)
Top Media: Cifar News (65x), The Hacker News (31x), AngelList (28x)
Top Products and Companies: Tensorflow (107x), Waymo (66x), Airbnb Engineering (55x)
Demographics: 18-40 years old, male, U.S. and India
Platform Requirements
● Multiple datasets
● Accurate consumer insights
● Real-time analytics
● Always available
● Minimum infrastructure maintenance
● Cost effective
Helixa ML System Overview
Helixa end-to-end pipeline
Diagram components: Data Integrations, Data Processing, Common Data Model, Machine Learning jobs, Contents Embedding, Entity Resolution, Taxonomy Categorization, Users Digital DNA, Traits Classifiers, Latent Interests, Augmentation, Audience Projection, Insights Engine, Real-time analytics applications, Other Analytics Tools
Helixa architecture
Diagram components: Data Ingestions, Data Lake, ML pipelines, ML Libraries, ML Cloud Services, Pre-trained models, External APIs, Model repository, Batch Jobs, Production DB, Microservices, Analytics applications
Tech stack and tools
● Batch inference
● Model repository and evaluation metrics
● Training and hyperparameter tuning
● Analysis and Research
● ML libraries
● Data Labeling
● Feature Store
● Feature Engineering
● Data Lake
In this talk we
will focus on
The Data Lake(house)
Native Cloud Object (Data) Storage
Benefits:
● Cheaper
● Elastic
● Highly available
● Performant
Hadoop HDFS
Data Lake(house) using Glue and Athena
Artifacts are saved in S3 and crawled by Glue
Athena is used to build logical views on top of them such as:
▪ Retrieve the latest version of the artifact
▪ Aggregate multiple partitions of the same artifact
▪ Filter and merge with other tables
▪ Export snapshot of the views as versioned parquet datasets
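A "latest version" view can be sketched in SQL. Athena runs its own SQL dialect over the Glue catalog; the snippet below uses Python's built-in sqlite3 only so that it runs locally, and the table, view, and column names (features, features_latest, user_id, feature_family, ts, value) are hypothetical, not taken from the Helixa platform.

```python
import sqlite3

# Hypothetical rows mimicking a crawled artifact: one row per user,
# feature family, and artifact version (identified by its creation timestamp).
ROWS = [
    ("u1", "text_embedding", "2020-09-18-18-35", 0.10),
    ("u1", "text_embedding", "2020-10-14-12-58", 0.25),
    ("u2", "text_embedding", "2020-10-14-12-58", 0.40),
    ("u1", "category_counts", "2020-09-18-18-35", 7.0),
]

# Logical view that keeps only the newest version of each feature family.
# Zero-padded timestamps sort correctly as plain strings.
LATEST_VIEW_SQL = """
CREATE VIEW features_latest AS
SELECT f.user_id, f.feature_family, f.ts, f.value
FROM features AS f
JOIN (
    SELECT feature_family, MAX(ts) AS max_ts
    FROM features
    GROUP BY feature_family
) AS latest
  ON f.feature_family = latest.feature_family AND f.ts = latest.max_ts
"""


def build_demo_db():
    """Create an in-memory database with the demo table and the view."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE features "
        "(user_id TEXT, feature_family TEXT, ts TEXT, value REAL)"
    )
    conn.executemany("INSERT INTO features VALUES (?, ?, ?, ?)", ROWS)
    conn.execute(LATEST_VIEW_SQL)
    return conn


def latest_rows(conn):
    """Return the rows exposed by the latest-version view."""
    cur = conn.execute(
        "SELECT user_id, feature_family, ts, value "
        "FROM features_latest ORDER BY feature_family, user_id"
    )
    return cur.fetchall()
```

Consumers then query the view and never need to know which timestamp partition is current.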
Feature Store Partitions (X)
S3 bucket
❏ users
  ❏ features
    ❏ feature_family=text_embedding (partition by set of features generated by the same job)
      ❏ timestamp=2020-10-14-12-58 (creation time)
        ❏ _metadata.json (metadata containing info on how the features were created)
        ❏ part000.parquet (Parquet data indexed by user_id)
        ❏ part001.parquet
        ❏ …
      ❏ timestamp=2020-09-18-18-35
        ❏ ...
    ❏ feature_family=picture_embedding
      ❏ ...
    ❏ feature_family=category_counts
      ❏ ...
❏ items
❏ other entities
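The partition layout above can be sketched from the writer's side. This is a minimal illustration, assuming the timestamp format implied by the example paths; the function names and the metadata fields (job_name, code_version, parameters) are hypothetical and not taken from the slides.

```python
import json
from datetime import datetime

# Matches partition names such as timestamp=2020-10-14-12-58.
TIMESTAMP_FORMAT = "%Y-%m-%d-%H-%M"


def feature_partition_prefix(entity, feature_family, created_at):
    """Build the key prefix of one feature-store partition."""
    ts = created_at.strftime(TIMESTAMP_FORMAT)
    return f"{entity}/features/feature_family={feature_family}/timestamp={ts}/"


def metadata_document(job_name, code_version, parameters):
    """Serialize a (hypothetical) description of how the features were created."""
    doc = {
        "job_name": job_name,
        "code_version": code_version,
        "parameters": parameters,
    }
    return json.dumps(doc, sort_keys=True, indent=2)


def partition_object_keys(entity, feature_family, created_at, n_parts):
    """List the object keys a job would write for one partition."""
    prefix = feature_partition_prefix(entity, feature_family, created_at)
    keys = [prefix + "_metadata.json"]
    keys += [f"{prefix}part{i:03d}.parquet" for i in range(n_parts)]
    return keys
```

Because every run writes under a new timestamp, earlier versions of a feature family stay untouched and remain addressable.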
Label Store Partitions (y)
S3 bucket
❏ users
  ❏ labels
    ❏ variable=gender (partition by the variable we are trying to predict)
      ❏ source=first_name (partition by the source of ground truth)
        ❏ timestamp=2020-10-14-12-58
          ❏ _metadata.json
          ❏ part000.parquet
          ❏ part001.parquet
          ❏ …
      ❏ source=public_profile
        ❏ ...
    ❏ variable=age
❏ items
❏ other entities
Label management for weak learning done via
DATA ORCHESTRATION SUMMIT
Prediction Store Partitions (y_pred)
S3 bucket
❏ users
  ❏ predictions
    ❏ variable=gender
      ❏ model=xgbc (partition by the identifier of the model used to predict)
        ❏ timestamp=2020-11-05-17-22
          ❏ _metadata.json
          ❏ part000.parquet
          ❏ part001.parquet
          ❏ …
      ❏ model=cnn
        ❏ ...
    ❏ variable=age
      ❏ ...
❏ items
❏ other entities
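Since labels and predictions are both keyed by user and partitioned by variable, a label partition and the prediction partitions for the same variable can be joined to score each model. The sketch below shows that join with plain dictionaries standing in for the parquet data; the function name and the choice of accuracy as the metric are assumptions for illustration.

```python
def accuracy_by_model(labels, predictions_by_model):
    """Score each model on the users that have both a label and a prediction.

    labels: {user_id: label} for one variable and one ground-truth source.
    predictions_by_model: {model_id: {user_id: prediction}} for the same variable.
    Returns {model_id: accuracy}, or None when no users overlap.
    """
    scores = {}
    for model_id, predictions in predictions_by_model.items():
        shared_users = labels.keys() & predictions.keys()
        if not shared_users:
            scores[model_id] = None
            continue
        correct = sum(1 for uid in shared_users if labels[uid] == predictions[uid])
        scores[model_id] = correct / len(shared_users)
    return scores
```

For example, labels from source=first_name could be compared against the model=xgbc and model=cnn partitions of variable=gender.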
The Development
Platforms for managing the ML lifecycle
Production
● Training
● Predictions
● Model serving
● Model repository
● Experiments tracking
● Evaluation metrics
R&D
● Dev data versioning and linkage
● Automated evaluation reports
● Collaborative experiments
● Deep Learning computing environment
R&D workflow
● Dev unix machine in the cloud, with a data cache
● Notebooks and data stored and shared in S3
● Notebook name matching branch ID
● Pull
● Install the latest version of the code
● Develop code locally using professional IDEs
● Feature branches matching Jira key
● Gitflow branching model
● Commit and push
Alluxio cache configuration
● EC2 memory-optimized machines (r4 or r5 family)
● EBS volume with 250GB of storage
● Alluxio and Jupyter services start at boot time
● 200GB reserved for the Alluxio cache
● S3 buckets mounted locally in --readonly mode using the FUSE API
● Parquet data read with multi-processing using Dask directly from the local file system instead of through the S3 boto API
Alluxio benefits for R&D
Research & Development data: ~1TB
We only focus on 15% of the data every month (~150GB)
The data is re-accessed on every kernel restart (~5 times a day)
Data science team members (~5 people)
Datasets spread into files of ~120MB each
=> roughly 1.2k files and 500k read requests every month
We observed a speed-up of between 3x and 5x using Alluxio
+ all of the benefits of accessing the S3 data through the POSIX API
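The figures on this slide can be checked with a few lines of arithmetic. The sketch assumes that every kernel restart re-reads the whole monthly working set for every team member; the number of active days per month is not stated in the slides, so it is derived here from the quoted 500k reads rather than assumed.

```python
# Inputs quoted on the slide (all approximate).
MONTHLY_WORKING_SET_GB = 150
FILE_SIZE_MB = 120
RESTARTS_PER_DAY = 5
TEAM_SIZE = 5
READS_PER_MONTH = 500_000


def files_in_working_set(working_set_gb=MONTHLY_WORKING_SET_GB,
                         file_size_mb=FILE_SIZE_MB):
    """Number of files making up the monthly working set (1 GB = 1000 MB)."""
    return working_set_gb * 1000 / file_size_mb


def reads_per_day(n_files, restarts=RESTARTS_PER_DAY, people=TEAM_SIZE):
    """File reads per day if each restart re-reads every file for each person."""
    return n_files * restarts * people


def implied_active_days(reads_per_month=READS_PER_MONTH):
    """Active days per month that would produce the quoted monthly reads."""
    return reads_per_month / reads_per_day(files_in_working_set())
```

Under these assumptions the working set is 1250 files (the slide's "roughly 1.2k"), and 500k monthly reads correspond to about 16 active days.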
Processing large datasets with EMR
Picture source: https://dimensionless.in/different-ways-to-manage-apache-spark-applications-on-amazon-emr/
Ephemeral clusters on spot instances can dramatically reduce the cost of operations
+ SparkMagic
SUBMIT JOB
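An ephemeral cluster on spot instances can be described as a single request that runs one Spark step and then terminates. The sketch below only builds the request dictionary, using field names from the EMR RunJobFlow API; the cluster name, instance type, release label, role names, and script location are illustrative assumptions, not Helixa's configuration.

```python
def ephemeral_spark_cluster_request(name, script_s3_uri, core_count=4,
                                    instance_type="r5.2xlarge",
                                    release_label="emr-6.1.0"):
    """Build a RunJobFlow-style request for a cluster that runs one job and exits."""
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                # Keep the master on-demand so a spot interruption
                # does not take down the whole cluster.
                {"Name": "master", "InstanceRole": "MASTER",
                 "Market": "ON_DEMAND", "InstanceType": instance_type,
                 "InstanceCount": 1},
                # Core nodes run on the cheaper spot market.
                {"Name": "core", "InstanceRole": "CORE",
                 "Market": "SPOT", "InstanceType": instance_type,
                 "InstanceCount": core_count},
            ],
            # Ephemeral: shut the cluster down once the steps are done.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "spark-job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         script_s3_uri],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }
```

With boto3 installed and credentials configured, such a dictionary could be passed as `boto3.client("emr").run_job_flow(**request)`.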
The Deployment
Automate code with task-oriented containerized jobs
Picture source: https://medium.com/@davidstevens_16424/make-my-day-ta-science-easier-e16bc50e719c
All of the analysis findings are moved into production-quality modules, with entry points declared in makefiles for tasks such as:
● Data preparations
● Feature extractions
● Model selection / tuning
● Evaluations
● Model Inference
● Predictions post-processing
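A task-oriented entry point of this kind might look like the following sketch: one command-line argument selects the task, so a makefile target or a container command only has to pass the task name. The task names and the placeholder functions are hypothetical.

```python
import argparse


# Placeholder task implementations; real ones would live in the
# production modules extracted from the analysis notebooks.
def prepare_data():
    return "data prepared"


def extract_features():
    return "features extracted"


def run_inference():
    return "predictions written"


TASKS = {
    "prepare": prepare_data,
    "features": extract_features,
    "inference": run_inference,
}


def main(argv=None):
    """Parse the task name and run the matching function."""
    parser = argparse.ArgumentParser(description="Run one pipeline task")
    parser.add_argument("task", choices=sorted(TASKS))
    args = parser.parse_args(argv)
    return TASKS[args.task]()
```

Calling `main(["features"])` runs the feature-extraction task; the same image can therefore serve every step of the pipeline, with only the argument changing.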
Automate task execution using Continuous Integration (CI)
Picture source: https://deploybot.com/blog/the-expert-guide-to-continuous-integration
On commit
Code tests
Evaluation reports
Builds & Deployment
On release
Embarrassingly parallel data processing and batch inference with AWS Batch
Source: https://spotinst.com/blog/cost-efficient-batch-computing-on-spot-instances-aws-batch-integration/
Diagram: data batches (~ a few GBs each), jobs, output storage
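Splitting the input into independent batches of a few GBs, each handled by its own job, can be sketched as below. The greedy grouping and the function names are assumptions for illustration; AWS_BATCH_JOB_ARRAY_INDEX is the environment variable AWS Batch sets for each child of an array job.

```python
import os


def make_batches(files, max_batch_bytes):
    """Group (key, size_in_bytes) pairs, in order, into batches under a size cap.

    A single file larger than the cap still gets a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for key, size in files:
        if current and current_size + size > max_batch_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        batches.append(current)
    return batches


def batch_for_this_job(batches, environ=os.environ):
    """Pick the batch assigned to this job from its array index (0 if unset)."""
    index = int(environ.get("AWS_BATCH_JOB_ARRAY_INDEX", "0"))
    return batches[index]
```

Each job then reads only its own batch and writes its results to the output storage, so no coordination between jobs is needed.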
Model serving via microservices
SERVERLESS CHOICE
Cheap and simple solution for deploying containers without having to care about the infrastructure
Limits as of today: max 4 vCPUs and 30GB of RAM
OR
SERVERFUL CHOICE
Advanced, customizable, powerful, and widespread solutions for container orchestration on pools of AWS EC2 instances
Requires infrastructure management
How do containers scale under a varying real-time request load?
Chart: number of requests per second vs. capacity, highlighting an unexpected sudden burst and the over-provisioning cost
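The trade-off in the chart can be made concrete with a small calculation: for a fixed capacity, capacity above the load is paid for but unused, while load above the capacity goes unserved. The function below is an illustrative sketch; its name, the sampling of the load as a list, and the numbers in the usage note are assumptions.

```python
def provisioning_summary(requests_per_second, capacity):
    """Compare a load series against a fixed provisioned capacity.

    requests_per_second: observed load, one sample per time step.
    capacity: requests per second the provisioned containers can serve.
    """
    unused = sum(max(capacity - load, 0) for load in requests_per_second)
    unserved = sum(max(load - capacity, 0) for load in requests_per_second)
    served = sum(min(load, capacity) for load in requests_per_second)
    total_capacity = capacity * len(requests_per_second)
    utilization = served / total_capacity if total_capacity else 0.0
    return {
        "unused_capacity": unused,      # over-provisioning that is paid for
        "unserved_requests": unserved,  # load lost during bursts
        "utilization": utilization,
    }
```

With a load of 10, 20, 30 and then a burst of 100 requests per second against a capacity of 40, the summary shows both wasted capacity before the burst and unserved requests during it.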
Real-time serverless model serving
Diagram components:
● Training pipeline: save model, update metainfo and configs, package requirements, build and deploy
● Serving: REST request, trigger, lookup user and model info, get users features, get model, read libraries, return predictions
● EFS
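The serving path in the diagram (REST request, lookup of model info, feature retrieval, model loading, prediction) can be sketched as a single function handler. Everything here is a stand-in: the in-memory registries replace the real stores, the linear "model" replaces a trained one, and the event shape (a JSON string under "body") is an assumption modeled on HTTP proxy events.

```python
import json

# Hypothetical stand-ins for the model repository and the feature store.
MODEL_REGISTRY = {"gender": {"model_id": "xgbc", "weights": [0.5, -0.25]}}
FEATURE_STORE = {"u1": [1.0, 2.0], "u2": [0.0, 4.0]}

# Kept at module level so a warm function instance reuses loaded models.
_MODEL_CACHE = {}


def load_model(variable):
    """Load (or reuse) the model for a variable; here a toy linear scorer."""
    if variable not in _MODEL_CACHE:
        weights = MODEL_REGISTRY[variable]["weights"]
        _MODEL_CACHE[variable] = lambda features: sum(
            w * x for w, x in zip(weights, features)
        )
    return _MODEL_CACHE[variable]


def handler(event, context=None):
    """Handle one REST request and return predictions for the known users."""
    body = json.loads(event["body"])
    variable, user_ids = body["variable"], body["user_ids"]
    if variable not in MODEL_REGISTRY:
        return {"statusCode": 404,
                "body": json.dumps({"error": f"no model for {variable}"})}
    model = load_model(variable)
    predictions = {
        uid: model(FEATURE_STORE[uid]) for uid in user_ids if uid in FEATURE_STORE
    }
    return {
        "statusCode": 200,
        "body": json.dumps({
            "model": MODEL_REGISTRY[variable]["model_id"],
            "predictions": predictions,
        }),
    }
```

Caching the loaded model outside the handler is a common way to keep warm invocations fast, since only a cold start pays the loading cost.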
Comparison for real-time applications (containers on EC2 vs. serverless functions)
● Horizontal scaling: autoscaling rules based on predicted load and capacity vs. elastic, based on real-time demand
● Provisioning time: minutes vs. immediate, or seconds on a cold start
● Burst concurrency: depends on available resources vs. 3000 + an additional 500 every minute
● Cost efficiency: pay for the over-provisioning vs. only pay for what you use (10x cheaper in our use cases)
● Vertical scaling: limited by instance types vs. limited to 3GB and 2 CPUs
● Execution timeout: unlimited vs. 15 minutes
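The burst-concurrency row translates directly into a formula: 3000 concurrent executions at once, plus 500 more for every minute that passes. The sketch below encodes that rule; the optional account_limit cap and the function names are assumptions added for illustration.

```python
def burst_concurrency(minutes_since_burst, initial=3000, per_minute=500,
                      account_limit=None):
    """Concurrent executions available a given number of whole minutes into a burst."""
    if minutes_since_burst < 0:
        raise ValueError("minutes_since_burst must be non-negative")
    available = initial + per_minute * int(minutes_since_burst)
    if account_limit is not None:
        available = min(available, account_limit)
    return available


def minutes_to_reach(target, initial=3000, per_minute=500):
    """Whole minutes needed before the target concurrency becomes available."""
    if target <= initial:
        return 0
    # Ceiling division without importing math.
    return -(-(target - initial) // per_minute)
```

A sudden jump to 4200 concurrent requests would therefore be fully absorbed after three minutes, while anything up to 3000 is served immediately.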
Pick the best of both worlds
Orchestrating functions and microservices with Step Functions
Workflows defined as finite-state machines, with plug-and-play integration with most AWS services:
● AWS Batch
● ECS
● SageMaker
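A workflow of this kind is written in the Amazon States Language, a JSON document listing states and their transitions. The sketch below builds a two-state machine as a Python dictionary: a batch job followed by a function call. The state names, job name, and parameters passed in are hypothetical; the two Resource strings are the service-integration identifiers for AWS Batch (waiting for completion) and Lambda.

```python
import json


def build_state_machine(job_queue, job_definition, function_name):
    """Return a state machine definition: batch inference, then post-processing."""
    return {
        "Comment": "Illustrative hybrid workflow: batch job then function",
        "StartAt": "BatchInference",
        "States": {
            "BatchInference": {
                "Type": "Task",
                # .sync makes the workflow wait until the batch job finishes.
                "Resource": "arn:aws:states:::batch:submitJob.sync",
                "Parameters": {
                    "JobName": "batch-inference",
                    "JobQueue": job_queue,
                    "JobDefinition": job_definition,
                },
                "Next": "PostProcess",
            },
            "PostProcess": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {
                    "FunctionName": function_name,
                    "Payload.$": "$",  # forward the previous state's output
                },
                "End": True,
            },
        },
    }


def dangling_transitions(definition):
    """List transition targets that do not name a defined state."""
    states = definition["States"]
    targets = [definition["StartAt"]]
    targets += [s["Next"] for s in states.values() if "Next" in s]
    return [t for t in targets if t not in states]
```

Serializing the result with `json.dumps` gives the document that would be registered as the state machine definition.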
Hybrid solution for:
DATA ORCHESTRATION SUMMIT
Monitoring and Alerting
Centralized logging with the ELK stack
● Generate Logs
● Aggregation & Transformation
● Storage & Indexing
● Visualization & Analysis
Infrastructure Monitoring and Alerting
Basic Monitoring: AWS resources and custom metrics generated by your applications and services
General Infra Monitoring: cloud-scale monitoring of logs, metrics and traces from distributed, dynamic and hybrid infrastructure
Serverless Monitoring: all-in-one performance management tool, down to single lines of code, specifically designed for serverless applications
KPIs and Metrics
Dashboard
Data sanity checks
KPIs over time such as:
● Distribution shifts
● Model drift
● Utilization
● Coverage
Analytics dashboard on top of Athena SQL queries
Custom programmatic dashboards with interactive charts
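A distribution-shift KPI needs a number that can be tracked over time. One common choice is the population stability index (PSI), which compares the binned distribution of a feature or prediction between a reference period and the current one. The slides do not say which metric Helixa uses, so the sketch below is only an example of the idea.

```python
import math


def population_stability_index(expected_counts, actual_counts, eps=1e-6):
    """PSI between two histograms that share the same bins.

    expected_counts: bin counts from the reference period.
    actual_counts: bin counts from the current period.
    eps: floor on bin shares, avoiding a logarithm of zero for empty bins.
    """
    if len(expected_counts) != len(actual_counts):
        raise ValueError("histograms must have the same number of bins")
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    psi = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        expected_share = max(expected / expected_total, eps)
        actual_share = max(actual / actual_total, eps)
        psi += (actual_share - expected_share) * math.log(
            actual_share / expected_share
        )
    return psi
```

Identical histograms give a PSI of 0, and the value grows as the current distribution moves away from the reference, which makes it easy to plot and to alert on.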
Final Remarks
Source: https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems
Only a small fraction of real-world ML systems is composed of the ML Code.
The required surrounding infrastructure is vast and complex.
Facebook new motto in 2014
Facebook original motto
Different tools
Embrace the serverless paradigm
Download the Non-Technical Guide
Topics covered:
✅ Getting started with understanding the technology
✅ Designing the right ML product
✅ Planning under uncertainty
✅ Building a balanced ML team
www.helixa.ai/machine-learning-guide-2020
Gianmario Spacagna
Chief Scientist at Helixa
gspacagna@helixa.ai
gm_spacagna
gmspacagna
datasciencevademecum.com
