Data Science at Scale
Privileged and confidential
October 2019
Next generation data processing platforms
Solution Architect
yravlinko@griddynamics.com
About me
Yaroslav Ravlinko
Solution architect, Grid Dynamics, Lviv, Ukraine
Problem Definition
Hidden Tech Debt of ML/DS System
Configuration
Data collection
Feature extraction
Data verification
Machine resource management
Process management tools
Analysis tools
Serving infrastructure
Monitoring
ML core
Figure 1: Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small black box in the middle. The required surrounding infrastructure is vast and complex.
https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
Data Science
Configuration
Data collection
Feature extraction
Data verification
Machine resource management
Process management tools
Analysis tools
Serving infrastructure
Monitoring
ML core
+Data Engineering
Configuration
Data collection
Feature extraction
Data verification
Machine resource management
Process management tools
Analysis tools
Serving infrastructure
Monitoring
ML core
Ops
Configuration
Feature extraction
Machine resource management
Process management tools
Analysis tools
Serving infrastructure
Monitoring
ML core
Data collection
Data verification
Development and Release Process
Machine Learning and Data Processing Workflow
Data ingestion
Feature engineering
Model selection / validation
Serving / production
Prototyping / training
Data Science/ML platform
Developer's point of view
Revisited Machine Learning and Data Processing Workflow
Data ingestion
Data processing
Insight serving / production
Something important
Data Science/ML platform
Ops engineer's point of view
Scheduler
Workflow management
ML magic
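From the ops engineer's point of view, the platform is mostly a scheduler that runs steps in dependency order. A minimal sketch of that idea, assuming the stages named on the slide form a simple linear chain; the dependency graph and the `run_step` placeholder are illustrative assumptions, not something the talk specifies:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each step maps to the steps it depends on.
WORKFLOW = {
    "data_ingestion": set(),
    "data_processing": {"data_ingestion"},
    "ml_magic": {"data_processing"},
    "insight_serving": {"ml_magic"},
}


def run_step(name: str) -> str:
    """Placeholder for real work (a container, a Spark job, and so on)."""
    return f"{name}: done"


def run_workflow(graph: dict[str, set[str]]) -> list[str]:
    """Run every step only after all of its dependencies have run."""
    order = TopologicalSorter(graph).static_order()
    return [run_step(step) for step in order]


if __name__ == "__main__":
    for line in run_workflow(WORKFLOW):
        print(line)
```

A real scheduler adds retries, parallel branches and resource requests on top of this ordering, which is the part a workflow engine takes over.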
Some solutions
Decision tree
Are your services relying on HDFS as persistent storage?
Are your tasks mostly ETL-like? (ETL > Apps)
Do you mostly need to run and deploy apps? (ETL < Apps)
Branch labels: NO, NO, YES, YES, YES
Blueprint
DS/ML Platform blueprint components
UI and exposed API/Contracts
Integrations with third-party service providers
Platform/Engine to set up, manage and execute business logic
Data Science and ML code
DS/ML Platform blueprint components
Application runtimes and serving
MLP UI/API
Sandbox
'Big' data processing toolset
Data Science and Machine Learning toolset
Release management
Data ingestion system
Resource management system
Encryption, secret management
Infrastructure (VM, Network, Disk, GPU)
Scheduler and workflow management
User management
Monitoring/log management
Blueprint: ML Platform on GCP
MVP on GCP
MongoDB + REST facade
kubectl, k8s UI
GCP Datalab
BigQuery, Cloud ML Engine
Python code
Argo
GCP Kubernetes cluster
GCP VM, Cloud Storage, Persistent Disk
Argo CLI, Argo UI
G Suite + K8s RBAC
GCP Stackdriver, K8s logs
Apache Beam, Google Dataflow
Google Pub/Sub, custom connectors
GCP BigQuery, Google Cloud Storage
Allocation
Ingest (Data Platform)
ML Processing (Training)
Serving
ML Platform
BigQuery tables
Data bucket
Cloud Datalab
Custom framework
Cloud Machine Learning
Container registry
Custom application
Argo
Kubernetes
Persistent Disk
Cloud Pub/Sub
Integration with Data Platform
ML Processing (Training)
Serving
ML Platform
Cloud Datalab
Custom framework
Cloud Machine Learning
Container registry
Custom application
Argo
Kubernetes
Persistent Disk
ML Platform
Data Platform API
Data Processing
Cloud Dataflow
GCS data bucket
GCS preprocessing bucket
Cloud Pub/Sub
Ingest (internal)
Data Sources (external): Adobe, Experian, Facebook, Interflora, SAS, Calyx
BG Tables
Objects
BigQuery tables
Blueprint: ML Platform on Hybrid Cloud
Use case
Data sources: SQL, NoSQL, Other
On-premise services: HDFS
k8s
GCP services: HDFS API (Google storage), Google Persistent Disk, Google storage, HBase API, BigTable, ALS-API
Workflow/Scheduler: Argo
Stages: ETL, Training, Serving, Validation
Arrow labels: GET, Produce, GET/Produce, Deploy, Post, Copy
Steps are numbered 1 to 9 in the diagram.
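The diagram shows Argo driving the ETL, Training, Serving and Validation stages on k8s. A hedged sketch of what such a pipeline could look like as an Argo Workflow manifest, generated from Python. The stage order simply follows the slide, and the workflow name, image names and commands are hypothetical placeholders rather than anything shown in the talk:

```python
import json

# Stage order follows the slide; it is an assumption that they run in sequence.
STAGES = ["etl", "training", "serving", "validation"]


def build_argo_workflow(stages: list[str]) -> dict:
    """Build an Argo Workflow manifest that runs the stages one after another."""
    dag_tasks = []
    for index, stage in enumerate(stages):
        task = {"name": stage, "template": stage}
        if index > 0:
            # Each stage waits for the previous one to succeed.
            task["dependencies"] = [stages[index - 1]]
        dag_tasks.append(task)

    # One container template per stage; images and commands are hypothetical.
    container_templates = [
        {
            "name": stage,
            "container": {
                "image": f"example.registry/ml-platform/{stage}:latest",
                "command": ["python", "-m", f"pipeline.{stage}"],
            },
        }
        for stage in stages
    ]

    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": "ml-pipeline-"},
        "spec": {
            "entrypoint": "main",
            "templates": [{"name": "main", "dag": {"tasks": dag_tasks}}]
            + container_templates,
        },
    }


if __name__ == "__main__":
    # Kubernetes tooling accepts JSON manifests as well as YAML.
    print(json.dumps(build_argo_workflow(STAGES), indent=2))
```

Generating the manifest from a list of stages keeps the dependency chain in one place, so adding or reordering a stage is a one-line change.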
MVP on GCP and on-premise Datacenter
Scala REST facade
kubectl, k8s UI
JupyterHub
MLflow
Python code
Argo
GCP Kubernetes cluster
GCP VM, Cloud Storage, Persistent Disk
Web UI (custom app)
G Suite + K8s RBAC, ADFS 2.0
GCP Stackdriver, K8s logs, ELK, Prometheus
Apache Spark
Google Pub/Sub, custom connectors
BigTable, Redis
On-premise Hadoop cluster
Allocation
Ingest (Data Platform)
ML Processing (Training)
Serving
ML Platform
BigQuery tables (Feature Store)
Data bucket
Container registry
Custom application
Argo
Kubernetes
Persistent Disk
Cloud Pub/Sub
On-premise HDFS cluster
DWH
Kafka cluster
MLflow
Custom ML code (Python)
Spark on k8s
Custom ML workflow UI
JupyterHub
Demo
Demo: Recommendation System
Data sources: SQL, NoSQL, Other
On-premise services: HDFS
k8s
GCP services: HDFS API (Google storage), Google Persistent Disk, Google storage, HBase API, BigTable, ALS-API
Workflow/Scheduler: Argo
Stages: ETL, Training, Serving, Validation
Arrow labels: GET, Produce, GET/Produce, Deploy, Post, Copy
Steps are numbered 1 to 9 in the diagram.
Some numbers
・ Reduced development time by 90%
・ More efficient use of resources (VMs, disk, network)
ー Reduced resource usage by up to 70% using k8s autoscaling and ephemeral objects
・ Faster release of new models (from a month to hours)
・ Reduced the time of the “ETL, model training, serving” workflow from 24 hours to 3 hours
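To put the last figure in the same terms as the other percentages, the 24-hour to 3-hour change can be expressed as a relative reduction and as a speedup. A small sketch; the helper names are illustrative and only the two durations come from the slide:

```python
def reduction_percent(before: float, after: float) -> float:
    """Relative reduction, as a percentage of the original value."""
    return (before - after) / before * 100


def speedup(before: float, after: float) -> float:
    """How many times shorter the new duration is."""
    return before / after


if __name__ == "__main__":
    # Workflow duration quoted on the slide: 24 hours before, 3 hours after.
    print(f"{reduction_percent(24, 3):.1f}% less time")  # 87.5% less time
    print(f"{speedup(24, 3):.0f}x faster")  # 8x faster
```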
Some conclusions
・ We see some pivoting from Hadoop-only solutions to more general-purpose solutions such as Kubernetes (Kubeflow), GCP ML, Amazon ML
・ Back to SQL as the main interface for working with DS/ML platforms
・ ML/DS solutions are still between the “genesis” and “product” stages of evolution
・ It is fun, but sometimes too much ;)
Q & A
Founded in 2006, Grid Dynamics is an engineering services company
built on the premise that cloud computing is disruptive within the
enterprise technology landscape
Thank you!
www.griddynamics.com

Yaroslav Ravlinko «Data Science at scale. Next generation data processing platforms» Lviv DevOps Conference 2019
