How the Development Bank of Singapore
solves on-prem compute capacity challenges
with cloud bursting
Vitaliy Baklikov | VP at DBS
Dipti Borkar | VP Product at Alluxio
About DBS
• Headquartered in Singapore
• Largest bank in South East Asia
• Present in 18 markets globally,
including 6 priority markets
• Singapore, Hong Kong, China,
India, Indonesia and Taiwan
• We have a very cool digiBank app
• And lots lots lots of data systems
[Diagram labels: AWS engines, on-prem engines, HDFS, object store]
Evolution of Data Platforms at DBS
Generation 1
• Boxed data
• Monolithic/Closed Systems
• Proprietary HW/SW
• Data for Targeted Use Cases
Generation 2
• Big Data Explosion
• Hadoop Data Lakes
• Commodity HW and Hadoop
Ecosystem
• Compute tied to Storage
Generation 3
• Data Democratization
• Cloud Native platform…
Hybrid! Multi!
• Open Source Engines
• AI/ML Centric
[Diagram labels: Teradata, Informatica, SAS, Hadoop]
Challenges
1. Data Lake built on local Object Store
– Expensive rename operation
– Object listing is slow
– Variable performance
– Data locality is gone
2. Multiple Data Silos
3. Limited on-premise compute capacity
– Legacy ITIL processes for Infra provisioning
– No dynamic scale out/in
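The first challenge can be illustrated with a toy model. This is a minimal sketch, assuming a flat key-value store with no native directory rename, which is how object stores commonly behave. `FlatObjectStore` and the key names are hypothetical and do not represent the DBS system or any real client library.

```python
class FlatObjectStore:
    """Toy flat key-value object store (hypothetical, not a real client)."""

    def __init__(self):
        self.objects = {}
        self.copies = 0   # per-object copy operations performed
        self.deletes = 0  # per-object delete operations performed

    def put(self, key, data):
        self.objects[key] = data

    def list_prefix(self, prefix):
        # There are no real directories, so a listing scans keys by prefix.
        return sorted(k for k in self.objects if k.startswith(prefix))

    def rename_prefix(self, src, dst):
        # No atomic directory rename: copy each object, then delete the original.
        for key in self.list_prefix(src):
            self.objects[dst + key[len(src):]] = self.objects[key]
            self.copies += 1
            del self.objects[key]
            self.deletes += 1


store = FlatObjectStore()
for i in range(1000):
    store.put(f"tmp/job-1/part-{i:05d}", b"x")

# Committing job output by "renaming" the directory touches every object.
store.rename_prefix("tmp/job-1/", "warehouse/table/")
print(store.copies, store.deletes)
```

In this model the cost of a rename grows with the number of objects under the prefix, whereas a filesystem such as HDFS can rename a directory as a single metadata operation.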
Alluxio at DBS
• Unified Namespace: mount HDFS from other platforms into a common Alluxio cluster
• Object store Analytics: caching layer for hot data to speed up Presto and Spark jobs
• Hybrid cloud bursting: extend the Alluxio cluster into an AWS VPC; run EMR for model training and bring the results back to on-prem
Advanced Analytics on Object Storage
The Use Case
• Cash-in-Transit use case
• Forecast cash replenishment schedule for each ATM
• Produce a delivery graph by 4am each morning
The Challenges
• Strict fixed SLA
• Need to load all history data for the forecasting model… daily!
Burst processing into the Cloud
The Use Case
– Call Center project
– Millions of calls annually
– Why do our customers call us?
– What do they do before picking up the phone?
• Reconstruct customer journey
• Predict the reason for the call
The Challenges
– Transcript quality
– Need lots of compute
• >30TB of clickstream, transaction, customer, and product data
• >20TB of audio files
– Need dynamic compute for training and analysis
High Level Architecture
Next Steps
1. Use Cloud offerings for Speech-to-Text
2. Move more workloads to the cloud
3. Extend to other clouds
Data Orchestration for the Cloud
Enable innovation with any frameworks running on data stored anywhere
• Interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
• Drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver
• Users: Data Analyst, Data Engineer, Storage Ops, Data Scientist, Lines of Business
“Zero-copy” bursting to the cloud
Problem: HDFS cluster is compute-bound & complex to maintain
[Diagram labels: AWS public cloud IaaS and on-premises datacenter, each with Spark, Presto, Hive and TensorFlow on the Alluxio Data Orchestration and Control Service; connectivity between the two]
Barrier 1: Prohibitive network latency and bandwidth limits
• Makes hybrid analytics infeasible
Barrier 2: Copying data to cloud
• Difficult to maintain copies
• Data security and governance
• Costs of another silo
Step 1: Hybrid Cloud for Burst Compute Capacity
• Orchestrates compute access to on-prem data
• Working set of data, not FULL set of data
• Local performance
• Scales elastically
• On-Prem Cluster Offload (both Compute & I/O)
Step 2: Online Migration of Data Per Policy
• Flexible timing to migrate, with fewer dependencies
• Instead of a hard switchover, migrate at your own pace
• Moves the data per policy – e.g. last 7 days
Data Elasticity via Unified Namespace
Enables effective data management across different under stores
- Uses mounting with transparent naming
Unified Namespace: Global Data Accessibility
Transparent access to under storage makes all enterprise data available locally
SUPPORTS
• HDFS
• NFS
• OpenStack
• Ceph
• Amazon S3
• Azure
• Google Cloud
IT OPS FRIENDLY
• Storage mounted into Alluxio
by central IT
• Security in Alluxio mirrors
source data
• Authentication through
LDAP/AD
• Wireline encryption
HDFS #1
Object Store
NFS
HDFS #2
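The idea of mounting with transparent naming can be sketched as a mount table that maps one logical namespace onto several under stores. This is an illustrative model only, assuming longest-prefix resolution. `MountTable`, the host names and the bucket name are hypothetical, and this is not Alluxio's API.

```python
class MountTable:
    """Toy unified namespace: logical path prefix -> under-store URI."""

    def __init__(self):
        self.mounts = {}

    def mount(self, path, ufs_uri):
        # Store both sides without a trailing slash so joins stay predictable.
        self.mounts[path.rstrip("/")] = ufs_uri.rstrip("/")

    def resolve(self, path):
        # The longest mounted prefix wins, so nested mounts override parents.
        best = None
        for mount_point in self.mounts:
            if path == mount_point or path.startswith(mount_point + "/"):
                if best is None or len(mount_point) > len(best):
                    best = mount_point
        if best is None:
            raise KeyError(f"no mount covers {path}")
        return self.mounts[best] + path[len(best):]


table = MountTable()
table.mount("/hdfs1", "hdfs://nn1:8020/data")             # hypothetical HDFS #1
table.mount("/hdfs1/archive", "hdfs://nn2:8020/archive")  # hypothetical HDFS #2
table.mount("/objects", "s3://bucket-a")                  # hypothetical object store

print(table.resolve("/hdfs1/trades/us"))
print(table.resolve("/hdfs1/archive/2019/x"))
print(table.resolve("/objects/customers/145"))
```

Applications address only the logical paths, so storage can be added or moved behind a mount point without changing job code.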
Bursting on AWS EMR
[Diagram labels: Presto, Hive, cluster metadata & data cache, compute-driven continuous sync]
Bursting on Google Cloud: Dataproc
[Diagram labels: Presto, Hive, metadata & data cache, compute-driven continuous sync, Google Dataproc cluster]
• Google Dataproc with Alluxio (init action integration available)
Alluxio Reference Architecture
[Diagram labels: applications with Alluxio clients; Alluxio master and standby master (ZooKeeper / RAFT); Alluxio workers (RAM / SSD / HDD); WAN; Under Store 1; Under Store 2]
Feature Highlight: Data Caching for faster compute
[Diagram labels: Spark, Presto, Hive, TensorFlow; framework with RAM; Trades bucket and Customers bucket with variable latency and throttling; data requests: read file /trades/us, read file /trades/us again, read file /trades/top]
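The request sequence on this slide can be modeled as a read-through cache placed in front of a slow remote store. This is a minimal sketch, assuming least-recently-used eviction and whole-file caching. `ReadThroughCache` and `slow_remote_fetch` are hypothetical names, and the real system's eviction and block handling may differ.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Toy read-through cache with least-recently-used eviction."""

    def __init__(self, fetch, capacity):
        self.fetch = fetch            # callable that reads from the remote store
        self.capacity = capacity      # maximum number of cached files
        self.entries = OrderedDict()  # path -> data, oldest first
        self.hits = 0
        self.misses = 0

    def read(self, path):
        if path in self.entries:
            self.entries.move_to_end(path)  # mark as most recently used
            self.hits += 1
            return self.entries[path]
        self.misses += 1
        data = self.fetch(path)             # only misses reach the remote store
        self.entries[path] = data
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used
        return data


remote_reads = []


def slow_remote_fetch(path):
    # Stand-in for an object store read with variable latency.
    remote_reads.append(path)
    return f"contents of {path}".encode()


cache = ReadThroughCache(slow_remote_fetch, capacity=2)
for p in ["/trades/us", "/trades/us", "/trades/top", "/trades/top", "/trades/us"]:
    cache.read(p)

print(cache.hits, cache.misses, remote_reads)
```

Only the first read of each file reaches the remote store, while repeat reads are served from the cache.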
“Zero-copy” bursting under the hood
[Diagram labels: Spark, Presto, Hive, TensorFlow; framework with RAM; Trades directory and Customers directory with variable latency and throttling; data requests: read file /trades/us, read file /trades/us again, read file /trades/top]
Feature Highlight - Intelligent Tiering for resource efficiency
[Diagram labels: Spark, Presto, Hive, TensorFlow; framework with RAM, SSD and Disk tiers; Trades bucket and Customers bucket with variable latency and throttling; data request: read file /customers/145; out of memory; data moved to another tier]
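The "out of memory, data moved to another tier" step can be sketched as a chain of bounded tiers in which overflow from a faster tier is demoted to the next slower one. This is an illustrative model, assuming one block per file, oldest-first demotion, and each key inserted once. `TieredStore` is a hypothetical name and not Alluxio's implementation.

```python
from collections import OrderedDict


class TieredStore:
    """Toy tiered cache: overflow from a fast tier is demoted to a slower one."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_blocks), fastest tier first
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]

    def put(self, key, value, level=0):
        if level >= len(self.tiers):
            return  # fell off the slowest tier, so it leaves the cache
        _, cap, blocks = self.tiers[level]
        blocks[key] = value
        if len(blocks) > cap:
            # Tier is full: demote its oldest block instead of discarding it.
            old_key, old_value = blocks.popitem(last=False)
            self.put(old_key, old_value, level + 1)

    def tier_of(self, key):
        for name, _, blocks in self.tiers:
            if key in blocks:
                return name
        return None


tiers = TieredStore([("RAM", 1), ("SSD", 1), ("Disk", 1)])
for path in ["/trades/us", "/trades/top", "/customers/145"]:
    tiers.put(path, b"block")

for path in ["/customers/145", "/trades/top", "/trades/us"]:
    print(path, tiers.tier_of(path))
```

The sketch leaves out promotion on read. A fuller model would move a block back to a faster tier when it becomes hot again.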
Feature Highlight – Policy-driven Data Management
[Diagram labels: Spark, Presto, Hive, TensorFlow; framework with RAM, SSD and Disk tiers; New Trades; S3 Standard]
Policy defined: Move data > 90 days old to
Policy interval: Every day
Policy applied every day
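An age-based policy like the one on this slide can be sketched as a selection function that a scheduler would call once per policy interval. This is a minimal sketch, assuming file age is measured from the last-modified time. `select_for_move`, the paths and the dates are hypothetical, and the move itself is left out because the slide does not show how it is done.

```python
from datetime import datetime, timedelta


def select_for_move(files, now, max_age_days=90):
    """Return the paths whose last-modified time is older than the policy age."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(path for path, modified in files.items() if modified < cutoff)


# Hypothetical catalog: path -> last-modified timestamp.
files = {
    "/trades/2019-06-01": datetime(2019, 6, 1),
    "/trades/2019-08-30": datetime(2019, 8, 30),
    "/trades/2019-12-15": datetime(2019, 12, 15),
}

# One daily evaluation of the policy.
to_move = select_for_move(files, now=datetime(2020, 1, 1))
print(to_move)
```

Keeping selection separate from the move makes the policy easy to test and lets the same function serve other thresholds, such as the 7-day example mentioned for online migration.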
Questions?
