Data Orchestration for the Hybrid Cloud Era
Peter Behrakis and Alex Ma - Alluxio
Agenda
• Market
• Alluxio Vision
• What is Data Orchestration
• How can Alluxio help you?
Enterprises have organically created a legacy of data silos through short-term,
narrowly focused projects and through mergers & acquisitions.
Data Lakes and Silos Abound
▪ Data lakes and critical data are often siloed and challenging to access
▪ Consolidating data lakes and silos is expensive and slow to
complete
▪ Compute is everywhere
Diagram labels: Teradata, POSIX, DB, Intern apps, Public Clouds, S3, Object, HDFS 1, HDFS 2
4 Big Trends Driving the Need for a New Architecture
▪ Separation of compute & storage
▪ Hybrid and multi-cloud environments
▪ Self-service data across the enterprise
▪ Rise of the object store
▪ Data volume, velocity and variety are growing rapidly: data doubles every two years*
▪ The business knows that data analytics and ML models let it compete effectively*
▪ Hadoop investments are being replaced by object storage (on-prem and cloud)
▪ The enterprise is a multi-cloud world and will remain so for some time
▪ Technical leadership wants the agility to run applications anywhere to sustain
operations, while offering users a transparent self-service experience
▪ Technical organizations struggle to keep up with data ingest and business demands
▪ Data is still not fully optimized, yet many costly copies exist
* “The Fourth Industrial Revolution”, by Klaus Schwab
Market Summary
Alluxio’s Vision
Accelerate analytics and machine learning to enable companies to grow
and remain relevant regardless of where their data and compute are
located.
What can 2X to 5X analytics acceleration do for:
● Fraud protection
● Research for treatments for diseases like COVID-19
● Uptime for all industrial and digital technologies we depend on
What is Data Orchestration?
A platform that brings your data closer to compute across
clusters, regions, clouds, and countries to accelerate results
Companies Using Alluxio
Industries: Consumer, Travel & Transportation, Telco & Media, Technology, Financial Services, Retail & Entertainment, Data & Analytics Services
Companies use Alluxio to …
• Gain faster results that matter to the business – advanced caching
technology
• Dramatically lower OpEx by eliminating data management and cloud
egress costs – unified namespace and API translations
• Drop into existing on-prem and cloud environments with zero programming
Data Accessibility
Translate access to optimal storage APIs over a slow network
Data Orchestration for the
Cloud
Application interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
Storage drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver
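The interface and driver layers listed above suggest a lookup that maps an application-facing path to the storage system that actually holds the data. A minimal sketch of that idea, assuming a longest-prefix mount table; the `Mount` class, the `resolve` function, and every prefix and URI below are hypothetical and are not Alluxio's API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mount:
    logical_prefix: str   # path as seen by applications
    under_store_uri: str  # where the data actually lives


# Hypothetical mount table; prefixes and URIs are made up for illustration.
MOUNTS = [
    Mount("/sales", "hdfs://namenode:8020/warehouse/sales"),
    Mount("/reports", "s3://example-bucket/reports"),
    Mount("/", "file:///mnt/nfs/shared"),
]


def resolve(logical_path: str) -> str:
    """Translate a logical path to an under-store URI via the longest matching prefix."""
    best = None
    best_prefix = ""
    for mount in MOUNTS:
        prefix = mount.logical_prefix.rstrip("/")
        matches = logical_path == prefix or logical_path.startswith(prefix + "/")
        if matches and (best is None or len(prefix) > len(best_prefix)):
            best, best_prefix = mount, prefix
    if best is None:
        raise KeyError(f"no mount covers {logical_path}")
    suffix = logical_path[len(best_prefix):]
    return best.under_store_uri.rstrip("/") + suffix


print(resolve("/sales/2020/q1.csv"))
print(resolve("/reports/a.pdf"))
```

The application only ever sees `/sales/...` or `/reports/...`; which driver serves the request is decided by the mount table, which is the sense in which the namespace is unified.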
Hybrid Data Lake with Alluxio
A Data Orchestration Approach
Approaches to Hybrid Cloud
Lift and Shift
▪ Migration may seem easier as no application re-architecture is needed
Issues
▪ If workloads are not made cloud-native and elastic, infrastructure cost can skyrocket
▪ If an on-prem data copy needs to be maintained, syncing cloud and on-prem data can be hard
Data copy by workload
▪ Simple tools such as DistCp are available
▪ Works for workloads with easily identifiable datasets
Issues
▪ Datasets for many workloads cannot always be identified easily
▪ Significantly more data is transferred than the workload requires
▪ Additional copies are very hard to sync back with master data
▪ Performance can be dramatically impacted due to cloud storage limitations
Compute-driven Data Caching
▪ Data is pulled into the cloud based on compute requests
▪ Data is cached locally to reduce I/O on remote clusters and is automatically synced
Issues
▪ Less helpful for workloads that don’t read a data set more than once
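The trade-off of compute-driven caching can be illustrated with a small read-through cache. This is a sketch under simple assumptions (an unbounded in-memory cache and a fake remote store); `ReadThroughCache` and `fake_remote` are hypothetical names, not part of any product:

```python
class ReadThroughCache:
    """Caches blocks fetched from a slow remote store on first access."""

    def __init__(self, remote_read):
        self._remote_read = remote_read  # callable: key -> bytes (stand-in for the remote cluster)
        self._cache = {}
        self.remote_reads = 0            # number of reads that had to go to the remote store

    def read(self, key):
        if key not in self._cache:
            self.remote_reads += 1
            self._cache[key] = self._remote_read(key)
        return self._cache[key]


def fake_remote(key):
    # Hypothetical remote store that just fabricates a payload.
    return f"data-for-{key}".encode()


# Workload that re-reads the same three blocks five times.
repeated = ReadThroughCache(fake_remote)
for _ in range(5):
    for key in ("a", "b", "c"):
        repeated.read(key)

# Workload that scans fifteen distinct blocks once.
scan_once = ReadThroughCache(fake_remote)
for i in range(15):
    scan_once.read(f"block-{i}")

print(repeated.remote_reads, scan_once.remote_reads)  # 3 15
```

Both workloads issue fifteen reads, but only the repeated one avoids remote I/O, which is why caching is less helpful when data is read only once.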
Our Solution: “Zero-Copy Burst”
Problem: HDFS cluster is compute-bound & complex to maintain
Barrier 1: Prohibitive network latency and bandwidth limits
• Makes hybrid analytics unfeasible
Barrier 2: Copying data to cloud
• Difficult to maintain copies
• Data security and governance
• Costs of another silo
Step 1: Hybrid Cloud for Burst Compute Capacity
• Offload on-prem cluster (both compute & I/O)
• Manage working set, not FULL set of data
• Local performance
• Automatic synchronization with on-prem changes
Step 2: Online Migration of Data Per Policy
• Flexible timing to migrate, with fewer dependencies
• Instead of a hard switchover, migrate at your own pace
• Moves the data per policy, e.g. last 7 days
Diagram labels: Google Cloud Platform (Spark, Presto, Hive, TensorFlow; Alluxio Data Orchestration and Control Service; GCS); On-Premises Datacenter (Spark, Presto, Hive, TensorFlow; Alluxio Data Orchestration and Control Service); Connectivity
Case Studies
Alluxio at Walmart
Architectural Components
• Alluxio is co-located with Presto for data locality
• Automatic metadata synchronization to create Hive tables with Alluxio mount points
• Auto-scaling to maintain a minimum number of Alluxio workers
• Pin frequently used data to avoid cache evictions
Alluxio at Walmart: Takeaways
• 2x performance for range queries
• High concurrency with Alluxio
• Cost reduction: half the compute costs, or 2x compute capacity for the same environment
Alluxio at Adobe: Cross Data Center Access (remote data)
PROBLEM
The primary DC, with a large Hadoop cluster, is out of space, and ad hoc SQL workloads are
growing exponentially as analyst headcount has reached 1,800 people.
SOLUTION
Leverage compute resources outside of the primary on-prem DC for multiple analytical frameworks.
RESULTS
● 80% less network usage
● More stable infrastructure
● Lower costs
● Results come in faster
● Easier to scale
● Ability to handle new analysts with no impact, and improved response times
● Self-service for end users
Alluxio at Electronic Arts (EA)
Single Cloud with AWS
• Up to 6x performance when handling a large number of small files
• Elastic compute to reduce infrastructure costs
• Reduce S3 costs by eliminating S3 access operations
Core Features
Enable a Hybrid Data Lake
Data Locality with Intelligent Multi-tiering
Local Performance from remote data using multi-tier storage
Tiers: Hot (RAM), Warm (SSD), Cold (HDD)
Read & Write Buffering
Transparent to App
Policies for pinning,
promotion/demotion, TTL
On-premises | Public Cloud
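The tiering policies named on this slide (pinning, promotion/demotion, TTL) can be sketched with a toy three-tier cache. Everything here is an assumption made for illustration: capacities are counted in entries, demotion picks the least recently used unpinned entry, and pinned entries are never demoted or expired. `TieredCache` is a hypothetical class, not Alluxio's implementation:

```python
from collections import OrderedDict

TIERS = ["RAM", "SSD", "HDD"]  # hot, warm, cold


class TieredCache:
    def __init__(self, capacities, ttl_seconds):
        self.capacities = capacities  # entries per tier, e.g. {"RAM": 2, "SSD": 2, "HDD": 2}
        self.ttl = ttl_seconds
        self.tiers = {name: OrderedDict() for name in TIERS}  # key -> last-access time
        self.pinned = set()

    def tier_of(self, key):
        for name in TIERS:
            if key in self.tiers[name]:
                return name
        return None

    def pin(self, key):
        self.pinned.add(key)

    def access(self, key, now):
        """Record a read: drop expired entries, then promote the key to the hot tier."""
        self._expire(now)
        current = self.tier_of(key)
        if current is not None:
            del self.tiers[current][key]
        self.tiers["RAM"][key] = now
        self._rebalance()

    def _expire(self, now):
        # TTL: remove unpinned entries not accessed within the last `ttl` seconds.
        for entries in self.tiers.values():
            stale = [k for k, t in entries.items() if now - t > self.ttl and k not in self.pinned]
            for k in stale:
                del entries[k]

    def _rebalance(self):
        # Demotion: push the oldest unpinned entry of an over-full tier one tier down.
        for i, name in enumerate(TIERS):
            entries = self.tiers[name]
            while len(entries) > self.capacities[name]:
                victim = next((k for k in entries if k not in self.pinned), None)
                if victim is None:
                    break  # only pinned entries remain; tolerate the overflow
                last_access = entries.pop(victim)
                if i + 1 < len(TIERS):
                    self.tiers[TIERS[i + 1]][victim] = last_access
                # otherwise the entry is evicted from the coldest tier


cache = TieredCache({"RAM": 2, "SSD": 2, "HDD": 2}, ttl_seconds=100)
cache.access("a", now=0)
cache.pin("a")
for t, key in enumerate(["b", "c", "d"], start=1):
    cache.access(key, now=t)
print({k: cache.tier_of(k) for k in "abcd"})
```

In this run the pinned key `a` stays in RAM while `b` and `c` are demoted to SSD as newer keys arrive; reading `b` again would promote it back to RAM.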
Metadata Locality with “Active Sync”
Detect on-prem changes and synchronize metadata
Diagram: a mutation replaces the old file at path /file1 with a new file at the same path; the Alluxio Master learns of the change through HDFS iNotify-based metadata synchronization.
Policies for pinning, promotion/demotion, TTL
On-premises | Public Cloud
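Event-driven metadata synchronization of this kind can be sketched as applying a stream of mutation events to a cached metadata map. The `Event` and `FileMeta` shapes below are assumptions made for illustration; they are not the HDFS inotify event schema or Alluxio's internal types:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileMeta:
    length: int
    mtime: int


@dataclass
class Event:
    kind: str                       # "create", "modify", or "delete" (hypothetical kinds)
    path: str
    meta: Optional[FileMeta] = None


def apply_events(metadata, events):
    """Apply under-store mutation events to the cached metadata, in order."""
    for ev in events:
        if ev.kind in ("create", "modify"):
            metadata[ev.path] = ev.meta
        elif ev.kind == "delete":
            metadata.pop(ev.path, None)
        else:
            raise ValueError(f"unknown event kind: {ev.kind}")
    return metadata


# Cached view before the on-prem change.
cached = {"/file1": FileMeta(length=100, mtime=1)}
feed = [
    Event("modify", "/file1", FileMeta(length=250, mtime=2)),  # old file replaced by new file
    Event("create", "/file2", FileMeta(length=10, mtime=3)),
    Event("delete", "/file2"),
]
apply_events(cached, feed)
print(cached)
```

After the feed is applied, `/file1` reflects the new file and `/file2` is gone, without the cloud side having to re-list the on-prem directory.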
Policy Driven Data Migration
Migrate Data to Cloud Storage based on Access Policies
Diagram: hdfs://host:port/directory/ containing Reports and Sales
• Single Alluxio path backed by multiple storage systems
• Example policy: Migrate data older than 7 days from HDFS to S3
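The example policy can be read as a simple age filter. A minimal sketch, assuming age is measured from each file's last-modified time (the slide does not say which timestamp is used); `select_for_migration` and the file names are hypothetical:

```python
from datetime import datetime, timedelta


def select_for_migration(files, now, max_age=timedelta(days=7)):
    """Return the paths whose last-modified time is older than max_age."""
    cutoff = now - max_age
    return sorted(path for path, mtime in files.items() if mtime < cutoff)


now = datetime(2020, 6, 15)
files = {
    "/sales/2020-06-01.csv": datetime(2020, 6, 1),
    "/sales/2020-06-10.csv": datetime(2020, 6, 10),
    "/reports/q1.pdf": datetime(2020, 4, 1),
}
to_migrate = select_for_migration(files, now)
print(to_migrate)  # ['/reports/q1.pdf', '/sales/2020-06-01.csv']
```

Because applications address a single path backed by multiple storage systems, the selected files could move from HDFS to S3 without their paths changing for readers.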
Reference Architecture
Diagram components: Alluxio Master and Standby Master (ZooKeeper / RAFT); Alluxio Clients; Alluxio Workers (RAM / SSD / HDD); WAN; Under Store 1 and Under Store 2; Control Path and Data Path
Alluxio Catalog Service
Diagram labels: Hive Metastore, Hive Under Database
Functionality
• Manages metadata for structured data
• Abstracts other database catalogs as Under Database (UDB)
Benefits
• Schema-aware optimizations
• Simple deployment
Transform data to be compute-optimized
independent of the storage format
▪ Coalesce
▪ Format conversion: CSV to Parquet
Transformation Service
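Coalescing means merging many small files into fewer, larger ones so that the query engine opens fewer objects. A minimal standard-library sketch of that step for CSV parts that share a header; the Parquet conversion is left out because it needs a third-party library, and `coalesce_csv` is a hypothetical helper, not the Transformation Service's API:

```python
import csv
import io


def coalesce_csv(parts):
    """Merge CSV parts that share a header into a single CSV string."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header = None
    for text in parts:
        rows = list(csv.reader(io.StringIO(text)))
        if not rows:
            continue  # skip empty parts
        if header is None:
            header = rows[0]
            writer.writerow(header)  # write the header once
        elif rows[0] != header:
            raise ValueError("parts have different headers")
        writer.writerows(rows[1:])
    return out.getvalue()


small_files = [
    "id,amount\n1,10\n",
    "id,amount\n2,20\n3,30\n",
]
merged = coalesce_csv(small_files)
print(merged)
```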
▪ Attached existing Hive database into Alluxio Catalog
▪ Alluxio Catalog served table metadata for Presto
▪ Transformed store_sales by coalescing and converting CSV to Parquet
Presto without Alluxio: 20s
Alluxio transformations: 7s
Alluxio transformations with caching: 3s
Example Results
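The three timings translate into speedups relative to the Presto-only baseline. The arithmetic below uses only the numbers reported on the slide:

```python
timings = {
    "Presto without Alluxio": 20.0,
    "Alluxio transformations": 7.0,
    "Alluxio transformations with caching": 3.0,
}
baseline = timings["Presto without Alluxio"]

# Speedup = baseline time / measured time.
speedups = {name: baseline / seconds for name, seconds in timings.items()}
for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
# Presto without Alluxio: 1.0x
# Alluxio transformations: 2.9x
# Alluxio transformations with caching: 6.7x
```

Transformations alone give roughly 2.9x, and adding caching brings the same query to roughly 6.7x faster than the baseline.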
Questions?
How can Alluxio help you?
• Did you learn what Alluxio Data Orchestration is?
• Do you have a use case Alluxio can accelerate?
For follow-up questions and to discuss your situation, please contact Peter at
peter@alluxio.com
I. Burst data lake processing to Dataproc using on-prem Hadoop data
https://cloud.google.com/blog/products/data-analytics/burst-data-lake-processing-dataproc-using-prem-hadoop-data
II. Tutorial: Hybrid Cloud Bursting with GCP and Alluxio
https://docs.alluxio.io/ee/user/stable/en/tutorials/GCP-Tutorial.html
III. “Zero-Copy” Hybrid Cloud for Data Analytics
https://www.alluxio.io/resources/whitepapers/zero-copy-hybrid-cloud-for-data-analytics-strategy-architecture-and-benchmark-report/
IV. Getting Started with Dataproc and Alluxio
https://docs.alluxio.io/ee/user/stable/en/cloud/Google-Dataproc.html
V. Using Transparent URI
https://docs.alluxio.io/ee/user/stable/en/operation/Transparent-Uri.html
Additional Resources
