Accelerating analytics in the cloud with the
Starburst Presto + Alluxio stack
Matt Fuller | Co-founder & VP, Engineering
Dipti Borkar | VP, Product
About Me
Matt Fuller
Co-Founder at Starburst
matt@starburstdata.com
www.linkedin.com/in/mfuller/
Starburst: SQL on Anything
Query anything, anywhere
Company Overview
Founded 2017
• Team includes the creators of Presto
and many of the largest committers,
contributors, and community
members of Presto
• Former Facebook, Teradata, Vertica,
Netezza, and Ab Initio
Enterprise Presto Offering
• AWS, Azure, GCP, On Premises
• Kubernetes
Why Presto?
Speed: Fast federated ANSI SQL engine
● Proven scalability
● High concurrency
● Cost-based query optimization
Efficiency: Separation of storage & compute
● Scale storage & compute independently
● No ETL required
● SQL-on-anything
Freedom: Open Source; No vendor lock-in
● No Hadoop vendor lock-in
● No storage vendor lock-in
● No cloud vendor lock-in
● Community driven
Why Starburst?
Even Faster Speed: Starburst Distro performs faster
● Fully tested, stable releases
● Curated by the Presto creators
● Most up-to-date cost-based query optimizer
Enterprise-Grade Features: Security, automation & connectors
● RBAC + data encryption
● Automated cluster deployment
● Auto scaling + graceful shutdown
● 36+ connectors
24x7 Support: From the Presto experts
● 24x7 we’ve got your back
● Hot fixes + security patches
● Access to customer success team of data architects
Presto Architecture
[Diagram labels: Coordinator (Parser, Optimizer, Scheduler); Workers; Processors; Data Sources, including Azure SQL Database]
Presto Extensibility with Connectors
Presto Coordinator
• Metadata SPI
• Data Statistics SPI
Presto Worker
• Data Stream SPI
• Data Location SPI
Connectors shown implementing each SPI: Distributed, Cassandra, Kafka, Teradata, Snowflake
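The four SPIs above can be pictured as one small contract that every data source implements. The following is a minimal sketch only: the class and method names (`Connector`, `list_tables`, `row_count`, `splits`, `read_split`) are hypothetical and are not the actual Presto SPI, and the in-memory source stands in for a real system such as Cassandra or Kafka.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List, Tuple

Row = Tuple


class Connector(ABC):
    """Hypothetical connector contract; NOT the real Presto SPI."""

    @abstractmethod
    def list_tables(self) -> List[str]:
        """Metadata: which tables exist."""

    @abstractmethod
    def row_count(self, table: str) -> int:
        """Statistics: a size estimate a cost-based optimizer could use."""

    @abstractmethod
    def splits(self, table: str) -> List[int]:
        """Data location: independent chunks that can be scanned in parallel."""

    @abstractmethod
    def read_split(self, table: str, split: int) -> Iterator[Row]:
        """Data stream: the rows of one split."""


class InMemoryConnector(Connector):
    """Toy data source; each split is a fixed-size slice of a Python list."""

    def __init__(self, tables: Dict[str, List[Row]], split_size: int = 2):
        self._tables = tables
        self._split_size = split_size

    def list_tables(self) -> List[str]:
        return sorted(self._tables)

    def row_count(self, table: str) -> int:
        return len(self._tables[table])

    def splits(self, table: str) -> List[int]:
        # A split is identified by the offset of its first row.
        return list(range(0, len(self._tables[table]), self._split_size))

    def read_split(self, table: str, split: int) -> Iterator[Row]:
        yield from self._tables[table][split:split + self._split_size]


def scan(connector: Connector, table: str) -> List[Row]:
    """Read every split of a table through the connector contract only."""
    rows: List[Row] = []
    for split in connector.splits(table):
        rows.extend(connector.read_split(table, split))
    return rows


demo = InMemoryConnector({"trades": [(1, "us"), (2, "uk"), (3, "us")]})
all_rows = scan(demo, "trades")
```

Because `scan` only touches the abstract methods, a different source can be swapped in without changing the engine side, which is the point the slide makes about extensibility.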
Starburst Product Offerings
Starburst Presto Community
Free version of Starburst Presto that includes limited additional features.
Starburst Presto Enterprise
Starburst Presto built for the enterprise that includes additional features &
connectors, security integrations, premium 24x7 support, rigorous testing, patch
releases/hotfixes, long term support, additional tooling, and cloud integrations.
Distributed Storage Connector
• Access data stored in scalable and cost effective storage
○ HDFS
○ AWS S3
○ Google GCS
○ Azure Blob & ADLS
○ S3-Compatible (e.g. MinIO, Ceph)
• Schema information stored in Hive Metastore or AWS
Glue Catalog
• Uses “Hive-Style” Table format
• Partitions and Bucketing are recognized and used
• Does not use Hive runtime to perform execution
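To illustrate what "partitions are recognized and used" can mean in practice, here is a minimal sketch of partition pruning over a Hive-style directory layout. The bucket name, table layout, and the helper names `parse_partition` and `prune` are assumptions made for this example; a real engine would obtain partition values from the metastore rather than by parsing paths.

```python
from typing import Dict, List


def parse_partition(path: str) -> Dict[str, str]:
    """Extract key=value directory segments from a Hive-style path."""
    parts: Dict[str, str] = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts


def prune(paths: List[str], predicate: Dict[str, str]) -> List[str]:
    """Keep only files whose partition values match every equality predicate."""
    return [
        p for p in paths
        if all(parse_partition(p).get(k) == v for k, v in predicate.items())
    ]


# Hypothetical layout for a table partitioned by ds and country.
FILES = [
    "s3://example-bucket/trades/ds=2020-01-01/country=us/part-0.orc",
    "s3://example-bucket/trades/ds=2020-01-01/country=uk/part-0.orc",
    "s3://example-bucket/trades/ds=2020-01-02/country=us/part-0.orc",
]

# A filter on country = 'us' only needs to open two of the three files.
selected = prune(FILES, {"country": "us"})
```

Files in non-matching partitions are never read, which is why partition layout matters for queries against object storage.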
Relational Database Connectivity
• Query relational data through Presto
as the consumption layer
• Federate over multiple data sources
• MySQL
• PostgreSQL
• Redshift
• SQL Server
• Google BigQuery
• Oracle
• DB2
• Teradata
• Snowflake
Non Relational Data Sources
• Apache Accumulo
• Apache Cassandra
• Apache Phoenix
• Elasticsearch
• Apache Kafka
• Apache Kudu
• MongoDB
• Redis
The Alluxio Story
Originated as the Tachyon project at UC Berkeley’s AMP Lab,
created by then Ph.D. student and now Alluxio CTO, Haoyuan (H.Y.) Li.
2015
Open Source project established & company to
commercialize Alluxio founded
Goal: Orchestrate Data for the Cloud for data driven apps
such as Big Data Analytics, ML and AI.
Focus: Accelerating modern app frameworks running on
HDFS/S3-based data lakes or warehouses
Awards: Hot top 10 Big Data (2020); Impact 50 (2019); Trend-setting product (2019)
Alluxio: Data-Driven Innovation Across Industries
• Consumer
• Travel & Transportation
• Telco & Media
• Technology
• Financial Services
• Retail & Entertainment
• Data & Analytics Services
Data Orchestration for the Cloud
Enable innovation with any frameworks running on data stored anywhere
• Users: Data Analyst, Data Engineer, Storage Ops, Data Scientist, Lines of Business
• Interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
• Drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver
Alluxio Data Orchestration for the Cloud
• Structured Data Catalog
• Intelligent Caching
• Data Transformation
• Data Management
• Global Namespace
Where are you in the cloud journey?
“I’m all in the cloud”
“I want a hybrid cloud”
“I want to migrate”
“Hadoop in the DC”
• EMR w/ S3 | EC2 installed
• Dataproc w/ GCS | GCE installed
• HDInsight w/ Blob | VM installed
“Separate Compute & Storage Tiers”
Alluxio Cloud Data Orchestration
Alluxio enables compute!
[Diagram: Spark, Presto, Hive, and TensorFlow on public cloud IaaS, running on the Alluxio Data Orchestration and Control Service]
Problem: Object stores have inconsistent performance for analytics
and AI workloads
§ SLAs are hard to achieve
§ S3 metadata operations are expensive
§ Copied data storage costs add up, making the solution expensive
Solution: Consistent high performance
• Performance increases range from 1.5X to 10X
• Dramatically reduced operational costs, by up to 80%
Takeaways
• Nearly 2x reduction in query time
for small range queries
• Much more concurrency
with Alluxio
• This means ½ the
compute costs or 2x
more capacity with the
same environment
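The last takeaway is simple arithmetic, sketched below under two loud assumptions: compute cost scales linearly with the hours the cluster is busy, and queries run back to back. All numbers are invented for illustration; none of them come from the benchmark.

```python
def cluster_hours(queries: int, minutes_per_query: float) -> float:
    """Hours the cluster is busy if queries run back to back (an assumption)."""
    return queries * minutes_per_query / 60


def compute_cost(queries: int, minutes_per_query: float,
                 dollars_per_cluster_hour: float) -> float:
    """Cost under the assumption that billing is linear in busy hours."""
    return cluster_hours(queries, minutes_per_query) * dollars_per_cluster_hour


# Illustrative numbers only.
QUERIES = 1000
RATE = 20.0  # dollars per cluster-hour

baseline = compute_cost(QUERIES, minutes_per_query=6.0, dollars_per_cluster_hour=RATE)
with_cache = compute_cost(QUERIES, minutes_per_query=3.0, dollars_per_cluster_hour=RATE)

# Same busy hours as the baseline, but each query now takes half as long:
# how many queries fit?
capacity_with_cache = cluster_hours(QUERIES, 6.0) * 60 / 3.0
```

Halving the time per query either halves the bill for the same workload or doubles the number of queries the same cluster can serve.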
Now Available: Starburst Presto + Alluxio on AWS
▪ AWS AMI pre-configured to speed up Presto
queries using Alluxio caching
▪ 2x - 5x performance boost depending on
dataset and workload
▪ Tutorial: https://www.alluxio.io/products/aws/starburst-alluxio-cft-tutorial/
▪ AWS Marketplace: https://aws.amazon.com/marketplace/pp/Starburst-Starburst-Enterprise-Presto-with-Caching/B07ZTHJ9YF
Data engineers are not efficient when data is not available
• Compute: 2–5 mins, elastic
• Storage: 2–5 mins, elastic
• Dataset: 2–4 weeks, not elastic
  Request data → Request review → Find dataset → Code script/job → Run ETL jobs → Grant permissions
Goal: Enable data workloads in the cloud on existing
on-prem data
Restrictions
§ Data cannot be persisted in a public cloud
§ Additional I/O capacity cannot be added to existing Hadoop infrastructure
§ On-prem level security needs to be maintained
§ Network bandwidth utilization needs to be minimal
Alternatives
§ Lift and Shift
§ Data copy by workload
§ “Zero-copy” Bursting
“Zero-copy” bursting to scale to the cloud
Problem: HDFS cluster is compute-bound & complex to maintain
[Diagram: Spark, Presto, Hive, and TensorFlow on the Alluxio Data Orchestration and Control Service, shown both in the AWS public cloud (IaaS) and in the on-premises datacenter, with connectivity between the two]
Barrier 1: Prohibitive network latency
and bandwidth limits
• Makes hybrid analytics infeasible
Barrier 2: Copying data to cloud
• Difficult to maintain copies
• Data security and governance
• Costs of another silo
Step 1: Hybrid Cloud for Burst Compute Capacity
• Orchestrates compute access to on-prem data
• Working set of data, not FULL set of data
• Local performance
• Scales elastically
• On-Prem Cluster Offload (both Compute & I/O)
Step 2: Online Migration of Data Per Policy
• Flexible timing to migrate, with fewer dependencies
• Instead of a hard switchover, migrate at your own pace
• Moves the data per policy, e.g. last 7 days
High Level Architecture
Alluxio Reference Architecture
[Diagram labels: Applications, each with an Alluxio Client; Alluxio Master and Standby Master (Zookeeper / RAFT); Alluxio Workers (RAM / SSD / HDD); WAN; Under Store 1, Under Store 2]
Feature Highlight: Data Caching for faster compute
[Animated diagram: frameworks (Spark, Presto, Hive, TensorFlow) send data requests through Alluxio (RAM, SSD, Disk) to the Trades and Customers buckets, which have variable latency with throttling. Requests shown: read file /trades/us, read file /trades/top, read file /trades/us again]
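The request sequence in the caching slide can be mimicked with a read-through cache. This is a minimal sketch, not Alluxio's implementation: `ReadThroughCache` and `fake_object_store` are hypothetical names, the cache is unbounded, and the "object store" is a local function that records each remote call.

```python
from typing import Callable, Dict, List


class ReadThroughCache:
    """Serve repeated reads locally; go to the remote store only on a miss."""

    def __init__(self, fetch: Callable[[str], bytes]):
        self._fetch = fetch
        self._blocks: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def read(self, path: str) -> bytes:
        if path in self._blocks:
            self.hits += 1
            return self._blocks[path]
        # First access: fetch from the remote store and keep a local copy.
        self.misses += 1
        data = self._fetch(path)
        self._blocks[path] = data
        return data


remote_calls: List[str] = []


def fake_object_store(path: str) -> bytes:
    """Stand-in for a slow, throttled bucket; records every remote read."""
    remote_calls.append(path)
    return f"contents of {path}".encode()


cache = ReadThroughCache(fake_object_store)
for p in ["/trades/us", "/trades/us", "/trades/top", "/trades/top", "/trades/us"]:
    cache.read(p)
```

Only the first read of each path reaches the remote store; the "read again" requests are served from the local copy, which is where the latency and request-cost savings come from.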
“Zero-copy” bursting under the hood
[Animated diagram: frameworks (Spark, Presto, Hive, TensorFlow) send data requests through Alluxio (RAM) to the Trades and Customers directories, which have variable latency with throttling. Requests shown: read file /trades/us, read file /trades/top, read file /trades/us again]
Feature Highlight - Intelligent Tiering for resource efficiency
[Diagram: frameworks (Spark, Presto, Hive, TensorFlow) send data requests through Alluxio (RAM, SSD, Disk) to the Trades and Customers buckets, which have variable latency with throttling. Reading file /customers/145 when RAM is out of memory causes data to be moved to another tier]
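The "out of memory, data moved to another tier" step can be sketched as demotion between ordered tiers. This is an illustration under stated assumptions, not Alluxio's eviction logic: `TieredStore` is a hypothetical class, capacity is counted in files rather than bytes, and the least recently written file is the one demoted.

```python
from collections import OrderedDict
from typing import List, Optional, Tuple


class TieredStore:
    """Ordered tiers, fastest first; each tier holds a fixed number of files."""

    def __init__(self, tiers: List[Tuple[str, int]]):
        self._names = [name for name, _ in tiers]
        self._capacity = [cap for _, cap in tiers]
        self._data = [OrderedDict() for _ in tiers]

    def put(self, path: str, data: bytes, level: int = 0) -> None:
        if level >= len(self._data):
            return  # fell off the slowest tier; the under store still has it
        if level == 0:
            # A fresh write replaces any older copy held in a lower tier.
            for tier in self._data:
                tier.pop(path, None)
        tier = self._data[level]
        tier[path] = data
        tier.move_to_end(path)
        if len(tier) > self._capacity[level]:
            # Tier is full: demote its oldest entry to the next tier down.
            old_path, old_data = tier.popitem(last=False)
            self.put(old_path, old_data, level + 1)

    def tier_of(self, path: str) -> Optional[str]:
        for name, tier in zip(self._names, self._data):
            if path in tier:
                return name
        return None


store = TieredStore([("RAM", 2), ("SSD", 2), ("Disk", 2)])
for path in ["/trades/us", "/trades/top", "/customers/145"]:
    store.put(path, b"...")
```

With a RAM capacity of two files, loading `/customers/145` pushes `/trades/us` down to SSD instead of discarding it, so a later read still avoids the remote store.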
Feature Highlight – Policy-driven Data Management
[Diagram: frameworks (Spark, Presto, Hive, TensorFlow) on Alluxio (RAM, SSD, Disk), with New Trades data]
• Policy defined: move data > 90 days old to S3 Standard
• Policy interval: every day
• Policy applied every day
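The policy on this slide (move data older than 90 days, evaluated daily) can be sketched as a function that a scheduler would call once per day. The names `FileInfo` and `apply_policy`, the `"hdfs"` and `"s3-standard"` location labels, and the use of creation date as the age criterion are all assumptions for illustration; the slide does not say how Alluxio expresses or evaluates policies.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class FileInfo:
    path: str
    created: date
    location: str  # hypothetical label for where the file currently lives


def apply_policy(files: List[FileInfo], today: date,
                 max_age_days: int = 90, target: str = "s3-standard") -> List[str]:
    """Relocate files older than max_age_days; return the paths that moved."""
    cutoff = today - timedelta(days=max_age_days)
    moved: List[str] = []
    for f in files:
        if f.created < cutoff and f.location != target:
            f.location = target  # a real system would copy the bytes here
            moved.append(f.path)
    return moved


catalog = [
    FileInfo("/trades/2020-01-15", date(2020, 1, 15), "hdfs"),
    FileInfo("/trades/2020-05-20", date(2020, 5, 20), "hdfs"),
]

# One daily run of the policy.
first_run = apply_policy(catalog, today=date(2020, 6, 1))
# Running it again the same day finds nothing left to move.
second_run = apply_policy(catalog, today=date(2020, 6, 1))
```

Making the run idempotent means a daily interval is safe: files that already moved are skipped, and newer files are picked up only once they cross the age threshold.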
Alluxio Structured Data Management Preview
[Diagram labels: Presto with Hive Connector and Alluxio Connector; Alluxio Caching Service, Alluxio Catalog Service, Alluxio Transformation Service; Hive Metastore; Storage]
Starburst Presto + Alluxio AMI & CFT
AMI & CFT: https://aws.amazon.com/marketplace/pp/Starburst-Starburst-Enterprise-Presto-with-Caching/B07ZTHJ9YF
Documentation: https://docs.starburstdata.com/latest/aws/deploy_caching.html
Tutorial: https://www.alluxio.io/products/aws/starburst-alluxio-cft-tutorial/
Questions?
Matt Fuller | matt@starburstdata.com
Dipti Borkar | dipti@alluxio.com
