Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack

Matt Fuller | Co-founder & VP, Engineering
Dipti Borkar | VP, Product

About Me
Matt Fuller
Co-Founder at Starburst
matt@starburstdata.com
www.linkedin.com/in/mfuller/

Starburst: SQL on Anything
Query anything, anywhere

Company Overview
Founded 2017
• Team includes the creators of Presto and many of the largest committers, contributors, and community members of Presto
• Former Facebook, Teradata, Vertica, Netezza, and Ab Initio
Enterprise Presto Offering
• AWS, Azure, GCP, On Premises
• Kubernetes

Why Presto?
Speed: Fast federated ANSI SQL engine
● Proven scalability
● High concurrency
● Cost-based query optimization
Efficiency: Separation of storage & compute
● Scale storage & compute independently
● No ETL required
● SQL-on-anything
Freedom: Open source; no vendor lock-in
● No Hadoop vendor lock-in
● No storage vendor lock-in
● No cloud vendor lock-in
● Community driven

Why Starburst?
Even Faster Speed: Starburst Distro performs faster
● Fully tested, stable releases
● Curated by the Presto creators
● Most up-to-date cost-based query optimizer
Enterprise-Grade Features: Security, automation & connectors
● RBAC + data encryption
● Automated cluster deployment
● Auto scaling + graceful shutdown
● 36+ connectors
24x7 Support: From the Presto experts
● 24x7 we’ve got your back
● Hot fixes + security patches
● Access to customer success team of data architects

Presto Architecture
[Diagram labels: COORDINATOR (Parser, Optimizer, Scheduler); WORKER nodes (Processor); DATA SOURCES, including Azure SQL Database]

Presto Extensibility with Connectors
Presto Coordinator
• Metadata SPI
• Data Statistics SPI
Presto Worker
• Data Stream SPI
• Data Location SPI
Connectors shown behind each SPI: Distributed, Cassandra, Kafka, Teradata, Snowflake

Starburst Product Offerings
Starburst Presto Community
Free version of Starburst Presto that includes limited additional features.
Starburst Presto Enterprise
Starburst Presto built for the enterprise that includes additional features &
connectors, security integrations, premium 24x7 support, rigorous testing, patch
releases/hotfixes, long term support, additional tooling, and cloud integrations.

Distributed Storage Connector
• Access data stored in scalable and cost effective storage
○ HDFS
○ AWS S3
○ Google GCS
○ Azure Blob & ADLS
○ S3-Compatible (e.g. MinIO, Ceph)
• Schema information stored in Hive Metastore or AWS
Glue Catalog
• Uses “Hive-Style” Table format
• Partitions and Bucketing are recognized and used
• Does not use Hive runtime to perform execution
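As a rough illustration of why partitions matter, the sketch below models the “Hive-Style” layout, in which each partition is a directory named key=value, and shows how a filter on the partition key lets an engine skip whole directories. The bucket name, table path, and helper names are hypothetical; this is not how Presto implements pruning, only the idea behind it.

```python
from typing import Dict, List


def parse_partition(path: str) -> Dict[str, str]:
    """Extract key=value partition segments from a Hive-style object path."""
    parts: Dict[str, str] = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            parts[key] = value
    return parts


def prune(paths: List[str], key: str, wanted: str) -> List[str]:
    """Keep only the files whose partition value matches the filter."""
    return [p for p in paths if parse_partition(p).get(key) == wanted]


# Hypothetical table layout on an object store (names are illustrative only).
FILES = [
    "s3://example-bucket/trades/region=us/part-0000.orc",
    "s3://example-bucket/trades/region=us/part-0001.orc",
    "s3://example-bucket/trades/region=eu/part-0000.orc",
]

if __name__ == "__main__":
    # A filter such as region = 'us' only needs the first two files.
    for f in prune(FILES, "region", "us"):
        print(f)
```

Only the files under region=us are listed for reading; the region=eu directory is never opened.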

Relational Database Connectivity
• Query relational data through Presto
as the consumption layer
• Federate over multiple data sources
• MySQL
• PostgreSQL
• Redshift
• SQL Server
• Google BigQuery
• Oracle
• DB2
• Teradata
• Snowflake

Non Relational Data Sources
• Apache Accumulo
• Apache Cassandra
• Apache Phoenix
• Elasticsearch
• Apache Kafka
• Apache Kudu
• MongoDB
• Redis

The Alluxio Story
Originated as the Tachyon project at UC Berkeley’s AMPLab,
by then Ph.D. student & now Alluxio CTO, Haoyuan (H.Y.) Li.
2015
Open Source project established & company to
commercialize Alluxio founded
Goal: Orchestrate Data for the Cloud for data driven apps
such as Big Data Analytics, ML and AI.
Focus: Accelerating modern app frameworks running on
HDFS/S3-based data lakes or warehouses
Recognition: Hot top 10 Big Data (2020); Impact 50 (2019); Trend-setting product (2019)

Alluxio: Data-Driven Innovation Across Industries
• Consumer
• Travel & Transportation
• Telco & Media
• Technology
• Financial Services
• Retail & Entertainment
• Data & Analytics Services

Data Orchestration for the Cloud
Interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
Drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver
Enable innovation with any frameworks
running on data stored anywhere
Data Analyst
Data Engineer
Storage Ops
Data Scientist
Lines of Business

Alluxio Data Orchestration for the Cloud
• Structured Data Catalog
• Intelligent Caching
• Data Transformation
• Data Management
• Global Namespace

Public Cloud IaaS
Spark Presto Hive TensorFlow
Alluxio Data Orchestration and Control Service
Alluxio enables compute!
Alluxio Cloud Data Orchestration
Problem: Object stores have inconsistent performance for analytics and AI workloads
§ SLAs are hard to achieve
§ S3 metadata operations are expensive
§ Copied data storage costs add up, making the solution expensive
Solution: Consistent high performance
• Performance increases range from 1.5X to 10X
• Operational costs reduced by up to 80%

Takeaways
• Nearly 2x reduction in query time for small range queries
• Much more concurrency with Alluxio
• This means ½ the compute costs or 2x more capacity with the same environment
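The arithmetic behind the last takeaway can be sketched as follows. It assumes throughput scales linearly with node count, which is a simplification, and the baseline figures (10 nodes, 100 queries per hour) are made up for illustration.

```python
def nodes_for_same_throughput(baseline_nodes: float, speedup: float) -> float:
    """Nodes needed to keep the original throughput after a speedup."""
    return baseline_nodes / speedup


def throughput_on_same_nodes(baseline_qph: float, speedup: float) -> float:
    """Queries per hour on the original cluster after a speedup."""
    return baseline_qph * speedup


if __name__ == "__main__":
    speedup = 2.0  # the roughly 2x gain quoted in the takeaways
    # Option 1: keep the workload, shrink the cluster.
    print(nodes_for_same_throughput(10, speedup))
    # Option 2: keep the cluster, run more work.
    print(throughput_on_same_nodes(100, speedup))
```

With a 2x speedup, the hypothetical 10-node cluster either shrinks to 5 nodes for the same work or serves 200 queries per hour instead of 100.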

Now Available: Starburst Presto + Alluxio on AWS
▪ AWS AMI pre-configured to speed up Presto queries using Alluxio caching
▪ 2x - 5x performance boost depending on dataset and workload
▪ Tutorial: https://www.alluxio.io/products/aws/starburst-alluxio-cft-tutorial/
▪ AWS Marketplace: https://aws.amazon.com/marketplace/pp/Starburst-Starburst-Enterprise-Presto-with-Caching/B07ZTHJ9YF

Data engineers are not efficient when data is not available
[Diagram: Compute is elastic (2–5 mins) and Storage is elastic (2–5 mins), but the Dataset is not elastic (2–4 weeks): request data, request review, find dataset, code script/job, run ETL jobs, grant permissions.]

Goal: Enable data workloads in the cloud on existing
on-prem data
Restrictions
§ Data cannot be persisted in a public cloud
§ Additional I/O capacity cannot be added to existing Hadoop infrastructure
§ On-prem level security needs to be maintained
§ Network bandwidth utilization needs to be minimal
Alternatives
§ Lift and Shift
§ Data copy by workload
§ “Zero-copy” Bursting

Problem: HDFS cluster is compute-bound & complex to maintain
[Diagram labels: AWS Public Cloud IaaS running Spark, Presto, Hive, TensorFlow; Connectivity; On Premises Datacenter]
Barrier 1: Prohibitive network latency
and bandwidth limits
• Makes hybrid analytics unfeasible
Barrier 2: Copying data to cloud
• Difficult to maintain copies
• Data security and governance
• Costs of another silo
Step 1: Hybrid Cloud for Burst Compute Capacity
• Orchestrates compute access to on-prem data
• Working set of data, not FULL set of data
• Local performance
• Scales elastically
• On-Prem Cluster Offload (both Compute & I/O)
Step 2: Online Migration of Data Per Policy
• Flexible timing to migrate, with fewer dependencies
• Instead of a hard switchover, migrate at your own pace
• Moves the data per policy – e.g. last 7 days
“Zero-copy” bursting to scale to the cloud

Alluxio Reference Architecture
[Diagram labels: Alluxio Master and Standby Master (Zookeeper / RAFT); Applications, each with an Alluxio Client; Alluxio Workers, each with RAM / SSD / HDD; WAN; Under Store 1, Under Store 2]

Feature Highlight: Data Caching for faster compute
[Diagram: a framework sends data requests through a cache with RAM, SSD, and Disk tiers in front of two buckets, Trades and Customers. Requests shown: read file /trades/us, read file /trades/us again, read file /trades/top. Bucket access is labeled "Variable latency with throttling".]
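The caching behavior on this slide can be approximated with a read-through cache: the first read of a path goes to the bucket, and repeated reads are served locally. This is a minimal sketch, not Alluxio's implementation; the class, the fake object store function, and the hit/miss counters are all hypothetical.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """Serve repeated reads from local storage; fetch remotely on a miss."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch                  # slow path, e.g. an object store read
        self._store: Dict[str, bytes] = {}   # fast path, stands in for RAM/SSD
        self.hits = 0
        self.misses = 0

    def read(self, path: str) -> bytes:
        if path in self._store:
            self.hits += 1
            return self._store[path]
        self.misses += 1
        data = self._fetch(path)
        self._store[path] = data             # keep a copy for the next reader
        return data


def fake_object_store(path: str) -> bytes:
    """Stand-in for a remote bucket read with variable latency."""
    return f"contents of {path}".encode()


def demo() -> ReadThroughCache:
    cache = ReadThroughCache(fake_object_store)
    cache.read("/trades/us")    # miss: fetched from the bucket
    cache.read("/trades/us")    # hit: served from the cache
    cache.read("/trades/top")   # miss: a different file
    return cache


if __name__ == "__main__":
    c = demo()
    print(c.hits, c.misses)
```

Only the first read of each file pays the object store's variable latency; the second read of /trades/us never leaves the cache.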

“Zero-copy” bursting under the hood
[Diagram: a framework sends data requests through a RAM cache in front of a Trades Directory and a Customers Directory. Requests shown: read file /trades/us, read file /trades/top, read file /trades/us again. Access to the directories is labeled "Variable latency with throttling".]

Feature Highlight: Intelligent Tiering for resource efficiency
[Diagram: a framework reads file /customers/145 through RAM, SSD, and Disk tiers in front of two buckets, Trades and Customers. RAM is out of memory, so data is moved to another tier. Bucket access is labeled "Variable latency with throttling".]
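One way to picture the tiering step is a stack of fixed-size tiers in which the least recently written entry is demoted when a tier overflows. The sketch below is an assumption-laden toy: capacities count entries rather than bytes, there is no promotion on read, and the class and method names are hypothetical rather than Alluxio's API.

```python
from collections import OrderedDict
from typing import List, Optional, Tuple


class TieredStore:
    """Toy tiered cache: overflow in one tier demotes data to the next."""

    def __init__(self, tiers: List[Tuple[str, int]]) -> None:
        self.names = [name for name, _ in tiers]
        self.capacity = [cap for _, cap in tiers]
        self.tiers = [OrderedDict() for _ in tiers]

    def put(self, path: str, data: bytes, level: int = 0) -> None:
        if level >= len(self.tiers):
            return  # fell past the last tier: no longer cached
        tier = self.tiers[level]
        tier[path] = data
        tier.move_to_end(path)
        if len(tier) > self.capacity[level]:
            # Tier is full: demote its oldest entry to the next tier down.
            old_path, old_data = tier.popitem(last=False)
            self.put(old_path, old_data, level + 1)

    def tier_of(self, path: str) -> Optional[str]:
        for name, tier in zip(self.names, self.tiers):
            if path in tier:
                return name
        return None


def demo() -> TieredStore:
    store = TieredStore([("RAM", 1), ("SSD", 1), ("Disk", 1)])
    store.put("/trades/us", b"...")
    store.put("/customers/145", b"...")  # RAM is full, /trades/us moves down
    return store


if __name__ == "__main__":
    s = demo()
    print(s.tier_of("/customers/145"), s.tier_of("/trades/us"))
```

After the second write, the newer file sits in RAM and the older one has moved to SSD, so it is still served locally instead of being re-read from the bucket.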

Feature Highlight: Policy-driven Data Management
[Diagram labels: Framework; RAM, SSD, Disk; New Trades]
Policy defined: Move data > 90 days old to
S3 Standard
Policy interval: every day
Policy applied every day
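An age-based policy like the one on this slide can be sketched as a rule that is evaluated once per interval and selects files older than a threshold. The slide does not spell out the full move target, so the destination below is a hypothetical placeholder, as are the file names, dates, and function names; this is not Alluxio's policy syntax.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class AgePolicy:
    max_age: timedelta
    destination: str  # hypothetical label for the target store


def select_for_move(files: Dict[str, datetime], policy: AgePolicy,
                    now: datetime) -> List[str]:
    """Return the paths whose last-modified time is older than the policy allows."""
    cutoff = now - policy.max_age
    return sorted(path for path, modified in files.items() if modified < cutoff)


# Hypothetical inventory: path -> last modified time.
FILES = {
    "/trades/2020-01": datetime(2020, 1, 15),
    "/trades/2020-05": datetime(2020, 5, 20),
}

if __name__ == "__main__":
    policy = AgePolicy(max_age=timedelta(days=90), destination="archive-store")
    # A scheduler would run this once per policy interval (every day on the slide).
    for path in select_for_move(FILES, policy, now=datetime(2020, 6, 1)):
        print(f"move {path} -> {policy.destination}")
```

On the assumed run date, only /trades/2020-01 is older than 90 days and is selected; the newer file stays where it is until a later daily run.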

Alluxio Structured Data Management Preview
[Diagram labels: Presto with Hive Connector and Alluxio Connector; Alluxio Caching Service, Alluxio Catalog Service, Alluxio Transformation Service; Hive Metastore; Storage]

Starburst Presto + Alluxio AMI & CFT
AMI & CFT:
https://aws.amazon.com/marketplace/pp/Starburst-Starburst-Enterprise-Presto-with-Caching/B07ZTHJ9YF
Documentation:
https://docs.starburstdata.com/latest/aws/deploy_caching.html
Tutorial:
https://www.alluxio.io/products/aws/starburst-alluxio-cft-tutorial/

Questions?
Matt Fuller | matt@starburstdata.com
Dipti Borkar | dipti@alluxio.com
