ENABLE FAST BIG DATA ANALYTICS ON CEPH WITH ALLUXIO
Adit Madan
March 2017
ABOUT ME
Adit Madan, Software Engineer @ Alluxio, Inc
Master’s @ Carnegie Mellon University
Bachelor’s @ Indian Institute of Technology, Delhi
Email: adit@alluxio.com
ALLUXIO INTRODUCTION
FASTEST-GROWING BIG DATA PROJECT
• Fastest-growing open-source project in the big data ecosystem
• 400+ contributors from 100+ organizations
• Running the world’s largest production clusters
• Welcome to join the community!
BIG DATA ECOSYSTEM: YESTERDAY / TODAY / ISSUES / WITH ALLUXIO
Enabling applications to access data from any storage system at memory speed
Application-facing interfaces:
• FUSE Compatible File System
• Hadoop Compatible File System
• Native Key-Value Interface
• Native File System
Storage-facing interfaces:
• HDFS Interface
• Amazon S3 Interface
• Swift Interface
• GlusterFS Interface
WHY ALLUXIO
Co-located with compute, provides memory-speed access to data
Virtualized across different storage systems under a unified global namespace
Distributed system, scale-out architecture
Software only, no changes needed to existing applications
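To make the "unified global namespace" point concrete, here is a minimal Python sketch of a mount table that maps one logical path space onto several storage systems. It is an illustration only, not Alluxio's implementation: the `MountTable` class, the mount points, and the storage URIs are all hypothetical.

```python
from typing import Dict


class MountTable:
    """Maps paths in one logical namespace to URIs in different storage systems."""

    def __init__(self) -> None:
        self._mounts: Dict[str, str] = {}

    def mount(self, logical_prefix: str, storage_uri: str) -> None:
        # Normalize so "/ceph/" and "/ceph" are the same mount point.
        key = logical_prefix.rstrip("/") or "/"
        self._mounts[key] = storage_uri.rstrip("/")

    def resolve(self, logical_path: str) -> str:
        # The longest matching prefix wins, so nested mounts behave predictably.
        best = None
        for prefix in self._mounts:
            covers = (
                logical_path == prefix
                or logical_path.startswith(prefix.rstrip("/") + "/")
            )
            if covers and (best is None or len(prefix) > len(best)):
                best = prefix
        if best is None:
            raise KeyError(f"no mount covers {logical_path}")
        suffix = logical_path if best == "/" else logical_path[len(best):]
        return self._mounts[best] + suffix


# Hypothetical mounts: an object store and an HDFS directory under one namespace.
table = MountTable()
table.mount("/ceph", "swift://demo-container")
table.mount("/warehouse", "hdfs://namenode:8020/data")
```

With these mounts, `table.resolve("/ceph/logs/a.txt")` yields a `swift://` URI and `table.resolve("/warehouse/t1")` yields an `hdfs://` URI, while the application only ever sees paths in the single logical tree.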
ALLUXIO BENEFITS
Unification: New workflows across any data in any storage system
Performance: Orders of magnitude improvement in run time
Flexibility: Choice in compute and storage – grow each independently, buy only what is needed
USE CASE – ACCELERATE I/O TO/FROM REMOTE STORAGE
• Compute and Storage Separation
  • Advantages
    • Meet different compute and storage hardware requirements efficiently
    • Scale compute and storage independently
    • Store data in traditional filers/SANs and object stores cost-effectively
    • Compute on data in existing storage via big data computational frameworks
  • Disadvantage
    • Accessing data requires remote I/O
USE CASE WITHOUT ALLUXIO
[Diagram: Spark and Storage. Labels: "Low latency, memory throughput" and "High latency, network throughput"]
USE CASE WITH ALLUXIO
[Diagram: Spark, Alluxio, Storage]
Keeping data in Alluxio accelerates data access
ACCELERATE I/O TO/FROM REMOTE STORAGE
The performance was amazing. With Spark SQL alone, it took 100-150 seconds to finish a query; using Alluxio, where data may hit local or remote Alluxio nodes, it took 10-15 seconds.
- Baidu
RESULTS
• Data queries are now 30x faster with Alluxio
• Alluxio cluster runs stably, providing over 50TB of RAM space
• By using Alluxio, batch queries usually lasting over 15 minutes were transformed into an interactive query taking less than 30 seconds
Baidu’s PMs and analysts run interactive queries to gain insights into their products and business
• 200+ nodes deployment
• 2+ petabytes of storage
• Mix of memory + HDD
[Diagram labels: ALLUXIO, Baidu File System]
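As a quick sanity check on the figures quoted on this slide, the short Python sketch below computes the speedup implied by each before/after pair. The `speedup` helper is a hypothetical name introduced for illustration; the inputs are the numbers from the slide.

```python
def speedup(before_seconds: float, after_seconds: float) -> float:
    """Return how many times faster the 'after' run is than the 'before' run."""
    if after_seconds <= 0:
        raise ValueError("after_seconds must be positive")
    return before_seconds / after_seconds


# Quoted query times: 100-150 s with Spark SQL alone, 10-15 s with Alluxio.
quote_worst_case = speedup(100, 15)  # slowest "after" against fastest "before"
quote_best_case = speedup(150, 10)   # fastest "after" against slowest "before"

# Batch queries: over 15 minutes before, under 30 seconds after.
batch_lower_bound = speedup(15 * 60, 30)
```

The quoted Spark SQL times work out to roughly 6.7x to 15x, while the batch-query bullet (over 15 minutes down to under 30 seconds) implies at least 30x, which matches the "30x faster" headline.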
ALLUXIO ON CEPH
[Diagram: Spark, Alluxio, Ceph Object Storage]
● Connect using RADOS Gateway
  ○ Swift Object Storage API
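The slides do not show the configuration used to point Alluxio at Ceph through the RADOS Gateway. As a rough sketch only, the Python below renders a properties fragment for a Swift-style under store. The property names (`alluxio.underfs.address`, `fs.swift.*`), the `swiftauth` method, and the `/auth/1.0` endpoint are written from memory of the Alluxio 1.x Swift under-storage documentation and should be verified against the docs for the version in use. The container, user, tenant, password, host, and port values are hypothetical placeholders.

```python
from typing import Dict


def swift_ufs_properties(
    container: str,
    user: str,
    tenant: str,
    password: str,
    rgw_host: str,
    rgw_port: int,
) -> Dict[str, str]:
    """Build Swift under-store settings (property names are assumptions, see above)."""
    return {
        "alluxio.underfs.address": f"swift://{container}",
        "fs.swift.user": user,
        "fs.swift.tenant": tenant,
        "fs.swift.password": password,
        # RADOS Gateway exposes a Swift-compatible auth endpoint.
        "fs.swift.auth.url": f"http://{rgw_host}:{rgw_port}/auth/1.0",
        "fs.swift.auth.method": "swiftauth",
    }


def render_properties(props: Dict[str, str]) -> str:
    """Render key=value lines in insertion order."""
    return "\n".join(f"{key}={value}" for key, value in props.items())


# Hypothetical values for illustration only.
example = render_properties(
    swift_ufs_properties(
        container="demo-container",
        user="demo-user",
        tenant="demo-tenant",
        password="<secret>",
        rgw_host="rgw.example.internal",
        rgw_port=8080,
    )
)
```

Generating the fragment from a function keeps the RADOS Gateway host and port in one place; the resulting text would go into the Alluxio site properties file.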
EC2 CONFIGURATION
● 1 Compute Master
  ○ Spark and Alluxio Masters
● 3 Compute Workers
  ○ Spark and Alluxio Workers
● 1 Storage Manager
  ○ Ceph RadosGW and Monitor
● 2 Storage Devices
  ○ Ceph OSDs
● Instance type: r3.xlarge
● Availability Zone: us-east-1a
SOFTWARE VERSIONS
● Ceph Version: 0.94.9
● Alluxio Version: 1.4.0
  ○ Custom JOSS library 0.9.13-SNAPSHOT
● Spark Version: 1.6.1
DEMO OF THE SOLUTION
● Spark, Alluxio and Ceph cluster pre-deployed
● Ceph pre-populated with a 60GB dataset
● Launch spark shell
  a. First ‘count’
  b. Second ‘count’
  c. <Restart shell>
  d. Third ‘count’
● Ad-hoc queries w/ Alluxio
  a. ‘wordcount’ w/ intermediate data
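The count / count / restart / count sequence above presumably shows that data cached outside the Spark process is still warm after the shell restarts. The Python sketch below is a toy model of that sequence, not the demo itself: `ObjectStore`, `CacheService`, and `Shell` are hypothetical stand-ins for Ceph, Alluxio, and a Spark shell, and the three-line dataset replaces the 60GB one.

```python
from typing import Dict, List, Tuple


class ObjectStore:
    """Stand-in for the remote object store; counts how often it is read."""

    def __init__(self, objects: Dict[str, List[str]]) -> None:
        self._objects = objects
        self.reads = 0

    def read_lines(self, key: str) -> List[str]:
        self.reads += 1
        return self._objects[key]


class CacheService:
    """Stand-in for a cache that runs as its own service, outside any shell."""

    def __init__(self, store: ObjectStore) -> None:
        self._store = store
        self._cached: Dict[str, List[str]] = {}

    def read_lines(self, key: str) -> List[str]:
        if key not in self._cached:
            # Cold read: fetch from the remote store once and keep the result.
            self._cached[key] = self._store.read_lines(key)
        return self._cached[key]


class Shell:
    """Stand-in for one interactive shell session; holds no data of its own."""

    def __init__(self, cache: CacheService) -> None:
        self._cache = cache

    def count(self, key: str) -> int:
        return len(self._cache.read_lines(key))


def run_demo() -> Tuple[List[int], List[int]]:
    """Replay the demo steps; return the counts and the remote reads after each step."""
    store = ObjectStore({"dataset": ["line 1", "line 2", "line 3"]})
    cache = CacheService(store)
    counts, remote_reads = [], []

    shell = Shell(cache)
    for _ in range(2):  # first and second 'count'
        counts.append(shell.count("dataset"))
        remote_reads.append(store.reads)

    shell = Shell(cache)  # <Restart shell>: new session, same cache service
    counts.append(shell.count("dataset"))  # third 'count'
    remote_reads.append(store.reads)
    return counts, remote_reads
```

In this model only the first count touches the remote store; the second count and the count after the restart are both served from the cache service, because the cached data does not live inside the shell.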
SPARK COUNT PERFORMANCE
[Chart: Count on 60 GB dataset]
● 20x improvement for repeated access
FOR MORE INFORMATION ….
Please take a look at our whitepaper!
● Blog: https://alluxio.com/blog/accelerating-data-analytics-on-ceph-object-storage-with-alluxio
● Whitepaper: https://alluxio.com/resources/accelerating-data-analytics-on-ceph-object-storage-with-alluxio
Thank you!
Contact: adit@alluxio.com or info@alluxio.com
Twitter: @Alluxio
Websites: www.alluxio.com and www.alluxio.org
Enable Fast Big Data Analytics on Ceph with Alluxio at Ceph Days 2017
