ENABLE FAST BIG DATA ANALYTICS ON CEPH WITH ALLUXIO
Adit Madan
March 2017
ABOUT ME
Adit Madan, Software Engineer @ Alluxio, Inc
Master’s @ Carnegie Mellon University
Bachelor’s @ Indian Institute of Technology, Delhi
Email: adit@alluxio.com
ALLUXIO INTRODUCTION
FASTEST-GROWING BIG DATA PROJECT
• Fastest-growing open-source project in the big data ecosystem
• 400+ contributors from 100+ organizations
• Running the world’s largest production clusters
• Join the community!
BIG DATA ECOSYSTEM YESTERDAY / TODAY / WITH ALLUXIO
[Diagram: three panels titled "Big Data Ecosystem Yesterday", "Big Data Ecosystem Today", and "Big Data Ecosystem with Alluxio"]
BIG DATA ECOSYSTEM ISSUES
[Diagram labels: HDFS Interface, Amazon S3 Interface, Swift Interface, GlusterFS Interface]
Enabling Applications to Access Data from any Storage System at Memory-speed
• Native File System
• Hadoop Compatible File System
• Native Key-Value Interface
• FUSE Compatible File System
WHY ALLUXIO
Co-located with compute, provides memory-speed access to data
Virtualized across different storage systems under a unified global namespace
Distributed system, scale-out architecture
Software only, no changes needed to existing applications
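As an illustration of the "unified global namespace" point, the sketch below resolves a path in one logical namespace to a location in one of several storage systems by longest-prefix match. It is a toy model only: the mount points, the URIs, and the `resolve` helper are hypothetical and are not Alluxio's API.

```python
# Toy model of a unified namespace: one logical tree, several backing stores.
# Mount points and URIs are made-up examples, not taken from the talk.
MOUNTS = {
    "/": "hdfs://namenode:9000/alluxio",
    "/ceph": "swift://analytics",
    "/archive": "s3a://cold-data/archive",
}


def resolve(path: str) -> str:
    """Map a logical path to a storage URI using the longest matching mount."""
    best = max(
        (m for m in MOUNTS if path == m or path.startswith(m.rstrip("/") + "/")),
        key=len,
    )
    suffix = path[len(best):].lstrip("/")
    base = MOUNTS[best].rstrip("/")
    return f"{base}/{suffix}" if suffix else base


print(resolve("/ceph/logs/2017-03.txt"))
print(resolve("/tmp/scratch"))
```

An application in this model only ever sees paths such as `/ceph/logs/...`; which storage system actually holds the bytes is decided by the mount table.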
ALLUXIO BENEFITS
• Unification: New workflows across any data in any storage system
• Performance: Orders of magnitude improvement in run time
• Flexibility: Choice in compute and storage; grow each independently, buy only what is needed
USE CASE – ACCELERATE I/O TO/FROM REMOTE STORAGE
• Compute and Storage Separation
  • Advantages
    • Meet different compute and storage hardware requirements efficiently
    • Scale compute and storage independently
    • Store data in traditional filers/SANs and object stores cost-effectively
    • Compute on data in existing storage via big data computational frameworks
  • Disadvantage
    • Accessing data requires remote I/O
USE CASE WITHOUT ALLUXIO
[Diagram: Spark and Storage. Labels: "Low latency, memory throughput" and "High latency, network throughput"]
USE CASE WITH ALLUXIO
[Diagram: Spark, Alluxio, Storage]
Keeping data in Alluxio accelerates data access
ACCELERATE I/O TO/FROM REMOTE STORAGE
“The performance was amazing. With Spark SQL alone, it took 100-150 seconds to finish a query; using Alluxio, where data may hit local or remote Alluxio nodes, it took 10-15 seconds.” - Baidu
Baidu’s PMs and analysts run interactive queries to gain insights into their products and business
• 200+ node deployment
• 2+ petabytes of storage
• Mix of memory + HDD
RESULTS
• Data queries are now 30x faster with Alluxio
• Alluxio cluster runs stably, providing over 50TB of RAM space
• By using Alluxio, batch queries usually lasting over 15 minutes were transformed into interactive queries taking less than 30 seconds
[Diagram: Alluxio, Baidu File System]
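The speedup figures on this slide can be checked with simple arithmetic. The sketch below is only a back-of-the-envelope calculation: it compares the endpoints of the quoted ranges, which is a simplification, and the `speedup` helper is a name introduced here for illustration.

```python
def speedup(before_s: float, after_s: float) -> float:
    """Return how many times faster the 'after' run is than the 'before' run."""
    return before_s / after_s


# Quoted range: 100-150 s with Spark SQL alone, 10-15 s with Alluxio.
quote_low = speedup(100, 10)
quote_high = speedup(150, 15)

# Batch queries: "over 15 minutes" down to "less than 30 seconds".
batch = speedup(15 * 60, 30)

print(quote_low, quote_high, batch)  # prints: 10.0 10.0 30.0
```

Under these assumptions the quoted query times correspond to roughly a 10x speedup, while the 15-minute-to-30-second comparison matches the 30x figure. Because the slide gives "over 15 minutes" and "less than 30 seconds" as bounds, 30x is best read as a lower bound for that comparison.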
ALLUXIO ON CEPH
[Diagram: Spark, Alluxio, Ceph Object Storage]
● Connect using RADOS Gateway
○ Swift Object Storage API
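A rough sketch of what pointing Alluxio at a Swift-compatible endpoint such as RADOS Gateway might look like is shown below. The property names are assumptions based on Alluxio's Swift under-storage documentation and should be checked against the docs for the Alluxio version in use; every value is a placeholder, and `render_properties` is a hypothetical helper that only formats text.

```python
# Sketch: render an alluxio-site.properties fragment for a Swift-compatible
# endpoint such as Ceph RADOS Gateway. Property names are assumptions taken
# from Alluxio's Swift under-storage docs; every value is a placeholder.
swift_ufs = {
    "alluxio.underfs.address": "swift://<container>/<folder>",
    "fs.swift.user": "<swift-user>",
    "fs.swift.tenant": "<swift-tenant>",
    "fs.swift.password": "<swift-password>",
    "fs.swift.auth.url": "<swift-auth-url>",
    "fs.swift.auth.method": "<swift-auth-method>",
}


def render_properties(props: dict) -> str:
    """Format key/value pairs as Java-style .properties lines."""
    return "\n".join(f"{key}={value}" for key, value in props.items())


print(render_properties(swift_ufs))
```

The output is plain text meant to be reviewed and filled in by hand; nothing here contacts Ceph or Alluxio.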
EC2 CONFIGURATION
● 1 Compute Master
○ Spark and Alluxio Masters
● 3 Compute Workers
○ Spark and Alluxio Workers
● 1 Storage Manager
○ Ceph RadosGW and Monitor
● 2 Storage Devices
○ Ceph OSDs
● Instance type: r3.xlarge
● Availability Zone: us-east-1a
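For a sense of scale, the sketch below totals the instances listed above and compares the combined memory of the compute workers with the 60GB dataset used in the demo. The 30.5 GiB figure is AWS's published memory size for r3.xlarge, not something stated on the slides, and the talk does not say how much of that memory was allocated to Alluxio, so this is only an upper bound. The dataset size is assumed to be in decimal gigabytes.

```python
# Cluster layout from the slide. The memory figure is AWS's published spec for
# r3.xlarge (30.5 GiB); how much of it Alluxio was given is not stated.
R3_XLARGE_MEM_GIB = 30.5
DATASET_GB = 60  # assumed decimal gigabytes

roles = {
    "compute master": 1,
    "compute worker": 3,
    "storage manager": 1,
    "storage device": 2,
}

total_instances = sum(roles.values())
worker_mem_gib = roles["compute worker"] * R3_XLARGE_MEM_GIB
dataset_gib = DATASET_GB * 1e9 / 2**30

print(total_instances, worker_mem_gib, round(dataset_gib, 1))
```

Under these assumptions the three workers together have more RAM than the dataset occupies, so holding the whole dataset in worker memory is at least plausible on this cluster.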
SOFTWARE VERSIONS
● Ceph Version: 0.94.9
● Alluxio Version: 1.4.0
○ Custom JOSS library 0.9.13-SNAPSHOT
● Spark Version: 1.6.1
DEMO OF THE SOLUTION
● Spark, Alluxio and Ceph Cluster pre-deployed
● Ceph pre-populated with a 60GB dataset
● Launch spark shell
a. First ‘count’
b. Second ‘count’
c. <Restart shell>
d. Third ‘count’
● Ad-hoc queries w/ Alluxio
a. ‘wordcount’ w/ intermediate data
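One plausible reading of the count sequence above is that the first count pulls data from remote storage, while later counts, including the one after a shell restart, are served from a cache that lives outside the Spark process. The sketch below is a toy model of that idea only: `UnderStore`, `CacheLayer`, and `count_lines` are hypothetical names and do not correspond to Alluxio or Spark APIs.

```python
class UnderStore:
    """Stand-in for remote object storage; counts how often it is read."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)
        self.reads = 0

    def read(self, key):
        self.reads += 1
        return self.blocks[key]


class CacheLayer:
    """Stand-in for a cache that lives outside the compute process."""

    def __init__(self, store):
        self.store = store
        self.cached = {}

    def read(self, key):
        if key not in self.cached:      # cold: fetch from remote storage
            self.cached[key] = self.store.read(key)
        return self.cached[key]         # warm: served from the cache


def count_lines(layer, keys):
    """Model of a 'count' job: read every block and count its lines."""
    return sum(len(layer.read(k).splitlines()) for k in keys)


store = UnderStore({"part-0": "a\nb\n", "part-1": "c\n"})
cache = CacheLayer(store)
keys = ["part-0", "part-1"]

first = count_lines(cache, keys)    # first 'count'
reads_after_first = store.reads
second = count_lines(cache, keys)   # second 'count', same session
# "Restart shell": the job starts fresh, but the cache is a separate service,
# so the same `cache` object is reused rather than rebuilt.
third = count_lines(cache, keys)

print(first, second, third, reads_after_first, store.reads)  # prints: 3 3 3 2 2
```

In this model the remote store is read only during the first count; the second and third counts add no further remote reads, which is the behavior a cache shared across sessions would show.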
SPARK COUNT PERFORMANCE
Count on 60 GB dataset
● 20x improvement for repeated access
FOR MORE INFORMATION ….
Please take a look at our Whitepaper!
● Blog: https://alluxio.com/blog/accelerating-data-analytics-on-ceph-object-storage-with-alluxio
● Whitepaper: https://alluxio.com/resources/accelerating-data-analytics-on-ceph-object-storage-with-alluxio
Thank you!
Contact: adit@alluxio.com or info@alluxio.com
Twitter: @Alluxio
Websites: www.alluxio.com and www.alluxio.org


Ceph Day San Jose - Enable Fast Big Data Analytics on Ceph with Alluxio
