Alluxio (formerly Tachyon):
Getting Started with Alluxio + Spark + S3
Calvin Jia
June 15, 2016 @ Alluxio Meetup (hosted by Intel)
Related Blog Post: http://goo.gl/MUpL0O
Who Am I?
• Calvin Jia
• SWE @ Alluxio, Inc.
• Alluxio PMC Member
• Twitter: @JiaCalvin
Outline
• Technology Overview
• Alluxio + Spark + S3
• Demo
Alluxio Ecosystem
Why Alluxio?
• Data sharing between jobs
• Data resilience during application crashes
• Consolidate memory usage and alleviate GC issues
Data Sharing Between Jobs
[Figure: two In-Memory Storage boxes, each holding block 1 and block 3, next to a further set of blocks 1-4. Label: storage engine & execution engine, same process.]
Inter-process sharing slowed down by network I/O
Data Sharing Between Jobs
[Figure: blocks 1-4; HDFS / disk holding blocks 1-4; In-Memory holding block 1, block 3 and block 4. Label: storage & execution engine separated.]
Inter-process sharing can happen at memory speed
Data Resilience during Crashes
[Figure, three build steps. Label: storage engine & execution engine, same process.
1. In-Memory Storage holds block 1 and block 3, next to a further set of blocks 1-4.
2. Crash.
3. After the crash, In-Memory Storage is no longer shown; only blocks 1-4 remain.]
Process crash requires network I/O to re-read the data
Data Resilience during Crashes
[Figure, two build steps. Label: storage & execution engine separated.
1. HDFS / disk holds blocks 1-4; In-Memory holds block 1, block 3 and block 4.
2. Crash; HDFS / disk and In-Memory are still shown with the same blocks.]
Process crash only needs memory I/O to re-read the data
Consolidating Memory
[Figure: two In-Memory Storage boxes, each holding block 1 and block 3, next to a further set of blocks 1-4. Label: storage engine & execution engine, same process.]
Data duplicated at memory-level
Consolidating Memory
[Figure: blocks 1-4; HDFS / disk holding blocks 1-4; In-Memory holding block 1, block 3 and block 4. Label: storage & execution engine separated.]
Data not duplicated at memory-level
Outline
• Technology Overview
• Alluxio + Spark + S3
• Demo
Visualizing the Stack
[Figure: storage tiers and their throughput]
• Mem: FAST, 10^4 - 10^5 MB/s
• SSD: MODERATE, 10^3 - 10^4 MB/s
• HDD: SLOW, 10^2 - 10^3 MB/s
Access labels in the figure: Only when necessary, Limited, Often
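To make the throughput ranges on this slide concrete, here is a rough arithmetic sketch. It reads the figures as powers of ten (10^4 to 10^5 MB/s for memory, 10^3 to 10^4 for SSD, 10^2 to 10^3 for HDD) and assumes a hypothetical 100 GB dataset, taken as 10^5 MB in decimal units. The object and method names are illustrative only and are not part of Alluxio or Spark.

```scala
object TierReadTime {
  // Throughput ranges in MB/s as read from the slide: (tier, low, high).
  val tiers: Seq[(String, Double, Double)] = Seq(
    ("Mem", 1e4, 1e5),
    ("SSD", 1e3, 1e4),
    ("HDD", 1e2, 1e3)
  )

  // Seconds needed to read sizeMb megabytes at mbPerSec.
  def seconds(sizeMb: Double, mbPerSec: Double): Double = sizeMb / mbPerSec

  // For each tier: (name, best-case seconds, worst-case seconds).
  def readTimes(sizeMb: Double): Seq[(String, Double, Double)] =
    tiers.map { case (name, low, high) =>
      (name, seconds(sizeMb, high), seconds(sizeMb, low))
    }

  def main(args: Array[String]): Unit = {
    val datasetMb = 1e5 // hypothetical 100 GB dataset (assumption)
    readTimes(datasetMb).foreach { case (name, best, worst) =>
      println(f"$name%-3s $best%.0f to $worst%.0f s")
    }
  }
}
```

Under these assumptions a single sequential pass over the dataset takes roughly 1 to 10 s from memory, 10 to 100 s from SSD, and 100 to 1000 s from HDD. Real jobs read in parallel and carry other overheads, so treat this as an order-of-magnitude guide only.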
When to use Alluxio
• Two or more jobs access the same dataset
• Job(s) may not always succeed
• Dataset larger than Spark JVM
• Jobs are pipelined
• Resulting data does not need to be immediately persisted
Version Selection
• Alluxio 1.1.0
– Latest released version
– Many improvements, upgrade recommended
• Spark 1.6.1
– Latest released version
– Remember to use the Spark-specific Alluxio client, i.e. build with -Pspark
– Spark 2.0 is coming out soon; we will recommend the best way to integrate it with Alluxio
API Selection
• Access data directly through the FileSystem API, but change the scheme to alluxio://
– Minimal code change
– No need to reason about application logic
• Example:
– Before: val file = sc.textFile("s3n://my-bucket/myFile")
– After: val file = sc.textFile("alluxio://master:19998/myFile")
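The scheme change in the example can be sketched as a small path-rewriting helper. This is an illustration, not part of the Alluxio or Spark API: the name `toAlluxio` is hypothetical, the default master address `master:19998` is taken from the slide, and the sketch assumes the S3 bucket is mounted at the Alluxio root, so the object key maps directly to the Alluxio path.

```scala
import java.net.URI

object AlluxioPaths {
  // Hypothetical helper: rewrites an s3n:// URI to the matching alluxio:// URI.
  // Assumes the bucket is mounted at the Alluxio root, so only the scheme and
  // authority change and the path is kept as-is.
  def toAlluxio(s3Path: String, master: String = "master:19998"): String = {
    val uri = new URI(s3Path)
    require(uri.getScheme == "s3n", s"expected an s3n:// path, got: $s3Path")
    s"alluxio://$master${uri.getPath}"
  }

  def main(args: Array[String]): Unit = {
    // The result can be passed to sc.textFile(...) in place of the S3 path.
    println(toAlluxio("s3n://my-bucket/myFile"))
  }
}
```

Keeping the rewrite in one place means the rest of the job is unchanged whether it reads from S3 directly or through Alluxio.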
Outline
• Technology Overview
• Alluxio + Spark + S3
• Demo