Spark Pipelines in the Cloud with Alluxio
Gene Pang, Alluxio, Inc.
Spark Summit EU - October 2017
About Me
•  Gene Pang
•  Software engineer @ Alluxio, Inc.
•  Alluxio open source PMC member
•  Ph.D. from AMPLab @ UC Berkeley
•  Worked at Google before UC Berkeley
•  Twitter: @unityxx
•  GitHub: @gpang
©2017 Alluxio, Inc. All Rights Reserved
Outline
1. Alluxio Overview
2. Data Pipelines
3. Experiments
History of Alluxio
Started at UC Berkeley AMPLab in summer 2012
•  Originally named Tachyon
•  Rebranded to Alluxio in early 2016
Open sourced in 2013
•  Apache License 2.0
•  Latest release: Alluxio 1.6.0
Alluxio: Unify Data at Memory Speed
Namespace Unification
Architecture Flexibility
IO Performance
Data Ecosystem with Alluxio
•  Apps only talk to Alluxio
•  Simple Add/Remove
•  No App Changes
•  In-Memory Performance
[Diagram: application interfaces to Alluxio: Native File System, Hadoop Compatible File System, Native Key-Value Interface, FUSE Compatible File System; storage interfaces from Alluxio: HDFS, Amazon S3, Swift, GlusterFS]
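"Apps only talk to Alluxio" and "No App Changes" suggest that moving a job onto Alluxio is mostly a matter of changing the paths it reads and writes. The sketch below illustrates that idea with a small path-mapping helper. The master hostname, the bucket name, and the mount table are hypothetical values chosen for illustration; 19998 is Alluxio's default master RPC port.

```python
# Hypothetical values for illustration; none of these come from the talk.
ALLUXIO_MASTER = "alluxio-master"            # assumed master hostname
ALLUXIO_PORT = 19998                         # Alluxio's default master RPC port
MOUNTS = {"s3a://my-bucket/logs": "/logs"}   # assumed mount table: under-store prefix -> Alluxio path


def to_alluxio_uri(under_store_uri: str) -> str:
    """Map an under-store URI to the alluxio:// URI that exposes the same data."""
    for prefix, mount_point in MOUNTS.items():
        # Match the prefix exactly or on a path boundary, never mid-name.
        if under_store_uri == prefix or under_store_uri.startswith(prefix + "/"):
            suffix = under_store_uri[len(prefix):]
            return f"alluxio://{ALLUXIO_MASTER}:{ALLUXIO_PORT}{mount_point}{suffix}"
    raise ValueError(f"no Alluxio mount covers {under_store_uri}")
```

A Spark job would then pass the returned path to its reader or writer in place of the S3 path, assuming the Alluxio client is available on the job's classpath.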
Next Gen Analytics with Alluxio
Apps, Data & Storage at Memory Speed
✓  Big Data/IoT
✓  AI/ML
✓  Deep Learning
✓  Cloud Migration
✓  Multi Platform
✓  Autonomous
[Diagram: application interfaces to Alluxio: Native File System, Hadoop Compatible File System, Native Key-Value Interface, FUSE Compatible File System; storage interfaces from Alluxio: HDFS, Amazon S3, Swift, GlusterFS]
Fastest Growing Big Data Open Source Projects
•  Fastest growing open-source project in the big data ecosystem
•  Running in large production clusters
•  Now: 600+ contributors from 100+ organizations
[Figure: GitHub open source contributors by month. Y-axis: number of contributors (0 to 500); x-axis: month (0 to 45). Series: Alluxio, Spark, Kafka, Redis, HDFS, Cassandra, Hive]
Outline
1. Alluxio Overview
2. Data Pipelines
3. Experiments
Data Processing Pipeline
Stage 1 → Stage 2 → Stage 3 → Stage 4
The output of each stage is the input of the next stage
Data Processing Pipeline
Sharing via common storage
Shared Storage
Stage 1 Stage 2 Stage 3 Stage 4
What about pipelines in the cloud?
Data Processing Pipeline in the Cloud
S3 (object store)
Stage 1 Stage 2 Stage 3 Stage 4
Sharing data via cloud storage slows down performance
Cloud Pipeline with Alluxio
Sharing via Alluxio memory
S3 (object store)
Stage 1 Stage 2 Stage 3 Stage 4
Sharing Data in the Cloud
Previous stage writes output to storage
Next stage reads input from storage
…
Sharing Data in the Cloud with Alluxio
Previous stage writes output to memory (instead of storage)
Next stage reads input from memory (instead of storage)
…
Alluxio enables in-memory data sharing
Faster pipeline performance
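One rough way to reason about why the sharing layer matters is a back-of-the-envelope cost model: each stage pays for its compute plus the time to read its input from, and write its output to, the shared layer. The sketch below is only such a model. The `Stage` fields and the throughput parameters are hypothetical, and it ignores real effects such as parallelism, request latency, and caching; it is not the method used for the measurements in this talk.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    compute_s: float   # seconds spent computing
    input_gb: float    # GB read from the shared layer
    output_gb: float   # GB written to the shared layer


def pipeline_time(stages, read_gb_per_s, write_gb_per_s):
    """Total seconds when every stage reads and writes through one shared layer."""
    total = 0.0
    for s in stages:
        total += s.compute_s                    # compute cost is the same in either setup
        total += s.input_gb / read_gb_per_s     # time to read the previous stage's output
        total += s.output_gb / write_gb_per_s   # time to write this stage's output
    return total
```

With the compute cost held fixed, only the two I/O terms change when the shared layer changes, so the model makes explicit that a faster sharing layer shortens every stage boundary in the pipeline.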
Alluxio – Fast Durable Writes
Improves write performance without sacrificing fault tolerance
•  Synchronously write to replicas in Alluxio memory
•  Asynchronously write to underlying storage
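The two steps above (synchronous replication in memory, asynchronous persistence) can be sketched as a toy in-process model. This is not Alluxio's implementation or API: the class name, the use of dictionaries as stand-ins for worker memory and S3, and the default of two replicas are all assumptions made for illustration.

```python
import queue
import threading


class FastDurableStore:
    """Toy model: synchronous in-memory replication, asynchronous persistence."""

    def __init__(self, num_replicas=2):
        self.replicas = [{} for _ in range(num_replicas)]  # stand-ins for worker memory
        self.under_storage = {}                            # stand-in for S3
        self._pending = queue.Queue()
        # Background thread that drains queued writes into under storage.
        threading.Thread(target=self._persist_loop, daemon=True).start()

    def write(self, path, data):
        # Synchronous step: the write returns only after every replica holds the data.
        for replica in self.replicas:
            replica[path] = data
        # Asynchronous step: persistence is queued and happens in the background.
        self._pending.put((path, data))

    def _persist_loop(self):
        while True:
            path, data = self._pending.get()
            self.under_storage[path] = data
            self._pending.task_done()

    def flush(self):
        """Block until every queued write has reached under storage."""
        self._pending.join()
```

In this model the caller's latency depends only on the in-memory copies, while the multiple replicas cover the window before the background write lands in under storage.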
[Diagram: the client's write is completed once the data is in Alluxio memory; the data is then asynchronously written to storage]
Outline
1. Alluxio Overview
2. Data Pipelines
3. Experiments
Log Pipeline in Amazon Web Services
Generate → Parquet → Transform → Aggregate
•  Generate: [MapReduce] Create random CSV log data
•  Parquet: [MapReduce] Convert CSV to Parquet format
•  Transform: [Spark] Update column values
•  Aggregate: [Spark] Compute group by / aggregate
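To make the shape of the four stages concrete, here is a single-process sketch in plain Python. It swaps in the standard library for MapReduce and Spark, replaces the Parquet conversion with parsing into typed records, and hands data between stages in memory rather than through S3 or Alluxio. The column names and value ranges are hypothetical.

```python
import csv
import io
import random
from collections import Counter


def generate(num_rows, seed=0):
    """Generate stage: random CSV log data (columns are hypothetical)."""
    rng = random.Random(seed)
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["user", "status", "bytes"])
    for _ in range(num_rows):
        writer.writerow([f"user{rng.randint(1, 3)}",
                         rng.choice([200, 404, 500]),
                         rng.randint(1, 1000)])
    return buf.getvalue()


def parse(csv_text):
    """Stand-in for the Parquet stage: convert CSV text into typed records."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [{"user": r["user"], "status": int(r["status"]), "bytes": int(r["bytes"])}
            for r in rows]


def transform(records):
    """Transform stage: update a column value (here, upper-case the user)."""
    return [{**r, "user": r["user"].upper()} for r in records]


def aggregate(records):
    """Aggregate stage: group by user and sum the bytes column."""
    totals = Counter()
    for r in records:
        totals[r["user"]] += r["bytes"]
    return dict(totals)


# Each stage's output is the next stage's input.
result = aggregate(transform(parse(generate(100))))
```

In the experiment described here, each of these hand-offs is instead a write to and a read from the shared layer, which is where the choice between S3 and Alluxio comes in.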
Log Pipeline Environment
•  r4.2xlarge instances (61 GB RAM, 8 CPUs)
•  1 master, 3 workers
•  Apache Spark 2.2.0
•  Apache Hadoop 2.7.2
•  Alluxio 1.6.0
•  Generate 12 GB of logs
•  Compare AWS S3 vs Alluxio w/ Fast Durable Writes
[Figure: Average Stage Completion Time]
[Figure: Pipeline Completion Time]
Over 9x speedup!
Alluxio and Pipelines in the Cloud
•  Alluxio enables in-memory sharing for data pipelines in the cloud
•  Alluxio’s Fast Durable Write feature increases performance without sacrificing fault tolerance
Thank you!
Gene Pang
gene@alluxio.com
Twitter: @unityxx
Twitter.com/alluxio
Linkedin.com/alluxio
Website: www.alluxio.com
E-mail: info@alluxio.com
