Spark Pipelines in the Cloud with Alluxio with Gene Pang

Spark Pipelines in the Cloud with Alluxio
Gene Pang, Alluxio, Inc.
Spark Summit EU - October 2017

About Me
•  Gene Pang
•  Software engineer @ Alluxio, Inc.
•  Alluxio open source PMC member
•  Ph.D. from AMPLab @ UC Berkeley
•  Worked at Google before UC Berkeley
•  Twitter: @unityxx
•  GitHub: @gpang
©2017 Alluxio, Inc. All Rights Reserved

Outline
1. Alluxio Overview
2. Data Pipelines
3. Experiments

History of Alluxio
Started at UC Berkeley AMPLab in summer 2012
•  Originally named Tachyon
•  Rebranded to Alluxio in early 2016
Open sourced in 2013
•  Apache License 2.0
•  Latest release: Alluxio 1.6.0

Alluxio: Unify Data at Memory Speed
Namespace Unification
Architecture Flexibility
IO Performance

Data Ecosystem with Alluxio
•  Apps only talk to Alluxio
•  Simple Add/Remove
•  No App Changes
•  In-Memory Performance
[Diagram: applications reach Alluxio through the Native File System, Hadoop Compatible File System, Native Key-Value Interface, or Fuse Compatible File System; Alluxio reaches storage through the HDFS, Amazon S3, Swift, and GlusterFS interfaces.]

Next Gen Analytics with Alluxio
[Diagram: the same application interfaces (Native File System, Hadoop Compatible File System, Native Key-Value Interface, Fuse Compatible File System) and storage interfaces (HDFS, Amazon S3, Swift, GlusterFS).]
Apps, Data & Storage at Memory Speed
✓  Big Data/IoT
✓  AI/ML
✓  Deep Learning
✓  Cloud Migration
✓  Multi Platform
✓  Autonomous

Fastest Growing Big Data Open Source Projects
Fastest growing open-source project in the big data ecosystem
Running in large production clusters
Now: 600+ contributors from 100+ organizations
[Chart: GitHub Open Source Contributors by Month. Y-axis: number of contributors (0 to 500); x-axis ticks: 0 to 45. Series: Alluxio, Spark, Kafka, Redis, HDFS, Cassandra, Hive.]

Outline
1. Alluxio Overview
2. Data Pipelines
3. Experiments

Data Processing Pipeline
Stage 1 → Stage 2 → Stage 3 → Stage 4
Output of each stage is the input of the next stage

Data Processing Pipeline
Sharing via common storage
Shared Storage
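
The hand-off between stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a temporary local directory stands in for the shared storage, and the names `run_stage`, `run_pipeline`, and the `stageN.json` files are hypothetical, not part of Spark or Alluxio.

```python
import json
import tempfile
from pathlib import Path


def run_stage(shared_dir: Path, stage: int, transform) -> Path:
    """Read the previous stage's output from shared storage, then write this stage's output."""
    in_path = shared_dir / f"stage{stage - 1}.json"
    out_path = shared_dir / f"stage{stage}.json"
    records = json.loads(in_path.read_text())
    out_path.write_text(json.dumps([transform(r) for r in records]))
    return out_path


def run_pipeline(seed, transforms):
    """Chain stages so that each stage's output file is the next stage's input."""
    with tempfile.TemporaryDirectory() as tmp:
        shared_dir = Path(tmp)  # hypothetical stand-in for the shared storage
        out_path = shared_dir / "stage0.json"
        out_path.write_text(json.dumps(seed))
        for stage, transform in enumerate(transforms, start=1):
            out_path = run_stage(shared_dir, stage, transform)
        return json.loads(out_path.read_text())


# Three stages: add one, multiply by ten, convert to string.
result = run_pipeline([1, 2, 3], [lambda x: x + 1, lambda x: x * 10, str])
print(result)
```

Every stage pays one read and one write against the shared store, which is why the speed of that store dominates pipeline time.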

What about pipelines in the cloud?

Data Processing Pipeline in the Cloud
S3 (object store)

Sharing data via cloud storage slows down performance

Cloud Pipeline with Alluxio
Sharing via Alluxio memory
S3 (object store)

Sharing Data in the Cloud
Previous stage writes output to storage
Next stage reads input from storage
…

Sharing Data in the Cloud with Alluxio
Previous stage writes output to memory instead of storage
Next stage reads input from memory instead of storage
…

Alluxio enables in-memory data sharing
Faster pipeline performance

Alluxio – Fast Durable Writes
Improves write performance without sacrificing fault tolerance

Synchronously write to replicas in Alluxio memory
Asynchronously write to underlying storage
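
The two steps above can be modeled with a toy Python class. This is a sketch of the idea only, not Alluxio's implementation or API: the class `FastDurableWriter`, its in-process dictionaries, and the background thread are all assumptions made for illustration.

```python
import queue
import threading


class FastDurableWriter:
    """Toy model: synchronous in-memory replicas, asynchronous persistence."""

    def __init__(self, replicas: int = 2):
        self.memory = [dict() for _ in range(replicas)]  # stand-ins for worker memory
        self.storage = {}                                # stand-in for underlying storage
        self._pending = queue.Queue()
        self._flusher = threading.Thread(target=self._flush_loop, daemon=True)
        self._flusher.start()

    def write(self, key: str, data: bytes) -> None:
        # Synchronous part: the write returns once every replica holds the data.
        for replica in self.memory:
            replica[key] = data
        # Asynchronous part: persistence is queued and happens in the background.
        self._pending.put((key, data))

    def _flush_loop(self) -> None:
        while True:
            key, data = self._pending.get()
            self.storage[key] = data
            self._pending.task_done()

    def wait_until_persisted(self) -> None:
        # Block until every queued write has reached the storage stand-in.
        self._pending.join()


writer = FastDurableWriter(replicas=2)
writer.write("stage1/part-0", b"rows")
in_memory = all(r.get("stage1/part-0") == b"rows" for r in writer.memory)
writer.wait_until_persisted()
persisted = writer.storage.get("stage1/part-0") == b"rows"
print(in_memory, persisted)
```

The caller only waits for the memory replicas; the replicas cover the window before the background copy to storage finishes.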

[Diagram: client and storage; the write is completed.]

[Diagram: client and storage; the data is asynchronously written to storage.]

Outline
1. Alluxio Overview
2. Data Pipelines
3. Experiments

Log Pipeline in Amazon Web Services
Generate → Parquet → Transform → Aggregate
Generate: [MapReduce] Create random CSV log data
Parquet: [MapReduce] Convert CSV to Parquet format
Transform: [Spark] Update column values
Aggregate: [Spark] Compute group by / aggregate
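
To make the shape of these stages concrete, here is a miniature stand-in written with the Python standard library only. It is not the benchmark code: it uses neither MapReduce nor Spark, it skips the Parquet conversion, and the column names (`user`, `status`, `bytes`) and the specific transform are hypothetical.

```python
import csv
import io
import random
from collections import defaultdict


def generate(num_rows: int, seed: int = 0) -> str:
    """Generate: create random CSV log data (columns are hypothetical)."""
    rng = random.Random(seed)
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["user", "status", "bytes"])
    for _ in range(num_rows):
        user = f"user{rng.randint(0, 4)}"
        status = rng.choice(["200", "404", "500"])
        writer.writerow([user, status, rng.randint(1, 1000)])
    return buf.getvalue()


def transform(csv_text: str) -> list:
    """Transform: update column values (here, map status codes to ok/error)."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        row["status"] = "ok" if row["status"] == "200" else "error"
        row["bytes"] = int(row["bytes"])
        rows.append(row)
    return rows


def aggregate(rows: list) -> dict:
    """Aggregate: group by status and sum the bytes column."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["status"]] += row["bytes"]
    return dict(totals)


log_csv = generate(num_rows=100)
rows = transform(log_csv)
totals = aggregate(rows)
print(totals)
```

In the real pipeline each stage is a separate job, so the data passed between these functions would instead be written to and read from S3 or Alluxio.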

Log Pipeline Environment
•  r4.2xlarge instances (61 GB RAM, 8 CPUs)
•  1 master, 3 workers
•  Apache Spark 2.2.0
•  Apache Hadoop 2.7.2
•  Alluxio 1.6.0
•  Generate 12 GB of logs
•  Compare AWS S3 vs. Alluxio with Fast Durable Writes

Average Stage Completion Time

Pipeline Completion Time
Over 9x speedup!

Alluxio and Pipelines in the Cloud
Alluxio enables in-memory sharing for data pipelines in the cloud
Alluxio’s Fast Durable Write feature increases performance without sacrificing fault tolerance

Thank you!
Gene Pang
gene@alluxio.com
Twitter: @unityxx
Social Media: Twitter.com/alluxio, Linkedin.com/alluxio
Website: www.alluxio.com
E-mail: info@alluxio.com
