Getting Started with Alluxio + Spark + S3

Alluxio (formerly Tachyon):
Getting Started with Alluxio + Spark + S3
Calvin Jia
June 15, 2016 @ Alluxio Meetup (hosted by Intel)
Related Blog Post: http://goo.gl/MUpL0O

Who Am I?
• Calvin Jia
• SWE @ Alluxio, Inc.
• Alluxio PMC Member
• Twitter: @JiaCalvin

Outline
• Technology Overview
• Alluxio + Spark + S3
• Demo

Why Alluxio?
• Data sharing between jobs
• Data resilience during application crashes
• Consolidate memory usage and alleviate GC issues

Data Sharing Between Jobs
[Diagram: storage engine and execution engine in the same process; each process keeps its own in-memory storage blocks]
• Inter-process sharing slowed down by network I/O

Data Sharing Between Jobs
[Diagram: storage and execution engine separated; blocks are held in a shared in-memory layer in front of HDFS / disk]
• Inter-process sharing can happen at memory speed

Data Resilience during Crashes
[Diagram, three animation frames: storage engine and execution engine in the same process; after the crash the in-memory storage blocks are gone]
• Process crash requires network I/O to re-read the data

[Diagram, two animation frames: storage and execution engine separated; after the crash the in-memory blocks in front of HDFS / disk are still there]
• Process crash only needs memory I/O to re-read the data

Consolidating Memory
[Diagram: storage engine and execution engine in the same process; two in-memory stores each hold their own copy of block 1 and block 3]
• Data duplicated at memory-level

Consolidating Memory
[Diagram: storage and execution engine separated; a single shared in-memory layer in front of HDFS / disk]
• Data not duplicated at memory-level

Outline
• Demo

Visualizing the Stack
• Mem: FAST, 10^4 - 10^5 MB/s, used often
• SSD: MODERATE, 10^3 - 10^4 MB/s, limited use
• HDD: SLOW, 10^2 - 10^3 MB/s, only when necessary
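
To make the bandwidth ranges on this slide concrete, here is a rough back-of-the-envelope sketch of how long one full scan of a dataset would take from each tier. The 100 GB dataset size, the decimal MB units, and the names `Tier` and `ReadTimeEstimate` are assumptions for illustration, not part of the talk.

```scala
// Bandwidth ranges in MB/s, as listed on the slide.
final case class Tier(name: String, minMBps: Double, maxMBps: Double)

object ReadTimeEstimate {
  val tiers: Seq[Tier] = Seq(
    Tier("Mem", 1e4, 1e5),
    Tier("SSD", 1e3, 1e4),
    Tier("HDD", 1e2, 1e3)
  )

  /** Returns (best case, worst case) seconds to scan `datasetMB` once from `tier`. */
  def secondsToRead(datasetMB: Double, tier: Tier): (Double, Double) =
    (datasetMB / tier.maxMBps, datasetMB / tier.minMBps)

  def main(args: Array[String]): Unit = {
    val datasetMB = 100000.0 // hypothetical 100 GB dataset (decimal units)
    tiers.foreach { t =>
      val (best, worst) = secondsToRead(datasetMB, t)
      println(f"${t.name}%-3s $best%.0f to $worst%.0f s")
    }
  }
}
```

Under these assumptions each tier is roughly an order of magnitude slower than the one above it, which is why the slowest tier is touched only when necessary.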

When to use Alluxio
• Two or more jobs access the same dataset
• Job(s) may not always succeed
• Dataset larger than Spark JVM
• Jobs are pipelined
• Resulting data does not need to be immediately persisted

Version Selection
• Alluxio 1.1.0
–Latest released version
–Many improvements, upgrade recommended
• Spark 1.6.1
–Latest released version
–Remember to use the Spark Alluxio client, i.e. build Alluxio with the -Pspark profile
–Spark 2.0 is coming out soon; we will recommend the best way to integrate it with Alluxio

API Selection
• Access data directly through the FileSystem API, but change the scheme to alluxio://
–Minimal code change
–Do not need to reason about logic
• Example:
–Before: val file = sc.textFile("s3n://my-bucket/myFile")
–After: val file = sc.textFile("alluxio://master:19998/myFile")
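
The scheme change in the example can be sketched as a small path-mapping helper. This is only an illustration: `AlluxioPaths.toAlluxio` is a hypothetical name, and it assumes the S3 bucket is the under storage of the Alluxio root directory, so that `/myFile` in the bucket appears as `/myFile` in Alluxio. The host `master` and port 19998 are taken from the slide.

```scala
import java.net.URI

object AlluxioPaths {
  /**
   * Maps an S3 URI such as s3n://my-bucket/myFile to an alluxio:// URI.
   * Assumption: the bucket is mounted as the Alluxio root, so only the
   * scheme and authority change and the path is kept as-is.
   */
  def toAlluxio(s3Uri: String, master: String = "master", port: Int = 19998): String = {
    val path = new URI(s3Uri).getPath // e.g. "/myFile"
    s"alluxio://$master:$port$path"
  }
}
```

In a Spark job the result would be passed straight to the existing call, for example `sc.textFile(AlluxioPaths.toAlluxio("s3n://my-bucket/myFile"))`, leaving the rest of the job unchanged.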

Outline
• Demo
