Using Spark with Tachyon by Gene Pang

Using Spark with Tachyon: An Open Source Memory-Centric Distributed Storage System
Gene Pang, Tachyon Nexus
gene@tachyonnexus.com
October 29, 2015 @ Spark Summit Europe

Who Am I?
• Gene Pang
• PhD from UC Berkeley AMPLab
• Software Engineer at Tachyon Nexus

• Team consists of Tachyon creators, top contributors
• Series A ($7.5 million) from Andreessen Horowitz
• Committed to Tachyon Open Source Project
• www.tachyonnexus.com

Outline
• Introduction to Tachyon
• Using Spark with Tachyon
• New Tachyon Features
• Getting Involved

History of Tachyon
• Started at UC Berkeley AMPLab
– From Summer 2012
– Same lab produced Apache Spark and Apache Mesos
• Open sourced in April 2013
– Apache License 2.0
– Latest Release: Version 0.8.0 (October 2015)
• Deployed at > 100 companies

Contributors Growth
(Chart: number of contributors at each release)
• v0.1 (Dec '12): 1
• v0.2 (Apr '13): 3
• v0.3 (Oct '13): 15
• v0.4 (Feb '14): 30
• v0.5 (Jul '14): 46
• v0.6 (Mar '15): 70
• v0.7 (Jul '15): 111

Contributors Growth
150+ Contributors
50+ Organizations

One of the Fastest
Growing Big Data
Open Source Projects

Thanks to Contributors and Users!

Open Source
Memory-Centric
Distributed Storage
System

Performance Trend: Memory is Fast
• RAM throughput increasing exponentially
• Disk throughput increasing slowly
Memory-locality is important!

Price Trend: Memory is Cheaper
source: jcmit.com

These Memory Trends are
Realized By Many…

Is the
Problem Solved?
Missing a Solution
for the Storage Layer

Tachyon enables reliable data sharing at memory-speed within and across computation frameworks/jobs

How Does Tachyon Work?
Memory-Centric Storage Architecture
Lineage in Storage Layer

Tachyon Memory-Centric
Architecture

Spark: fast and general engine for large-scale data processing
What are some potential issues?

Issue 1
Data Sharing bottleneck in analytics pipeline: Slow writes to disk
(Diagram: Spark Job1 and Spark Job2 each hold blocks 1 and 3 in their own Spark memory; HDFS / Amazon S3 holds blocks 1 to 4. Label: storage engine & execution engine same process.)

Issue 1
Data Sharing bottleneck in analytics pipeline: Slow writes to disk
(Diagram: a Spark job holds blocks 1 and 3 in Spark memory; a Hadoop MR job runs on YARN; HDFS / Amazon S3 holds blocks 1 to 4. Label: storage engine & execution engine same process.)

Issue 1 resolved with Tachyon
Memory-speed data sharing among different jobs and different frameworks
(Diagram: Spark Job (Spark mem) and Hadoop MR Job (YARN); Tachyon in-memory with blocks 1, 3, and 4; HDFS / Amazon S3 on disk with blocks 1 to 4. Label: storage engine & execution engine same process.)
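As a rough illustration of how jobs might point at the same Tachyon data, the sketch below builds URIs in the `tachyon://host:port/` form shown on the later slides. The helper name `tachyon_uri`, the host `localhost`, the port `19998`, and the file paths are assumptions made for this example, not values from the talk. The Spark calls appear only as comments, since they need a Spark installation with the Tachyon client on its classpath.

```python
def tachyon_uri(master_host, master_port, path):
    """Build a tachyon://host:port/path URI (hypothetical helper)."""
    if not path.startswith("/"):
        path = "/" + path
    return "tachyon://{}:{}{}".format(master_host, master_port, path)


# Assumed master address and file paths, for illustration only.
input_uri = tachyon_uri("localhost", 19998, "/events/input")
output_uri = tachyon_uri("localhost", 19998, "events/output")

# With Spark configured for Tachyon, one job could write its output here
# and a later job (or another framework) could read it back, e.g.:
#   lines = sc.textFile(input_uri)
#   lines.filter(lambda l: "ERROR" in l).saveAsTextFile(output_uri)
print(input_uri)
print(output_uri)
```

Because both jobs address the data by URI rather than by in-process cache, the second job does not depend on the first job's executor memory.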

Issue 2
In-Memory data loss when computation crashes
(Diagram: a Spark Task with Spark Memory, whose block manager holds blocks 1 and 3; HDFS / Amazon S3 holds blocks 1 to 4. Label: storage engine & execution engine same process.)

Issue 2
Computation crashes
(Diagram: crash; Spark Memory block manager with blocks 1 and 3; HDFS / Amazon S3 with blocks 1 to 4. Label: storage engine & execution engine same process.)

Issue 2
Computation crashes
(Diagram: crash; only HDFS / Amazon S3 with blocks 1 to 4 is shown. Label: storage engine & execution engine same process.)

Keep in-memory data safe, even when computation crashes
(Diagram: Spark Task with Spark Memory and block manager; Tachyon in-memory with blocks 1, 3, and 4; HDFS / Amazon S3 with blocks 1 to 4. Label: storage engine & execution engine same process.)

Keep in-memory data safe, even when computation crashes
(Diagram: crash; Tachyon in-memory with blocks 1, 3, and 4; HDFS / Amazon S3 on disk with blocks 1 to 4. Label: storage engine & execution engine same process.)

Issue 3
In-memory Data Duplication & Java Garbage Collection
(Diagram: Spark Job1 and Spark Job2 each hold blocks 1 and 3 in their own Spark memory; HDFS / Amazon S3 holds blocks 1 to 4. Label: storage engine & execution engine same process.)

No in-memory data duplication, much less GC
(Diagram: Spark Job1 and Spark Job2, each with Spark mem; Tachyon in-memory with blocks 1, 3, and 4; HDFS / Amazon S3 on disk with blocks 1 to 4. Label: storage engine & execution engine same process.)

Tachyon Use Case: Baidu
• Framework: SparkSQL
• Under Storage: Baidu’s File System
• Tachyon Storage Media: MEM + HDD
• 100+ Tachyon nodes
• 1PB+ Tachyon managed storage
• 30x Performance Improvement

Tachyon Use Case: An Oil Company
• Framework: Spark
• Under Storage: GlusterFS
• Tachyon Storage Media: MEM only
• Analyzing data in traditional storage

Tachyon Use Case: A SaaS Company
• Framework: Spark
• Under Storage: S3
• Tachyon Storage Media: SSD only
• Elastic Tachyon deployment

Tachyon 0.8.0 Just Released!
http://tachyon-project.org/

1. Growing Ecosystem
Use different frameworks to enable workloads on different storage

2. Tiered Storage
Tachyon manages more than DRAM
(Diagram: storage tiers MEM, SSD, HDD; upper tiers are faster, lower tiers have greater capacity.)
2. Tiered Storage
Configurable storage tiers
(Example configurations: MEM only; MEM + HDD; SSD only.)

3. Pluggable Data Management Policy
• Evict stale data to lower tier
• Promote hot data to upper tier
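The evict and promote behavior can be pictured with a toy model. The sketch below is not Tachyon's implementation: the class name `TieredStore`, the per-tier block limits, and the choice of least-recently-used (LRU) as the policy are all assumptions made for illustration. A write or read places the block in the top tier, and overflow pushes the stalest block one tier down.

```python
from collections import OrderedDict


class TieredStore:
    """Toy model of tiered block storage with an LRU policy (illustrative only)."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_blocks), fastest tier first.
        self.names = [name for name, _ in capacities]
        self.limits = [limit for _, limit in capacities]
        # One OrderedDict per tier; insertion order doubles as LRU order.
        self.tiers = [OrderedDict() for _ in capacities]

    def _remove(self, block_id):
        for tier in self.tiers:
            if block_id in tier:
                return tier.pop(block_id)
        return None

    def _insert(self, level, block_id, data):
        tier = self.tiers[level]
        tier[block_id] = data
        if len(tier) > self.limits[level]:
            # Evict the least recently used block to the next tier down.
            stale_id, stale_data = tier.popitem(last=False)
            if level + 1 < len(self.tiers):
                self._insert(level + 1, stale_id, stale_data)
            # Below the last tier the block is simply dropped in this model.

    def tier_of(self, block_id):
        for name, tier in zip(self.names, self.tiers):
            if block_id in tier:
                return name
        return None

    def write(self, block_id, data):
        self._remove(block_id)
        self._insert(0, block_id, data)

    def read(self, block_id):
        if self.tier_of(block_id) is None:
            return None
        data = self._remove(block_id)
        self._insert(0, block_id, data)  # promote hot data to the top tier
        return data


store = TieredStore([("MEM", 2), ("SSD", 2), ("HDD", 2)])
for block_id, data in [("b1", b"one"), ("b2", b"two"), ("b3", b"three")]:
    store.write(block_id, data)
evicted_to = store.tier_of("b1")      # b1 was the stalest block in MEM
store.read("b1")                      # reading b1 promotes it again
promoted_to = store.tier_of("b1")
```

A pluggable policy would replace the LRU choice in `_insert` with another rule for picking the block to evict; the tier layout itself stays the same.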

4. Transparent Naming
• Persisted Tachyon files are mapped to under storage
• Tachyon paths are preserved in under storage
(Diagram: the Tachyon tree at tachyon://host:port/, with Data (Reports, Sales) and Users (Alice, Bob), mirrors the same tree in the under storage system (HDFS, S3, …) at s3n://bucket/directory/.)
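One way to read the path-preserving mapping is as a simple prefix swap. The sketch below is a minimal illustration under that assumption; the function name `to_under_storage` is hypothetical, and the URIs reuse the placeholder forms from the slide.

```python
from urllib.parse import urlparse


def to_under_storage(tachyon_uri, ufs_root):
    """Map a Tachyon URI to the under-storage URI with the same relative path.

    Hypothetical helper: it only rewrites strings and contacts no storage.
    """
    parsed = urlparse(tachyon_uri)
    if parsed.scheme != "tachyon":
        raise ValueError("not a tachyon:// URI: " + tachyon_uri)
    # Keep the path as-is and attach it under the under-storage root.
    return ufs_root.rstrip("/") + "/" + parsed.path.lstrip("/")


mapped = to_under_storage(
    "tachyon://host:port/Data/Reports",
    "s3n://bucket/directory/",
)
print(mapped)
```

Because the relative path is unchanged, a file persisted from Tachyon can be found in the under storage by anyone who knows its Tachyon path.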

5. Unified Namespace
• Unified namespace for multiple storage systems
• Share data across storage systems
• On-the-fly mounting/unmounting
(Diagram: Tachyon at tachyon://host:port/ shows Data (Reports, Sales) and Users (Alice, Bob); Users (Alice, Bob) comes from Storage System A at hdfs://host:port/, and Reports and Sales come from Storage System B at s3n://bucket/directory/.)
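A unified namespace can be thought of as a mount table that maps Tachyon paths onto different storage systems. The sketch below is a toy model under that assumption, not Tachyon's API: the class `MountTable` and its methods are hypothetical, and the URIs are the placeholders from the slide.

```python
class MountTable:
    """Toy mount table: Tachyon path prefixes mapped to under-storage URIs."""

    def __init__(self):
        self._mounts = {}

    @staticmethod
    def _normalize(path):
        # "/Users/" -> "/Users"; the root "/" becomes the empty string.
        stripped = path.strip("/")
        return "/" + stripped if stripped else ""

    def mount(self, tachyon_path, ufs_uri):
        key = self._normalize(tachyon_path)
        if key in self._mounts:
            raise ValueError("already mounted: " + tachyon_path)
        self._mounts[key] = ufs_uri.rstrip("/")

    def unmount(self, tachyon_path):
        del self._mounts[self._normalize(tachyon_path)]

    def resolve(self, tachyon_path):
        path = self._normalize(tachyon_path)
        best = None
        for mount_point in self._mounts:
            inside = path == mount_point or path.startswith(mount_point + "/")
            # Prefer the longest matching mount point.
            if inside and (best is None or len(mount_point) > len(best)):
                best = mount_point
        if best is None:
            raise KeyError(tachyon_path)
        return self._mounts[best] + path[len(best):]


table = MountTable()
table.mount("/Users", "hdfs://host:port/Users")      # Storage System A
table.mount("/Data", "s3n://bucket/directory/")      # Storage System B
alice = table.resolve("/Users/Alice")
reports = table.resolve("/Data/Reports")
```

Calling `table.unmount("/Data")` removes that mapping while `/Users` keeps resolving, which is the sense in which mounting and unmounting happen on the fly in this model.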

Additional Features
• Remote Write Support
• Easy deployment with Mesos and YARN
• Initial Security Support
• One Command Cluster Deployment
• Metrics for Clients/Workers/Master

Welcome users and collaborators!
Memory-Centric Distributed
Storage System

Try Tachyon: http://tachyon-project.org
Develop Tachyon: https://github.com/amplab/tachyon
Meet Friends: http://www.meetup.com/Tachyon
Tachyon Nexus: http://www.tachyonnexus.com
Email: gene@tachyonnexus.com
Thank you!
