Using Spark with Tachyon: An
Open Source Memory-Centric
Distributed Storage System
Gene Pang, Tachyon Nexus
gene@tachyonnexus.com
October 29, 2015 @ Spark Summit Europe
Who Am I?
• Gene Pang
• PhD from UC Berkeley AMPLab
• Software Engineer at Tachyon Nexus
• Team consists of Tachyon creators, top contributors
• Series A ($7.5 million) from Andreessen Horowitz
• Committed to Tachyon Open Source Project
• www.tachyonnexus.com
Outline
• Introduction to Tachyon
• Using Spark with Tachyon
• New Tachyon Features
• Getting Involved
History of Tachyon
• Started at UC Berkeley AMPLab
– From Summer 2012
– Same lab produced Apache Spark and Apache Mesos
• Open sourced in April 2013
– Apache License 2.0
– Latest Release: Version 0.8.0 (October 2015)
• Deployed at > 100 companies
Contributors Growth
[Chart: number of contributors at each release]
• v0.1 (Dec'12): 1
• v0.2 (Apr'13): 3
• v0.3 (Oct'13): 15
• v0.4 (Feb'14): 30
• v0.5 (Jul'14): 46
• v0.6 (Mar'15): 70
• v0.7 (Jul'15): 111
Contributors Growth
150+ Contributors
50+ Organizations
One of the Fastest
Growing Big Data
Open Source Projects
Thanks to Contributors and Users!
Reported Tachyon Usage
What is Tachyon?
Open Source
Memory-Centric
Distributed Storage
System
Tachyon Stack
Why Use Tachyon?
Performance Trend: Memory is Fast
• RAM throughput increasing exponentially
• Disk throughput increasing slowly
Memory-locality is important!
Price Trend: Memory is Cheaper
source: jcmit.com
These Memory Trends are
Realized By Many…
Is the
Problem Solved?
Missing a Solution
for the Storage Layer
Tachyon enables reliable data sharing at memory speed within and across computation frameworks and jobs
How Does Tachyon Work?
Memory-Centric Storage Architecture
Lineage in Storage Layer
Tachyon Memory-Centric
Architecture
Lineage in Tachyon
Outline
• Introduction to Tachyon
• Using Spark with Tachyon
• New Tachyon Features
• Getting Involved
Apache Spark: fast and general engine for large-scale data processing
What are some potential
issues?
Issue 1
Data sharing bottleneck in analytics pipeline: slow writes to disk
[Diagram: Spark Job1 and Spark Job2 each keep blocks 1 and 3 in their own Spark memory, where storage engine and execution engine run in the same process; blocks 1 to 4 are stored in HDFS / Amazon S3.]
Issue 1
Data sharing bottleneck in analytics pipeline: slow writes to disk
[Diagram: a Spark job keeps blocks 1 and 3 in Spark memory (storage engine and execution engine in the same process) next to a Hadoop MR job on YARN; blocks 1 to 4 are stored in HDFS / Amazon S3.]
Issue 1 resolved with Tachyon
Memory-speed data sharing among different jobs and different frameworks
[Diagram: a Spark job (Spark mem) and a Hadoop MR job (YARN) sit on top of Tachyon, which holds blocks 1, 3 and 4 in memory; blocks 1 to 4 are stored on disk in HDFS / Amazon S3.]
Issue 2
In-memory data loss when computation crashes
[Diagram, three steps: (1) a Spark task keeps blocks 1 and 3 in the Spark memory block manager, with storage engine and execution engine in the same process, while blocks 1 to 4 are stored in HDFS / Amazon S3; (2) the process crashes; (3) only the blocks in HDFS / Amazon S3 remain.]
Issue 2 resolved with Tachyon
Keep in-memory data safe, even when computation crashes
[Diagram, two steps: (1) a Spark task with its Spark memory block manager runs on top of Tachyon, which holds blocks 1, 3 and 4 in memory, while blocks 1 to 4 are stored in HDFS / Amazon S3; (2) the Spark process crashes and the blocks in Tachyon and in HDFS / Amazon S3 remain.]
Issue 3
In-memory data duplication & Java garbage collection
[Diagram: Spark Job1 and Spark Job2 each keep their own copies of blocks 1 and 3 in Spark memory (storage engine and execution engine in the same process); blocks 1 to 4 are stored in HDFS / Amazon S3.]
Issue 3 resolved with Tachyon
No in-memory data duplication, much less GC
[Diagram: Spark Job1 and Spark Job2 (Spark mem) read from a single in-memory copy of blocks 1, 3 and 4 in Tachyon; blocks 1 to 4 are stored on disk in HDFS / Amazon S3.]
Tachyon Use Case: Baidu
• Framework: SparkSQL
• Under Storage: Baidu’s File System
• Tachyon Storage Media: MEM + HDD
• 100+ Tachyon nodes
• 1PB+ Tachyon managed storage
• 30x Performance Improvement
Tachyon Use Case: An Oil Company
• Framework: Spark
• Under Storage: GlusterFS
• Tachyon Storage Media: MEM only
• Analyzing data in traditional storage
Tachyon Use Case: A SaaS Company
• Framework: Spark
• Under Storage: S3
• Tachyon Storage Media: SSD only
• Elastic Tachyon deployment
Outline
• Introduction to Tachyon
• Using Spark with Tachyon
• New Tachyon Features
• Getting Involved
Tachyon 0.8.0 Just Released!
http://tachyon-project.org/
1. Growing Ecosystem
Use different frameworks to enable workloads on different storage
2. Tiered Storage
Tachyon manages more than DRAM
• Tiers: MEM, SSD, HDD
• Upper tiers are faster; lower tiers have greater capacity
2. Tiered Storage
Configurable storage tiers
• MEM only
• MEM + HDD
• SSD only
3. Pluggable Data Management Policy
• Evict stale data to lower tier
• Promote hot data to upper tier
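The evict and promote behaviour can be sketched with a toy model. This is only an illustration under stated assumptions, not Tachyon's implementation: the class `TieredStore`, its methods, the per-tier block limits, and the least-recently-used policy are all hypothetical choices made for the example.

```python
from collections import OrderedDict


class TieredStore:
    """Toy model of tiered block storage (hypothetical, not Tachyon code)."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_blocks), fastest tier first.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]

    def _insert(self, level, block_id, data):
        _, cap, blocks = self.tiers[level]
        blocks[block_id] = data
        blocks.move_to_end(block_id)  # most recently used block goes last
        if len(blocks) > cap:
            # Evict the least recently used (stale) block to the next lower tier.
            stale_id, stale_data = blocks.popitem(last=False)
            if level + 1 < len(self.tiers):
                self._insert(level + 1, stale_id, stale_data)
            # A block evicted from the lowest tier is dropped in this toy model.

    def write(self, block_id, data):
        # Remove any older copy so a block lives in exactly one tier.
        for _, _, blocks in self.tiers:
            blocks.pop(block_id, None)
        self._insert(0, block_id, data)

    def read(self, block_id):
        for _, _, blocks in self.tiers:
            if block_id in blocks:
                data = blocks.pop(block_id)
                self._insert(0, block_id, data)  # promote hot data to the top tier
                return data
        return None

    def tier_of(self, block_id):
        for name, _, blocks in self.tiers:
            if block_id in blocks:
                return name
        return None


# Assumed tier sizes, chosen only to make eviction visible.
store = TieredStore([("MEM", 2), ("SSD", 2), ("HDD", 4)])
for i in range(1, 5):
    store.write(f"block {i}", b"data")
```

After the four writes, `store.tier_of("block 1")` reports the SSD tier because newer blocks pushed it out of MEM; reading it with `store.read("block 1")` moves it back to MEM and pushes the least recently used MEM block down. A pluggable policy would replace the LRU choice inside `_insert`.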
4. Transparent Naming
• Persisted Tachyon files are mapped to under storage
• Tachyon paths are preserved in under storage
[Diagram: the tree under tachyon://host:port/ (Data: Reports, Sales; Users: Alice, Bob) appears with the same layout under s3n://bucket/directory/ in the under storage system (HDFS, S3, …).]
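A minimal sketch of the path mapping, assuming the two roots shown on the slide. The helper `to_under_storage_path` is a hypothetical name for illustration and is not part of any Tachyon API; it only shows that the relative path is kept unchanged when a file is persisted.

```python
def to_under_storage_path(tachyon_path, tachyon_root, under_root):
    """Map a Tachyon path to the under-storage path with the same relative layout."""
    if not tachyon_path.startswith(tachyon_root):
        raise ValueError(f"{tachyon_path} is not under {tachyon_root}")
    relative = tachyon_path[len(tachyon_root):].lstrip("/")
    return under_root.rstrip("/") + "/" + relative


# Placeholder roots copied from the slide; host, port and bucket are not real values.
TACHYON_ROOT = "tachyon://host:port/"
UNDER_ROOT = "s3n://bucket/directory/"

print(to_under_storage_path("tachyon://host:port/Data/Reports", TACHYON_ROOT, UNDER_ROOT))
# prints s3n://bucket/directory/Data/Reports
```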
5. Unified Namespace
• Unified namespace for multiple storage systems
• Share data across storage systems
• On-the-fly mounting/unmounting
[Diagram: tachyon://host:port/ exposes Data (Reports, Sales) and Users (Alice, Bob); Users (Alice, Bob) comes from Storage System A at hdfs://host:port/, and Reports and Sales come from Storage System B at s3n://bucket/directory/.]
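One way to picture a unified namespace is a mount table that maps namespace prefixes to under-storage locations. The sketch below is a toy model, not Tachyon's API: the class `MountTable`, its method names, and the exact mount points (`/Users` and `/Data`) are assumptions read off the slide's diagram.

```python
class MountTable:
    """Toy mount table mapping namespace prefixes to under-storage URIs (hypothetical)."""

    def __init__(self):
        self._mounts = {}

    def mount(self, mount_point, under_uri):
        self._mounts[mount_point.rstrip("/")] = under_uri.rstrip("/")

    def unmount(self, mount_point):
        del self._mounts[mount_point.rstrip("/")]

    def resolve(self, path):
        # Pick the longest mount point that is a path-component prefix of `path`.
        best = None
        for mount_point in self._mounts:
            if path == mount_point or path.startswith(mount_point + "/"):
                if best is None or len(mount_point) > len(best):
                    best = mount_point
        if best is None:
            raise KeyError(f"no mount covers {path}")
        return self._mounts[best] + path[len(best):]


# Placeholder URIs from the slide; the mount points are assumed for illustration.
table = MountTable()
table.mount("/Users", "hdfs://host:port/Users")
table.mount("/Data", "s3n://bucket/directory")

print(table.resolve("/Users/Alice"))   # prints hdfs://host:port/Users/Alice
print(table.resolve("/Data/Reports"))  # prints s3n://bucket/directory/Reports
```

Calling `table.unmount("/Data")` removes that storage system from the namespace at runtime, after which `table.resolve("/Data/Sales")` raises `KeyError` while paths under `/Users` still resolve.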
Additional Features
• Remote write support
• Easy deployment with Mesos and YARN
• Initial security support
• One-command cluster deployment
• Metrics for clients, workers, and master
Outline
• Introduction to Tachyon
• Using Spark with Tachyon
• New Tachyon Features
• Getting Involved
Welcome users and collaborators!
Memory-Centric Distributed
Storage System
Try Tachyon: http://tachyon-project.org
Develop Tachyon: https://github.com/amplab/tachyon
Meet Friends: http://www.meetup.com/Tachyon
Tachyon Nexus: http://www.tachyonnexus.com
Email: gene@tachyonnexus.com
Thank you!
