Tachyon: memory-centric, fault-tolerant
storage for cluster frameworks
presented by Viet-Trung Tran
Memory is King
• RAM throughput increasing exponentially
• Disk throughput increasing slowly
Memory locality is key to interactive response times
Memory as cache
• Improves reads
• Cannot help much with writes
  • Replication is needed for fault tolerance
  • Network bandwidth and latency are much worse than those of memory
  • Write throughput is limited by disk I/O
  • At least one copy is required on disk
• Inter-job data sharing cost dominates end-to-end pipeline latency
  • 34% of jobs produce output as large as their input (Cloudera survey)
Different jobs share data
Slow writes to disk
[Figure: two Spark tasks, each with its own Spark memory block manager (blocks 1 and 3), share data through HDFS / Amazon S3 (blocks 1 to 4). Storage engine and execution engine run in the same process (slow writes).]
Different frameworks share data
Slow writes to disk
[Figure: a Spark task (Spark memory block manager, blocks 1 and 3) and Hadoop MR on YARN share data through HDFS / Amazon S3 (blocks 1 to 4). Storage engine and execution engine run in the same process (slow writes).]
Tachyon: reliable data sharing at memory speed
within and across frameworks/jobs
[Figure: Tachyon sits between compute frameworks (Spark, MapReduce, Spark SQL, H2O, GraphX, Impala, …) and storage systems (HDFS, S3, GlusterFS, OrangeFS, NFS, Ceph, …).]
Challenges
How to achieve reliable data sharing without replication?
Target workload properties
• Immutable data
• Deterministic jobs
• Locality based scheduling
• All data vs working set
• Program size vs data size
System architecture
Consists of two layers
• Lineage
  • Delivers high-throughput I/O
  • Captures the sequence of jobs/tasks that create an output
• Persistence
  • Asynchronous checkpoints
Facts
• One data copy in memory
• Recomputation for fault tolerance
Memory-Centric Storage Architecture
Master Node
• Similar to HDFS and GFS
  • Passive standby model
• BUT also contains a workflow manager
  • Tracks lineage information
  • Computes checkpoint order
  • Interacts with the cluster resource manager to allocate resources for recomputations
Lineage
More complex lineage
Lineage metadata
• Binary program
• Configuration
• Input Files List
• Output Files List
• Dependency Type
  • Narrow (filter, map)
  • Wide (shuffle, join)
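The fields listed above could be held in one small record per job. The following is a minimal sketch, not Tachyon's actual data structure: the names `LineageRecord` and `DependencyType`, and the example paths and settings, are hypothetical and chosen only to mirror the slide.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class DependencyType(Enum):
    NARROW = "narrow"  # e.g. filter, map
    WIDE = "wide"      # e.g. shuffle, join


@dataclass
class LineageRecord:
    """Hypothetical record of what is needed to re-run one job."""
    binary_program: str            # path or identifier of the job binary
    configuration: Dict[str, str]  # settings the job ran with
    input_files: List[str]
    output_files: List[str]
    dependency_type: DependencyType


# Illustrative values only.
record = LineageRecord(
    binary_program="jobs/word_count.jar",
    configuration={"reducers": "4"},
    input_files=["/data/pages"],
    output_files=["/data/counts"],
    dependency_type=DependencyType.WIDE,
)
```

Storing the program and its configuration next to the input and output lists is what makes re-running the job possible when an output is lost.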
Fault-recovery by recomputations
• Challenges
  • Bounding the recomputation cost for a long-running storage system
    • Asynchronous checkpointing
  • Allocating resources for recomputations
    • Make sure recomputation tasks get enough resources
    • Do not impact system performance (task priorities)
• Assumptions
  • Input files are immutable
  • Job executions are deterministic
  • Client-side caching mitigates read hotspots
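Recovery by recomputation can be illustrated with a toy in-process store. This is a sketch under stated assumptions, not Tachyon's implementation: `LineageStore` and its methods are hypothetical names, a Python callable stands in for the job binary, and the jobs are assumed deterministic with immutable inputs, as the slide requires.

```python
from typing import Callable, Dict, List, Tuple

# A job maps the contents of its input files to the contents of one output.
Job = Callable[[List[bytes]], bytes]


class LineageStore:
    """Toy store: one in-memory copy, lost data is rebuilt from lineage."""

    def __init__(self) -> None:
        self.memory: Dict[str, bytes] = {}       # single in-memory copy
        self.checkpoints: Dict[str, bytes] = {}  # stands in for durable storage
        self.lineage: Dict[str, Tuple[Job, List[str]]] = {}
        self.recomputed: List[str] = []          # log of recovered files

    def write(self, name: str, job: Job, inputs: List[str]) -> bytes:
        """Run a job, keep its output in memory and remember how it was made."""
        data = job([self.read(i) for i in inputs])
        self.memory[name] = data
        self.lineage[name] = (job, list(inputs))
        return data

    def read(self, name: str) -> bytes:
        if name in self.memory:
            return self.memory[name]
        if name in self.checkpoints:
            self.memory[name] = self.checkpoints[name]
            return self.memory[name]
        if name not in self.lineage:
            raise KeyError(f"{name} is lost and has no lineage")
        # Lost file: recover its inputs first (recursively), then re-run the job.
        job, inputs = self.lineage[name]
        data = job([self.read(i) for i in inputs])
        self.memory[name] = data
        self.recomputed.append(name)
        return data


store = LineageStore()
store.checkpoints["raw"] = b"3 1 2"
store.write("sorted", lambda ins: b" ".join(sorted(ins[0].split())), ["raw"])
store.write("first", lambda ins: ins[0].split()[0], ["sorted"])

store.memory.clear()          # simulate losing every in-memory copy
result = store.read("first")  # rebuilds "sorted", then "first"
```

Without checkpoints the recursion could walk back through an arbitrarily long chain of jobs, which is why the recomputation cost has to be bounded.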
Asynchronous checkpointing
• Goals
  • Bounded recomputation time
  • Checkpoint hot files
  • Avoid checkpointing temporary files
• Edge algorithm
  • Models the relationships between files as a DAG
  • Vertices are files
  • There is an edge from A to B if B is generated by a job that reads A
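The DAG described above can be built directly from each job's input and output file lists. The sketch below is illustrative only; `build_file_dag` and `leaves` are hypothetical helper names, and the file names are made up.

```python
from typing import Dict, Iterable, List, Set, Tuple


def build_file_dag(
    jobs: Iterable[Tuple[List[str], List[str]]]
) -> Dict[str, Set[str]]:
    """Map each file to the set of files generated by jobs that read it."""
    children: Dict[str, Set[str]] = {}
    for inputs, outputs in jobs:
        for out in outputs:
            children.setdefault(out, set())  # every file is a vertex
        for src in inputs:
            children.setdefault(src, set()).update(outputs)  # edge src -> out
    return children


def leaves(dag: Dict[str, Set[str]]) -> Set[str]:
    """Files that no job has read yet, i.e. vertices with no outgoing edge."""
    return {f for f, kids in dag.items() if not kids}


# Job 1 reads A and writes B; job 2 reads B and writes C and D.
dag = build_file_dag([(["A"], ["B"]), (["B"], ["C", "D"])])
leaf_files = leaves(dag)
```

Here `A` has an edge to `B`, `B` has edges to `C` and `D`, and the leaves are `C` and `D`.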
Edge algorithm
• Checkpoint leaves
• Checkpoint hot files
  • Most files are accessed fewer than 3 times (Yahoo survey of big data workloads)
  • Thus, files accessed more than twice get checkpointed
• Dealing with large datasets
  • The data of 96% of active jobs fits in the cluster memory
  • Synchronously write datasets above a defined threshold to disk
  • Most in-memory files that have been checkpointed can be evicted from memory to make room
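The selection rule on this slide (leaves plus hot files) can be sketched as a small function. This is an illustration under assumptions, not the published algorithm: `select_checkpoints` is a hypothetical name, the DAG is given as a plain dictionary from file to the files derived from it, and the default threshold of 2 only encodes "accessed more than twice".

```python
from typing import Dict, Iterable, Set


def select_checkpoints(
    children: Dict[str, Set[str]],
    access_counts: Dict[str, int],
    already_checkpointed: Iterable[str] = (),
    hot_threshold: int = 2,
) -> Set[str]:
    """Pick files to checkpoint: DAG leaves plus files read more than twice."""
    leaf_files = {f for f, kids in children.items() if not kids}
    hot_files = {f for f, n in access_counts.items() if n > hot_threshold}
    return (leaf_files | hot_files) - set(already_checkpointed)


# A -> B -> C, where A has been read three times and B once.
dag = {"A": {"B"}, "B": {"C"}, "C": set()}
to_checkpoint = select_checkpoints(dag, {"A": 3, "B": 1})
```

In this example `C` is chosen because it is a leaf and `A` because it is hot, while the intermediate file `B` is skipped, which matches the goal of not checkpointing temporary files.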
Resource allocation
• Depends on the scheduling policy of the running cluster
• Requirements
  • Priority compatibility
  • Resource sharing
  • Avoid cascading recomputation
  • Choose the best recomputation order
• Most common policies
  • Priority-based
  • Weighted fair sharing
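One possible way to meet the priority compatibility requirement under a priority-based policy is priority inheritance. The slides do not spell the rule out, so the sketch below is an assumption: a recomputation runs at the lowest priority until some job asks for its output, and then takes the highest priority among the requesting jobs. `LOWEST_PRIORITY` and `recomputation_priority` are hypothetical names.

```python
from typing import Iterable

LOWEST_PRIORITY = 0  # assumed convention: larger number means more urgent


def recomputation_priority(requesting_job_priorities: Iterable[int]) -> int:
    """Priority for a recomputation, inherited from the jobs waiting on it."""
    return max(requesting_job_priorities, default=LOWEST_PRIORITY)


idle = recomputation_priority([])            # nobody is waiting for the file
urgent = recomputation_priority([2, 5, 1])   # inherits the highest requester
```

Under this assumption a recomputation never competes with regular jobs unless a job of at least that priority is blocked on the lost file.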
Priority based scheduler
Fair sharing based scheduler
Evaluation
• 110x faster than MemHDFS
• 4x faster in realistic jobs
• 3.8x faster in case of failure
• Recovers from master failure within 1 second
• Reduces replication-caused network traffic by up to 50%
• Recomputation impact is less than 1.6%
