Cassandra - A Decentralized Structured Storage
System
Avinash Lakshman, Prashant Malik (Facebook), ACM SIGOPS
(2010)
Jiacheng & Varad
Dept. of Computer Science,
Donald Bren School of Information and Computer Sciences,
UC Irvine.
jiachenf@uci.edu, vmeru@ics.uci.edu
Outline
● Introduction
● Data Model
● API
● System Architecture
○ Partitioning
○ Replication
○ Membership
○ Bootstrapping and Scaling
○ Local Persistence
● Practical Experiences
● Conclusion
INTRODUCTION
The need!
Why Cassandra?
● Facebook
○ Requirements -
■ Performance.
■ Reliability.
■ Efficiency.
■ Support for continuous growth.
■ Fail-friendly.
○ Application
■ Facebook’s Inbox Search (old: since moved to HBase [1])
● 2008: 100 Mn users; 2010: 250 Mn
● 2014: 1.35 Bn users; 864 Mn daily active users on average for September [2]
[1] Storage Infrastructure Behind Facebook Messages: Using HBase at Scale: http://on.fb.me/1ok9SnC | [2] http://newsroom.fb.com/company-info/
Related Work
● Ficus, Coda
○ High availability with specialized conflict resolution for updates. (A-P)
● Farsite
○ Serverless, distributed file system without trust.
● Google File System
○ Single master - simple design. Made fault tolerant with Chubby.
● Bayou
○ Distributed Relational Database with eventual consistency and partition
tolerance. (A-P)
● Dynamo
○ Structured overlay network with 1-hop request routing. Gossip based
information sharing.
● Bigtable
○ GFS based system.
Data Model
Data Model
● Multi-dimensional map (see the sketch after this list).
○ key - string, generally 16-36 bytes long. No size restrictions.
● Column Family - group of columns.
○ Simple column family
○ Super - column family of column families.
○ Column order: sorted by timestamp or by name. (Time sorting helps in Inbox
Search)
● Column Access
○ Simple column
■ column_family:column
○ Super column
■ column_family:super_column:column
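As a rough illustration only (a Python sketch, not Cassandra's actual storage layout; the keys, column names, and timestamps below are made up), the model can be pictured as nested maps, with a super column family adding one extra level of nesting:

# Illustrative nested-map view of the data model (not Cassandra's on-disk format).
# Simple column family: row_key -> {column_name: (value, timestamp)}
user_cf = {
    "vmeru": {
        "fname": ("Varad", 1352258473),
        "email": ("vmeru@uci.edu", 1352258474),
    },
}

# Super column family: row_key -> {super_column: {column_name: (value, timestamp)}}
term_search_cf = {
    "user_42": {
        "hello": {"msg_001": ("", 1352258475), "msg_007": ("", 1352258476)},
    },
}

# Access paths mirror column_family:column and column_family:super_column:column.
print(user_cf["vmeru"]["email"][0])              # -> vmeru@uci.edu
print(list(term_search_cf["user_42"]["hello"]))  # -> ['msg_001', 'msg_007']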
Data Model (contd)
API
API
● Three simple methods
○ insert(table, key, rowMutation)
[default@keyspace] set User['vmeru']['fname'] = 'Varad';
[default@keyspace] set User['vmeru'][ascii('email')] = 'vmeru@uci.edu';
cqlsh> INSERT INTO users (firstname, lastname, age, email, city)
VALUES ('John', 'Smith', 46, 'johnsmith@email.com', 'Sacramento');
○ get(table, key, columnName)
[default@keyspace] get User['vmeru'];
=> (column=656d61696c, value=vmeru@uci.edu, timestamp=135225847342)
cqlsh:demo> SELECT * FROM users WHERE lastname = 'Doe';
○ delete(table, key, columnName)
[default@keyspace] del User['vmeru'];
row removed.
cqlsh:demo> DELETE FROM users WHERE lastname = 'Doe';
System Architecture
Over to Jiacheng :)
Requirements
● Scalability:
○ Continuous growth of the platform
● Reliability:
○ Treat failure as norm
System Architecture
Characteristics
● Partitioning
● Replication
● Membership
● Failure Handling
● Scaling
Partitioning
● Purpose:
○ The ability to dynamically partition the data over the set
of nodes.
○ The ability to scale incrementally.
● Consistent hashing using an order preserving hash function
Consistent Hashing
Basics :-
● Assigns each node and key an identifier using a base hash function.
● The hash space is large, and is treated as if it wraps around to form a
circle - hence hash ring. Identifiers are ordered on an identifier circle
modulo M.
● Each data item identified by a key is assigned to a node by hashing the
data item’s key to yield its position on the ring, and then walking the ring
clockwise to find the first node with a position larger than the item’s
position (see the sketch after this list).
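A minimal sketch of this placement rule (Python, illustrative only; the 32-bit ring size and MD5 base hash are assumptions for the example, and load balancing and virtual positions are ignored):

import bisect
import hashlib

RING_BITS = 32  # assumed small hash space; real systems use a much larger one

def h(x: str) -> int:
    # Base hash function mapping both nodes and keys onto the identifier circle.
    return int(hashlib.md5(x.encode()).hexdigest(), 16) % (2 ** RING_BITS)

class Ring:
    def __init__(self, nodes):
        # Each node gets one position on the ring from the base hash function.
        self.positions = sorted((h(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        # Walk clockwise: first node whose position is at or past the key's
        # position, wrapping around to the start of the ring if necessary.
        pos = h(key)
        idx = bisect.bisect_left(self.positions, (pos, ""))
        return self.positions[idx % len(self.positions)][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.node_for("vmeru"))  # the key is served by its clockwise successor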
Consistent Hashing
Example -
● The ring has 10 nodes and stores five keys.
● The successor of identifier 10 is node 14, so
key 10 would be located at node 14.
● If a node were to join with identifier 26, it
would capture the key with identifier 24
from the node with identifier 32
The principal advantage of consistent hashing is that the departure or arrival
of a node only affects its immediate neighbors; other nodes remain unaffected.
Consistent Hashing
The basic consistent hashing algorithm presents some challenges.
1. First, the random position assignment of each node on the ring leads to non-uniform data and load distribution.
2. Second, the basic algorithm is oblivious to the heterogeneity in the performance
of nodes.
There are two solutions
1. Nodes can be assigned to multiple positions in the circle
2. Analyze load information on the ring and have lightly loaded nodes move on the
ring to alleviate heavily loaded nodes (chosen)
Order preserving hash function
If keys are given in some order a1, a2, ..., an, then for any keys aj and ak, j < k implies F(aj) < F(ak).
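A toy order-preserving function (a sketch, not Cassandra's OrderPreservingPartitioner; the fixed 16-byte ASCII key length is an assumption): padding keys to a fixed length and reading the bytes as one big-endian integer keeps lexicographic key order on the ring.

MAX_LEN = 16  # assumed maximum key length for the toy example

def order_preserving_hash(key: str) -> int:
    # Keys are assumed to be at most MAX_LEN ASCII characters. Pad with NUL
    # bytes so shorter keys sort before longer keys with the same prefix,
    # then interpret the padded bytes as one big-endian integer.
    raw = key.encode("ascii")
    assert len(raw) <= MAX_LEN
    return int.from_bytes(raw.ljust(MAX_LEN, b"\x00"), "big")

keys = ["apple", "banana", "cherry"]
hashes = [order_preserving_hash(k) for k in keys]
assert hashes == sorted(hashes)  # lexicographic order is preserved on the ring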
Replication
Purpose: achieve high availability and durability
● Each data item is replicated at N hosts, where N is the
replication factor
● Each key, k, is assigned to a coordinator node. The
coordinator is in charge of the replication of the data items
that fall within its range.
Replication
Rack Unaware
● The non-coordinator replicas are chosen by picking the N-1 successors of the
coordinator on the ring (see the placement sketch below)
● Cassandra elects a leader amongst its nodes using a system called Zookeeper
● All nodes contact the leader when they join the cluster
● The leader tells them which ranges they are replicas for, maintaining the invariant
that no node is responsible for more than N-1 ranges in the ring
● The metadata is cached locally and in Zookeeper, so a node that crashes and comes
back up knows which ranges it was responsible for
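Continuing the ring sketch from the Consistent Hashing slide (illustrative only; N = 3 is an assumed replication factor), rack-unaware placement is simply "coordinator plus the next N-1 distinct nodes clockwise":

import bisect

REPLICATION_FACTOR = 3  # N, assumed for the example

def replicas_for(ring, key):
    # ring is the Ring object (and h the hash) from the consistent-hashing sketch.
    # The coordinator is the key's clockwise successor; the remaining N-1
    # replicas are the next distinct nodes walking clockwise around the ring.
    pos = h(key)
    idx = bisect.bisect_left(ring.positions, (pos, ""))
    replicas = []
    i = idx
    while len(replicas) < min(REPLICATION_FACTOR, len(ring.positions)):
        node = ring.positions[i % len(ring.positions)][1]
        if node not in replicas:
            replicas.append(node)
        i += 1
    return replicas  # replicas[0] is the coordinator for this key's range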
Replication
Scheme
● Cassandra is configured such that each row is replicated
across multiple data centers.
● Data centers are connected through high speed network
links.
This scheme allows Cassandra to handle entire data center failures without
any outage.
Failure Detection
Purpose
● Locally determine if any other node in the system is up or down
● Avoid attempts to communicate with unreachable nodes during various
operations.
Modified version of the Φ Accrual Failure Detector:
● Does not emit a Boolean value stating that a node is up or down
● Emits a value that represents a suspicion level for each of the monitored nodes.
Failure Detection
● Every node in the system maintains a sliding window of inter-arrival
times of gossip messages from other nodes in the cluster.
● The distribution of these inter-arrival times is determined and Φ is
calculated
● Assuming that we decide to suspect a node A when Φ = 1, the
likelihood that we will make a mistake is about 10%. The likelihood is
about 1% with Φ = 2, 0.1% with Φ = 3, and so on.
Approximate Distribution
● Gaussian distribution
● Exponential distribution (chosen; see the sketch below)
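Under the exponential approximation, Φ for a node reduces to a simple function of the time since its last gossip heartbeat relative to the mean inter-arrival time. A sketch (illustrative only, not Cassandra's implementation; the window size is an assumed parameter):

import math
from collections import deque

WINDOW = 1000  # assumed size of the sliding window of inter-arrival samples

class PhiAccrualDetector:
    def __init__(self):
        self.intervals = deque(maxlen=WINDOW)
        self.last_heartbeat = None

    def heartbeat(self, now: float):
        # Record the inter-arrival time of each gossip message from the node.
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        # Exponential model: P(no heartbeat for time t) = exp(-t / mean),
        # so Phi = -log10(P) = t / (mean * ln(10)).
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        t = now - self.last_heartbeat
        return t / (mean * math.log(10))

# Phi = 1 -> roughly a 10% chance the suspicion is wrong, Phi = 2 -> ~1%, and so on.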
Bootstrapping
● When a node starts for the first time, it chooses a random
token for its position in the ring.
● For fault tolerance, the mapping is persisted to disk locally
and also in Zookeeper.
● The token information is then gossiped around the cluster.
This is how we know about all nodes and their respective
positions in the ring. This enables any node to route a request
for a key to the correct node in the cluster.
Scaling the Cluster
When a new node is added into the system, it gets assigned a token such that
it can alleviate a heavily loaded node.
● Bootstrap algorithm is initiated from any other node in the system
● The node giving up the data streams it over to the new node
using kernel-kernel copy techniques.
● Operational experience has shown that data can be transferred at the
rate of 40 MB/sec from a single node.
● This is being improved by having multiple replicas take part in the bootstrap
transfer, thereby parallelizing the effort, similar to BitTorrent.
Local Persistence
● A typical write operation involves a write into a commit log for durability and
recoverability, followed by an update of an in-memory data structure.
● The write into the in-memory data structure is performed only after a successful
write into the commit log
● When the in-memory data structure crosses a certain threshold, it dumps itself to
disk
● Over time many such files accumulate on disk, and a merge process runs in the
background to collate the different files into one file (see the write-path sketch below).
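The write path in miniature (a Python sketch under simplifying assumptions, not Cassandra's code; the JSON commit-log format, the 4-row flush threshold, and the in-memory "files" are all stand-ins): log first, then update the memtable, flush when the threshold is crossed, and merge the flushed files in the background.

import json

MEMTABLE_THRESHOLD = 4  # assumed flush threshold (rows) for the toy example

class ToyStore:
    def __init__(self, commit_log_path):
        self.commit_log = open(commit_log_path, "a")
        self.memtable = {}   # in-memory structure, key-sorted on flush
        self.sstables = []   # stand-ins for immutable on-disk files

    def write(self, key, columns):
        # Durability first: the commit log append must succeed before the
        # in-memory update is applied.
        self.commit_log.write(json.dumps({"key": key, "columns": columns}) + "\n")
        self.commit_log.flush()
        self.memtable.setdefault(key, {}).update(columns)
        if len(self.memtable) >= MEMTABLE_THRESHOLD:
            self.flush()

    def flush(self):
        # Dump the memtable to an immutable, key-sorted "file".
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def compact(self):
        # Background merge: collate the many flushed files into one,
        # later files taking precedence over earlier ones.
        merged = {}
        for table in self.sstables:
            for key, columns in table:
                merged.setdefault(key, {}).update(columns)
        self.sstables = [sorted(merged.items())]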
Local Persistence
Speed Up
● Bloom Filter: avoids lookups into files that do not contain the key (see the sketch below)
● Column Indices: prevent scanning of every column on disk by allowing a jump to the
right chunk on disk for column retrieval
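A minimal Bloom filter sketch showing how a read can skip files that definitely lack the key (the filter size, hash count, and MD5-based hashing are illustrative assumptions, not Cassandra's actual parameters):

import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit array packed into one integer

    def _positions(self, key: str):
        # Derive num_hashes bit positions for the key.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        # False means the file definitely lacks the key; True means "maybe".
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
bf.add("vmeru")
assert bf.might_contain("vmeru")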
Practical Experiences
Practical Experiences
● Use of MapReduce jobs to index the inbox data
○ 7 TB data.
○ A special back channel lets the MapReduce process aggregate the reverse index per user and
send over the serialized data.
● Atomic operations
● Failure detection is difficult
○ Initial: 2 minutes for a 100-node setup. Later, 15 seconds.
● Integrated with Ganglia for monitoring.
● Some coordination with Zookeeper.
Facebook Inbox Search
● Per user index of all the messages
● Two search features: Two column families.
○ Term Search
■ <user_id, term> : the term (word) becomes the super-column.
■ Individual message identifiers that contain the word become columns.
○ Interactions
■ <user_id, recipients_user_id> : recipients_user_id is the super-column.
■ Individual message identifiers are the columns.
● The user’s index is cached as soon as he/she clicks the search bar.
● Current: 50 TB on a 150-node cluster (East/West coast data centers).
Questions?
One Size does not fit all
Source: http://tcrn.ch/1x3RCSp
Thank You