Summary of "Cassandra" for 3rd nosql summer reading in Tokyo

Cassandra – A Decentralized Structured Storage System
Gemini Mobile Technologies, Inc.
NOSQL Tokyo Reading Group (http://nosqlsummer.org/city/tokyo)
August 25, 2010
Tags: #cassandra #nosql
2010/8/23 Gemini Mobile Technologies, Inc.

Cassandra: A Decentralized Structured Storage System
Authors: Avinash Lakshman, Prashant Malik.
Abstract: Cassandra is a distributed storage system for managing very large amounts of structured data spread out across many commodity servers, while providing highly available service with no single point of failure. Cassandra aims to run on top of an infrastructure of hundreds of nodes (possibly spread across different data centers). At this scale, small and large components fail continuously. The way Cassandra manages the persistent state in the face of these failures drives the reliability and scalability of the software systems relying on this service. While in many ways Cassandra resembles a database and shares many design and implementation strategies therewith, Cassandra does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format. …
Appeared in: 3rd ACM SIGOPS International Workshop on Large Scale Distributed Systems and Middleware, 2009.
http://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf

1. Introduction and 2. Related Work
Facebook inbox search: Enables users to search through their inbox.
- Launched 6/2008.
- Highly scalable: 250M users.
- Tolerant of server/network failures.
- Very high write throughput: “billions of writes per day”.
- Replicates data across data centers.
Related Work
- Distributed file systems: Ficus, Coda, Farsite, GFS, Bayou.
- Storage systems: Dynamo, Bigtable.
“The Apache Cassandra Project develops a highly scalable second-generation distributed database, bringing together Dynamo's fully distributed design and Bigtable's ColumnFamily-based data model.”

3. Data Model
Multi-level Index:
1. Table: Set of rows.
2. Key: Identifies the row.
   - Key is arbitrary byte[].
   - Each row can contain a variable number of columns/CFs. Rows need not contain the same columns/CFs.
   - Each row can contain millions of columns/CFs.
   - Atomic operations per key per replica.
3. ColumnName: Identifies the column value(s).
   - Can be either “Column”, “ColumnFamily”, “ColumnFamily:Column”, “ColumnFamily:ColumnFamily”, etc.
   - ColumnFamily (CF) is a group of Columns.
   - CFs and Columns are sorted, either by time or by name.
   - Columns can be added/deleted efficiently at run time.

Data Model Example: Inbox Search
Query: Find all messages of user3 with “hello”.
Get(UserMessages, “user3”, “term:hello”)
Schema: Table: UserMessages; Key: <userid>; CF: “term”; CF: <word>; Name: <timestamp>; Val: <messageID>
[Figure: row “user3” of table UserMessages. CF “term” contains CFs “hello”, “how”, and “you”; each holds columns named by timestamp (time1, time4, time12) with message IDs as values (msg03, msg10, msg81).]
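The inbox search lookup above can be sketched with nested dictionaries. This is only an illustrative stand-in, not Cassandra's storage format: the `tables` variable and the `get` helper are hypothetical names, and the grouping of timestamps under each word is assumed, because the flattened figure does not preserve which columns belong to which word.

```python
from typing import Any, Dict

# Hypothetical in-memory stand-in for the multi-level index:
# table -> row key -> CF -> CF (word) -> {column name (timestamp): message ID}.
# The per-word grouping below is an assumption made for illustration.
tables: Dict[str, Dict[str, Any]] = {
    "UserMessages": {
        "user3": {
            "term": {
                "hello": {"time4": "msg10", "time12": "msg81"},
                "how": {"time4": "msg10"},
                "you": {"time4": "msg10", "time12": "msg81", "time1": "msg03"},
            }
        }
    }
}


def get(table: str, key: str, column_name: str) -> Dict[str, str]:
    """Walk a "ColumnFamily:ColumnFamily" path and return the columns found there."""
    node = tables[table][key]
    for part in column_name.split(":"):
        node = node[part]
    return dict(node)


# Find all messages of user3 that contain "hello".
messages = get("UserMessages", "user3", "term:hello")
print(sorted(messages.values()))
```

Each level of the path narrows the lookup, which is why a single key plus a column name is enough to answer the query without scanning other rows.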

4. API
Simple get/put operations:
- Insert(table, key, rowMutation)
  - Single columns, multiple columns, or a batch of multiple keys.
- Get(table, key, columnName)
  - Key: Single key or key range.
  - columnName: “Slice” range or name.
- Delete(table, key, columnName)
Also, specify Consistency Level.

5. System Architecture
- Data partitioned to subset of nodes: Consistent Hashing
- Data replicated to multiple nodes for redundancy, performance: Quorum using “preference list” of nodes
- Node management:
  - Membership algorithm to know which nodes are up/down: “Accrual failure detection + Gossip”
  - Bootstrapping to add node: Manual operation + “Seed” nodes
[Figure: NodeA, NodeB, NodeC, and NodeD; labels: Consistent Hash, Gossip.]

5.1 Partitioning Algorithm: Consistent Hashing
- Each node is assigned a random position on the ring.
- Key k is hashed to a fixed circular space.
- Nodes are assigned by walking clockwise from the hash location.
Example: Nodes A, B, C, D, E, F, G assigned to ring.
- Hash(k) is between A and B.
- Since there are 3 replicas, choose the next 3 nodes on the ring (i.e., B, C, D).
[Figure: ring of nodes A to G with Hash(k) between A and B; node assignment B, C, D.]
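The clockwise walk described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Cassandra's partitioner: the `Ring` class and `ring_position` helper are hypothetical names, MD5 is chosen only as a convenient hash, and node positions are derived from node names so the example is repeatable (the slide says positions are random).

```python
import bisect
import hashlib
from typing import List, Tuple


def ring_position(value: str) -> int:
    """Map a string onto the circular hash space (MD5 is an assumed choice)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes: List[str]) -> None:
        # Assumption: positions come from hashing the node name, not from a random draw.
        self._ring: List[Tuple[int, str]] = sorted((ring_position(n), n) for n in nodes)
        self._positions = [pos for pos, _ in self._ring]

    def replicas(self, key: str, n: int = 3) -> List[str]:
        """Walk clockwise from hash(key) and return the next n distinct nodes."""
        start = bisect.bisect_right(self._positions, ring_position(key))
        count = min(n, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


ring = Ring(["A", "B", "C", "D", "E", "F", "G"])
print(ring.replicas("some-key"))
```

The modulo on the index is what makes the sorted list behave as a ring: a key that hashes past the last node wraps around to the first.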

5.1 Consistent Hashing
- Key advantage: Adding, deleting, or re-allocating nodes is cheap. It affects only the keys of immediate neighbor nodes.
- Hash function: locality, load distribution.
- Load balancing: lightly loaded nodes move on the ring toward heavily loaded nodes.

5.2 Replication
- Each data item is replicated at multiple nodes (N).
- Each key is assigned to a “coordinator” node by the consistent hash function.
- The “coordinator” node replicates the key to an additional N-1 nodes.
- “Consistency Level” is set by the client per read/write request: ZERO, ONE, ALL, ANY, QUORUM.
- Zookeeper is used to elect a leader node and distribute the “preference list”.
- The leader node owns the “preference list” that maps key to node list.
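The slide lists the consistency levels without defining them. The sketch below assumes the meanings commonly documented for Cassandra releases of that period (ZERO waits for no replica, ANY and ONE need one acknowledgement, QUORUM needs a majority of N, ALL needs all N); the `required_acks` and `quorums_overlap` helpers are hypothetical names for illustration.

```python
def required_acks(level: str, n: int) -> int:
    """Replica acknowledgements to wait for, given replication factor n.

    Assumed semantics, not taken from the slide. For ANY, a stored hint is
    assumed to count as the single acknowledgement.
    """
    levels = {
        "ZERO": 0,
        "ANY": 1,
        "ONE": 1,
        "QUORUM": n // 2 + 1,
        "ALL": n,
    }
    return levels[level]


def quorums_overlap(read_acks: int, write_acks: int, n: int) -> bool:
    """A read is guaranteed to touch a replica holding the latest write when R + W > N."""
    return read_acks + write_acks > n


n = 3
q = required_acks("QUORUM", n)
print(q, quorums_overlap(q, q, n))
```

With N = 3, QUORUM reads and QUORUM writes each wait for 2 replicas, so every read set intersects every write set; ONE reads with ONE writes do not give that guarantee.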

5.3 Membership
- Each node locally determines whether any other node in the system is up or down.
- Φ (phi) Accrual Failure Detector:
  - Instead of a boolean value (up or down), compute a numeric value Φ representing the suspicion level for each monitored node.
  - Φ is computed from the inter-arrival times of gossip messages from other nodes in the cluster.
  - If Φ exceeds a particular threshold, the node is considered “down”.
- In an experiment with 100 nodes and a threshold of 5, the average time to detect a failure was 15 seconds.
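The Φ computation above can be sketched as follows, assuming gossip inter-arrival times follow an exponential distribution, so that Φ = -log10(P(next message arrives later than the time elapsed so far)). The `PhiAccrualDetector` class, its method names, and the window size are hypothetical choices for illustration.

```python
import math
from collections import deque
from typing import Deque, Optional


class PhiAccrualDetector:
    def __init__(self, threshold: float = 5.0, window: int = 1000) -> None:
        self.threshold = threshold
        # Sliding window of recent inter-arrival times (window size is an assumption).
        self.intervals: Deque[float] = deque(maxlen=window)
        self.last_arrival: Optional[float] = None

    def heartbeat(self, now: float) -> None:
        """Record the arrival time of a gossip message from the monitored node."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now: float) -> float:
        """Suspicion level: grows the longer the node stays silent."""
        if self.last_arrival is None or not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last_arrival
        # Exponential model: P(later) = exp(-elapsed / mean), so
        # phi = -log10(P(later)) = elapsed / (mean * ln 10).
        return elapsed / (mean * math.log(10))

    def is_down(self, now: float) -> bool:
        return self.phi(now) > self.threshold


detector = PhiAccrualDetector(threshold=5.0)
for t in (0.0, 1.0, 2.0, 3.0):
    detector.heartbeat(t)
print(detector.is_down(4.0), detector.is_down(20.0))
```

Because Φ scales with the observed mean interval, the same threshold adapts to slow and fast networks instead of relying on a fixed timeout.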

5.4 Bootstrapping, 5.5 Scaling the Cluster
- New nodes check their configuration for “seed” nodes to get initial gossip data such as “preference” lists.
- Nodes are not added or removed automatically; this requires a manual command-line operation.
- A new node needs to have data moved to it from other nodes. In operation, this runs at about 40 MB/s. Work is under way to improve this by copying data from multiple replicas, à la BitTorrent.

- Bloom filter to reduce SSTable access.
- Check SSTables in time order.
- In-memory table: dumped to SSTable when full.
- WRITE: no reads, seeks.
[Figure: WRITE, CommitLog, In-Memory Table, and three SSTables.]
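The pieces named on this slide (commit log, in-memory table, SSTables, Bloom filter) can be tied together in a toy sketch. Everything here is an assumption for illustration: the `Store` and `BloomFilter` classes are hypothetical, SSTables are plain dictionaries held in memory, reads check the newest SSTable first, and the filter parameters are arbitrary.

```python
import hashlib
from typing import Dict, Iterator, List, Optional, Tuple


class BloomFilter:
    def __init__(self, bits: int = 1024, hashes: int = 3) -> None:
        self.bits, self.hashes, self.field = bits, hashes, 0

    def _positions(self, key: str) -> Iterator[int]:
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.field |= 1 << pos

    def might_contain(self, key: str) -> bool:
        # False means definitely absent; True means possibly present.
        return all((self.field >> pos) & 1 for pos in self._positions(key))


class Store:
    def __init__(self, memtable_limit: int = 2) -> None:
        self.commit_log: List[Tuple[str, str]] = []
        self.memtable: Dict[str, str] = {}
        self.sstables: List[Tuple[BloomFilter, Dict[str, str]]] = []  # oldest first
        self.memtable_limit = memtable_limit

    def write(self, key: str, value: str) -> None:
        # Append-only log plus in-memory update: no reads or seeks on the write path.
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self) -> None:
        # Dump the full in-memory table to a new immutable, sorted SSTable.
        bloom = BloomFilter()
        for key in self.memtable:
            bloom.add(key)
        self.sstables.append((bloom, dict(sorted(self.memtable.items()))))
        self.memtable = {}

    def read(self, key: str) -> Optional[str]:
        if key in self.memtable:
            return self.memtable[key]
        for bloom, data in reversed(self.sstables):  # newest first (assumed order)
            if bloom.might_contain(key) and key in data:
                return data[key]
        return None


store = Store(memtable_limit=2)
for k, v in [("a", "1"), ("b", "2"), ("a", "3"), ("c", "4")]:
    store.write(k, v)
print(store.read("a"), store.read("b"), store.read("missing"))
```

The Bloom filter lets a read skip any SSTable that definitely does not hold the key, which is the access reduction the slide refers to.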
