Cassandra 1.2

A highly scalable, eventually
consistent, distributed,
structured key-value store.
张天伦

Stable release: 1.2.0 / Jan. 2, 2013
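Family tree
● BigTable ('06) contributes: data model, tablet write / read, compaction, Bloom filter
● Dynamo ('07) contributes: cluster membership, eventual consistency, partition, fault tolerance
● Cassandra ('09) descends from both
● Other descendants: HBase, Hypertable (BigTable); Voldemort, Riak (Dynamo)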

Architecture Overview
● Messaging Layer
● Cluster Membership, Failure Detector
● Storage Layer
● Partitioner, Replicator
● Cassandra API, Tools

Agenda
● Data model
● Cluster membership
● Partition
● Replication
● Client request
● CQL
● Secondary index

Cassandra is a row-oriented
database

Data Model
● A keyspace is like a database in an RDBMS
● A column family is like a table
● Each row has a unique row key, like a primary key
● A column consists of a name, a value and a timestamp
[Figure: a keyspace holding column families; each row key maps to a sequence of columns]

Cassandra is a Distributed Hash Table using
consistent hashing

First, we have an empty token ring with 2^64 positions, from -2^63 to 2^63 - 1.
A token represents a position on the ring.

We add two nodes (B and D); their tokens determine their positions on the ring.
● Nodes mean machines here
● Tokens can be assigned manually or generated randomly
[Figure: token ring from -2^63 to 2^63 - 1 with nodes B and D]

A node is responsible for the range between its predecessor and itself.
[Figure: the ring split into B's range and D's range]
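
The range rule can be sketched as a lookup on a sorted list of tokens. This is only an illustration: the two token values and the name `owner_of` are assumptions made up for the example, not values from a real cluster.

```python
from bisect import bisect_left

# Hypothetical tokens for the two nodes; real tokens are assigned manually
# or generated randomly.
RING = [(-2**62, "B"), (2**62, "D")]  # sorted by token


def owner_of(token, ring=RING):
    """Return the node whose range (predecessor, own token] contains token."""
    tokens = [t for t, _ in ring]
    # First node whose token is >= the given token; wrap around past the end.
    index = bisect_left(tokens, token) % len(ring)
    return ring[index][1]
```

With these tokens, `owner_of(0)` falls in D's range, while any token above D's wraps around the ring to B.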

D has a list of seed nodes that includes B, so D knows B's IP address and can talk to B.
[Figure: D and B exchange messages across the ring]

When D hasn't received a reply from B for a while, it suspects that B is down.
[Figure: D sends messages to B and gets no reply]
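
A minimal sketch of the "no reply for a while" rule, assuming a fixed timeout. The class name and the timeout value are hypothetical; Cassandra itself uses an accrual failure detector that adapts to observed delays rather than a fixed threshold.

```python
import time


class TimeoutFailureDetector:
    """Suspect a peer when no reply has arrived within `timeout` seconds."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_reply = {}  # peer name -> time of the last reply

    def record_reply(self, peer):
        self.last_reply[peer] = self.clock()

    def is_suspected(self, peer):
        last = self.last_reply.get(peer)
        # A peer we have never heard from is treated as suspected.
        return last is None or self.clock() - last > self.timeout
```

D would call `record_reply("B")` on every reply and check `is_suspected("B")` before routing requests to B.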

Then we add more nodes (A and C).
Parts of B's and D's ranges are taken over by A and C.
[Figure: ring with nodes A, B, C and D; marked positions -2^63, -2^62, 0, 2^62 and 2^63 - 1]

Nodes A and C have D as their seed node, so they can talk to D.
[Figure: A and C each exchange messages with D]

Node A gets B's and C's IP addresses from D; node C gets B's and A's IP addresses from D.
Now nodes A, B and C can talk to one another.
[Figure: A, B and C exchange messages directly]

The way A and C learn about other nodes is called gossip
● Gossip is a peer-to-peer communication
protocol for exchanging location and state
information between nodes
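
The exchange can be sketched as two nodes merging their views of the cluster. The addresses, version numbers and function names below are assumptions for illustration; the real protocol carries more state than this.

```python
def merge(local, remote):
    """Merge a peer's view into ours, keeping the newer entry per node."""
    for node, (address, version) in remote.items():
        if node not in local or local[node][1] < version:
            local[node] = (address, version)


def gossip(a, b):
    """One gossip round: both sides end up with the union of their views."""
    merge(a, b)
    merge(b, a)


# Hypothetical addresses; each view maps node name -> (address, version).
views = {
    "A": {"A": ("10.0.0.1", 1)},
    "B": {"B": ("10.0.0.2", 1)},
    "C": {"C": ("10.0.0.3", 1)},
    "D": {"D": ("10.0.0.4", 1), "B": ("10.0.0.2", 1)},
}
gossip(views["A"], views["D"])  # A contacts its seed D
gossip(views["C"], views["D"])  # C contacts its seed D
gossip(views["A"], views["D"])  # a later round passes C's address to A
```

After these three rounds A and C each know all four nodes, although neither was configured with more than the seed D.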

A row key also gets a token (a position on the ring).
[Figure: row key -> token]

The row is stored on the node that is responsible for the range its token falls in.
e.g. johnny's token falls in the range of A, so the row is stored there.
[Figure: ring with nodes A, B, C and D; row keys johnny, jim, suzy and carol placed by token]

The partitioner assigns tokens
partitioner             function                            range
Murmur3Partitioner      MurmurHash function                 [-2^63, 2^63 - 1]
RandomPartitioner       MD5 hash value                      [0, 2^127 - 1]
ByteOrderedPartitioner  Orders rows lexically by key bytes  Platform's default charset (e.g. 32 bit for utf8)
One cluster, one partitioner!
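
A hash-based partitioner can be sketched with the standard library's MD5. The function name `md5_token` and the modulo reduction are assumptions of this sketch; it shows the idea behind RandomPartitioner, not necessarily its exact rule.

```python
import hashlib


def md5_token(row_key):
    """Map a row key to a token in [0, 2**127 - 1] using MD5."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()  # 16 bytes
    # Reduce the 128-bit value into the documented range. This reduction is
    # an assumption of the sketch, not necessarily Cassandra's exact rule.
    return int.from_bytes(digest, "big") % 2**127
```

The same key always maps to the same token, and similar keys such as 'jim' and 'johnny' land far apart on the ring, which is what spreads the load.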

Scans are different for them
● Murmur3Partitioner / RandomPartitioner: scan by token
● ByteOrderedPartitioner: scan by row key order
Row key  Column family
carol    ...
jim      ...
johnny   ...
suzy     ...

Drawbacks of ByteOrderedPartitioner
✗ Sequential writes can cause hot spots
✗ More administrative overhead to load balance
the cluster
✗ Uneven load balancing for multiple column
families

Data could be lost when nodes fail; we need a
replication strategy

The first replica is determined by the partitioner, and additional replicas are placed on the next nodes clockwise in the ring (SimpleStrategy).
Suppose we store 3 replicas.
[Figure: ring with nodes A, B, C and D; johnny is replicated to three consecutive nodes]
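
The clockwise placement can be sketched as follows. The token values and the name `simple_strategy` are hypothetical, chosen only so that the four nodes sit in the order A, B, C, D on the ring.

```python
from bisect import bisect_left

# Hypothetical token assignment for the four nodes in the example.
RING = [(-2**62, "A"), (0, "B"), (2**62, "C"), (2**63 - 1, "D")]


def simple_strategy(token, replicas, ring=RING):
    """First replica by token, the rest on the next nodes clockwise."""
    tokens = [t for t, _ in ring]
    first = bisect_left(tokens, token) % len(ring)
    count = min(replicas, len(ring))  # cannot place more replicas than nodes
    return [ring[(first + i) % len(ring)][1] for i in range(count)]
```

A token in A's range with 3 replicas yields A, B and C; a token in D's range wraps around to D, A and B.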

What if nodes A, B and C are on the same rack?
A rack failure would mean data loss.
[Figure: ring with nodes A, B, C and D; johnny's replicas all sit on one rack]

Cassandra can replicate data across racks and data centers
● West data center: suppose A and B are on rack1 and C and D are on rack2
● East data center: suppose E, F and G are on rack1 and H is on rack2
[Figure: two rings, one per data center, with nodes A to D and E to H]

This is called NetworkTopologyStrategy
● Use for multiple racks in a data center and
multiple data centers
● Specify how many replicas you want in each
data center
● Places replicas in the same data center by
walking down the ring clockwise until reaching
the first node in another rack
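
A rough sketch of rack-aware placement inside one data center, assuming the nodes are listed in clockwise ring order. The function name and the fallback to skipped nodes when racks run out are assumptions of this sketch, not a description of Cassandra's exact algorithm.

```python
def place_in_datacenter(ring, start, replicas):
    """Pick replicas for one data center, preferring distinct racks.

    ring: list of (node, rack) in clockwise order; start: index of the node
    that owns the token.
    """
    chosen, used_racks, skipped = [], set(), []
    for i in range(len(ring)):
        node, rack = ring[(start + i) % len(ring)]
        if rack in used_racks:
            skipped.append(node)  # same rack as an earlier replica
        else:
            chosen.append(node)
            used_racks.add(rack)
        if len(chosen) == replicas:
            return chosen
    # Not enough racks: fill up with the skipped nodes in ring order.
    return chosen + skipped[: replicas - len(chosen)]


# Rack layout from the example: A, B on rack1 and C, D on rack2.
NODES = [("A", "rack1"), ("B", "rack1"), ("C", "rack2"), ("D", "rack2")]
```

With 2 replicas starting at A, the walk skips B (same rack as A) and picks C, so a single rack failure cannot take out both copies.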

A write or read request could go to any node
which serves as a coordinator

A write request where D serves as the coordinator and replicas are stored on A, B and C.
Using the partitioner and the replication strategy, the coordinator determines which nodes receive the request.
[Figure: the client sends insert 'johnny' to coordinator D, which forwards it to A, B and C]

When does a coordinator return an
acknowledgement to the client?
● When the write succeeds on as many replicas as
the consistency level requires
✔ Consistency is the synchronization of data on
replicas in a cluster
✔ Consistency level is a client setting that defines a
successful write or read by the number of cluster
replicas that acknowledge the write or respond to
the read request, respectively

insert 'johnny' with consistency level = one
[Figure: the coordinator forwards the insert to the replicas; two writes are lost, one replica stores johnny and ACKs, and the coordinator ACKs the client]

insert 'johnny' with consistency level = quorum
● Quorum means majority: (replicas / 2) + 1
[Figure: two replicas store johnny and ACK, one write is lost, and the coordinator ACKs the client]

get 'johnny' with consistency level = quorum
● The coordinator returns the most recent data, determined by timestamp
[Figure: the replicas hold johnny v2, johnny v1 and johnny v2]
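
Picking the most recent data can be sketched as taking the maximum by timestamp. The `(value, timestamp)` pairs and the name `resolve` are hypothetical.

```python
def resolve(responses):
    """Return the value with the highest timestamp among replica responses."""
    value, _timestamp = max(responses, key=lambda r: r[1])
    return value


# Hypothetical (value, timestamp) pairs from the replicas that answered.
responses = [("johnny v2", 200), ("johnny v1", 100)]
```

Here `resolve(responses)` picks "johnny v2" because it carries the larger timestamp.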

What if I want strong consistency?
● Write CL + Read CL > Replicas
e.g. write one, read all
write all, read one
write quorum, read quorum
[Figure: three panels, one per example, showing a client writing to and reading from replicas A, B and C]
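
The quorum size and the strong-consistency rule are simple arithmetic and can be checked with a short sketch. The function names are made up for the example; consistency levels are expressed here as replica counts.

```python
def quorum(replicas):
    """Majority of the replicas: (replicas / 2) + 1 with integer division."""
    return replicas // 2 + 1


def is_strongly_consistent(write_cl, read_cl, replicas):
    """True when every read set must overlap every write set."""
    return write_cl + read_cl > replicas
```

With 3 replicas, quorum is 2, so quorum writes plus quorum reads give 2 + 2 > 3, while write one plus read one gives 1 + 1 = 2 and may miss the latest write.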

So Cassandra's consistency model is tunable

A write's journey
● Each column family has a memtable
● The memtable is flushed after several inserts
[Figure: memtable and commit log]

What are they?
● Memtable
an in-memory sorted map from row key to
columns
● SSTable
an immutable data file to which Cassandra
writes memtables periodically
● Commit log
a redo log to which Cassandra appends data
for recovery in the event of a hardware failure
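
How the three pieces work together on a write can be sketched as a toy class. All names here are hypothetical, and flushing after a fixed number of rows and clearing the whole log on flush are simplifications of this sketch.

```python
class ColumnFamilyStore:
    """Toy write path: commit log append, memtable update, flush to SSTable."""

    def __init__(self, flush_threshold=2):
        self.commit_log = []   # append-only redo log
        self.memtable = {}     # row key -> {column name: value}
        self.sstables = []     # immutable snapshots, oldest first
        self.flush_threshold = flush_threshold

    def write(self, row_key, column, value):
        self.commit_log.append((row_key, column, value))  # durability first
        self.memtable.setdefault(row_key, {})[column] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Freeze the memtable as a sorted, read-only SSTable.
        rows = tuple((k, tuple(sorted(v.items())))
                     for k, v in sorted(self.memtable.items()))
        self.sstables.append(rows)
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs replay
```

Tuples stand in for immutability: once flushed, an SSTable is never modified, only merged later.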

More updates and flush
[Figure: memtable, commit log and the flushed files; they belong to the same column family]

A read's journey
[Figure: read path diagram showing the memtable and the commit log]

SSTables are immutable; how about deletes?
● A tombstone is written to indicate a deleted
column
● Columns marked with a tombstone exist for the
configured gc_grace_seconds, after which
compaction permanently deletes the column

Compaction
● In the background, Cassandra periodically
merges SSTables together into larger
SSTables
● Compaction merges row fragments, removes
expired tombstones, and rebuilds indexes.
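
A minimal sketch of the merge step, assuming each SSTable is a dict of cells and timestamps are in seconds. The `TOMBSTONE` marker and the function name are hypothetical, and index rebuilding is left out.

```python
TOMBSTONE = object()  # marker value for a deleted column


def compact(sstables, now, gc_grace_seconds):
    """Merge SSTables into one, keeping the newest cell per (row, column).

    Each SSTable is a dict: (row key, column name) -> (value, timestamp).
    """
    merged = {}
    for table in sstables:
        for cell, (value, timestamp) in table.items():
            if cell not in merged or merged[cell][1] < timestamp:
                merged[cell] = (value, timestamp)
    # Drop tombstones whose grace period has expired.
    return {
        cell: (value, timestamp)
        for cell, (value, timestamp) in sorted(merged.items())
        if not (value is TOMBSTONE and now - timestamp > gc_grace_seconds)
    }
```

A tombstone younger than `gc_grace_seconds` survives the merge, so replicas that missed the delete can still learn about it.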

CQL
● Cassandra Query Language (CQL) is a SQL-like
language for querying Cassandra. We refer to CQL3 here
● CQL doesn't support joins; Cassandra
encourages denormalization
● Joins require expensive random reads, which
need to be merged across the network

CQL3 structure
● client: cqlsh, Java / .NET driver
● transport: Thrift RPC, CQL binary protocol
● server: Query Processor, internal write / read API (local path, remote path)

CQL3 queries
Table means column family here
CREATE TABLE profiles (
    id text PRIMARY KEY,
    first_name text,
    last_name text,
    age int
);
INSERT INTO profiles (id, first_name, last_name, age)
VALUES ('11485603', 'tianlun', 'zhang', 23);
SELECT * FROM profiles;
id        first_name  last_name  age
11485603  tianlun     zhang      23

CQL3 hides internal storage from users
CQL3 view:
id        first_name  last_name  age
11485603  tianlun     zhang      23
Internal storage:
Row key   Column name: Column value
11485603  age: 23, first_name: tianlun, last_name: zhang
Columns are sorted by column name

Compound primary key in CQL3
CREATE TABLE comments (
    article_id uuid,
    posted_at timestamp,
    author text,
    content text,
    PRIMARY KEY (article_id, posted_at)
);
● article_id is the row key
● The remaining component (posted_at) ensures that the columns in a row are stored in ascending order on disk

Columns are sorted first by posted_at and then by column name
article_id   posted_at                 author  content
550e8400-..  1970-01-17 00:08:19+0900  yukim   blah, blah, blah
550e8400-..  1970-01-17 05:08:19+0900  yukim   well, well, well

Since the columns of a row are sorted by time, we can efficiently get the comments on an article after a certain time
SELECT * FROM comments WHERE
    article_id = '550e8400-..' AND
    posted_at >= '1970-01-17 03:08:19+0900';
article_id   posted_at                 author  content
550e8400-..  1970-01-17 05:08:19+0900  yukim   well, well, well
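
Why the sorted layout makes this query cheap can be sketched with a binary search over one row. The integer `posted_at` values and the name `comments_since` are assumptions for illustration only.

```python
from bisect import bisect_left

# Hypothetical row: integer posted_at values kept in ascending order, each
# paired with the comment stored under it.
row = [
    (1350499616, "blah, blah, blah"),
    (1368499616, "well, well, well"),
]


def comments_since(row, posted_at):
    """Return comments with timestamp >= posted_at using binary search."""
    keys = [ts for ts, _ in row]
    return [content for _, content in row[bisect_left(keys, posted_at):]]
```

The search jumps straight to the first matching column and reads forward from there, instead of scanning the whole row.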

How about querying on a value?
SELECT * FROM comments WHERE author = 'yukim';
Bad Request: No indexed columns present in by-columns
clause with Equal operator
A secondary index enables us to query on a value

Secondary index
● Index on column values (should not be the primary
key or part of a compound primary key)
● Cassandra implements secondary indexes as a
hidden column family (invisible to the client),
separate from the column family that contains
the values being indexed

CREATE INDEX c_author ON comments (author);
Index column family
● Row key: the indexed column value (yukim)
● Columns: row key + column name from the base column family, i.e. [550e8400-.., 1350499616, author] and [550e8400-.., 1368499616, author]
● Base CF and index CF are flushed to disk at the same time

SELECT * FROM comments where author='yukim';
● Index column family is stored on the same
node as base column family
● Cassandra doesn't maintain column value
information in any one node and the query
still needs to be sent to all nodes
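
The two points above can be sketched as nodes that each keep a local index next to their rows. The class and function names are hypothetical, and the sketch does not remove stale index entries when a row is overwritten.

```python
class Node:
    """A node holding base rows plus a local index on one column."""

    def __init__(self):
        self.rows = {}    # row key -> {column name: value}
        self.index = {}   # indexed value -> set of local row keys

    def insert(self, row_key, columns, indexed_column="author"):
        self.rows[row_key] = columns
        value = columns.get(indexed_column)
        if value is not None:
            self.index.setdefault(value, set()).add(row_key)

    def lookup(self, value):
        return [self.rows[k] for k in sorted(self.index.get(value, ()))]


def query_all(nodes, value):
    """Each node indexes only its own rows, so every node must be asked."""
    return [row for node in nodes for row in node.lookup(value)]
```

Because no single node knows where all rows with author 'yukim' live, `query_all` has to visit every node and concatenate the local answers.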

Using multiple secondary indexes
● If 'bob' is less frequent than 'smith', Cassandra
will process users_fname = 'bob' first for
efficiency

DELETE FROM comments WHERE author='yukim';
● This is not allowed
● Deleting an indexed column won't update the index

Secondary index updates
● Cassandra appends data to the commit log,
updates the memtable, and updates the
secondary index
● If a read sees a stale index entry before
compaction purges it, the reader thread
invalidates it

Secondary index overhead
● Built on existing data in the background
automatically, without blocking reads or writes
(the CREATE clause)
● Updating indexes blocks reads or writes at row
level
(the INSERT clause)

There are more...
● Virtual nodes
● Atomic batches
● Request tracing
● Expiring / counter columns
● CQL collections
● Composite partition keys

Cassandra links
● Cassandra official website
http://cassandra.apache.org/
● Apache Cassandra 1.2 Documentation
http://www.datastax.com/docs/1.2/index
● Cassandra trunk
http://git-wip-us.apache.org/repos/asf/cassandra.git
● Configuration file
conf/cassandra.yaml
