Intro to Cassandra

Tyler Hobbs

History

– Dynamo: clustering
– BigTable: data model
– Cassandra combines both

Clustering

Every node plays the same role
– No masters, slaves, or special nodes
– No single point of failure

Consistent Hashing

Ring diagram: nodes at tokens 0, 10, 20, 30, 40, 50
– Key: "www.google.com"
– md5("www.google.com") lands at 14 on the ring
– Replication Factor = 3
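
The ring lookup above can be sketched in a few lines of Python. This is a minimal illustration, not Cassandra's implementation: `RING_SIZE`, `token_for`, and `replicas_for` are hypothetical names, the 60-slot token space is chosen only so the slide's node tokens fit, and replicas are assumed to be the next nodes clockwise from the key's token. The slide's value 14 is illustrative; a real MD5 digest reduced to this toy ring would generally land elsewhere.

```python
import bisect
import hashlib

RING_SIZE = 60                        # hypothetical toy token space
NODE_TOKENS = [0, 10, 20, 30, 40, 50]  # node positions from the slide


def token_for(key):
    """Hash a key with MD5 and reduce it to a position on the toy ring."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def replicas_for(token, rf, node_tokens=NODE_TOKENS):
    """Return the rf node tokens found walking clockwise from `token`."""
    ring = sorted(node_tokens)
    # First node whose token is >= the key's token, wrapping past the end.
    start = bisect.bisect_left(ring, token) % len(ring)
    return [ring[(start + i) % len(ring)] for i in range(min(rf, len(ring)))]


# The slide's example: a key at position 14 with Replication Factor = 3.
print(replicas_for(14, rf=3))
# A real key, hashed onto the toy ring.
print(replicas_for(token_for("www.google.com"), rf=3))
```

Under these assumptions, a key at position 14 is stored on the nodes at 20, 30, and 40.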

Clustering

Client can talk to any node

Scaling

Ring diagram: RF = 2, nodes at 0, 10, 20, 30, 50
– The node at 50 owns the red portion of the ring
– Add a new node at 40
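
To make the effect of adding a node concrete, here is a small sketch that computes which token ranges each node stores before and after the node at 40 joins. It assumes a simple placement rule where each range is replicated to the next RF - 1 nodes clockwise; `stored_ranges` is a hypothetical helper, not a Cassandra API.

```python
def stored_ranges(node_tokens, rf):
    """Map each node token to the (start, end] token ranges it stores."""
    ring = sorted(node_tokens)
    n = len(ring)
    stored = {}
    for i, token in enumerate(ring):
        ranges = []
        # A node stores its own range plus the ranges of the rf - 1
        # nodes that precede it on the ring.
        for back in range(min(rf, n)):
            owner = (i - back) % n
            ranges.append((ring[(owner - 1) % n], ring[owner]))
        stored[token] = ranges
    return stored


before = stored_ranges([0, 10, 20, 30, 50], rf=2)
after = stored_ranges([0, 10, 20, 30, 40, 50], rf=2)

print("node 50 before:", before[50])
print("node 40 after: ", after[40])
print("node 50 after: ", after[50])
```

In this model the node at 50 initially stores (20, 50]. Once the node at 40 joins, it takes over (20, 40], and the node at 50 shrinks to (30, 50]; only the neighbors of the new node are affected.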

Node Failures

Ring diagram: RF = 2, nodes at 0, 10, 20, 30, 40, 50, with the replicas labeled

Consistency, Availability

Consistency
– Can I read stale data?

Availability
– Can I write/read at all?

Tunable Consistency

Consistency

N = Total number of replicas

R = Number of replicas read from
– (before the response is returned)

W = Number of replicas written to
– (before the write is considered a success)

W + R > N gives strong consistency

Consistency

N=3
W=2
R=2

2 + 2 > 3 ==> strongly consistent

Only 2 of the 3 replicas must be
available.
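
The reason W + R > N gives strong consistency is that every set of R replicas read must then share at least one replica with every set of W replicas written. The sketch below checks this by brute force for small N; `always_overlap` is a hypothetical helper written for this illustration.

```python
from itertools import combinations


def always_overlap(n, r, w):
    """True if every choice of w written replicas and r read replicas
    out of n shares at least one replica."""
    replicas = range(n)
    return all(
        set(written) & set(read)
        for written in combinations(replicas, w)
        for read in combinations(replicas, r)
    )


# The slide's example: N=3, W=2, R=2.
print(always_overlap(3, r=2, w=2))
# A weaker setting: N=3, W=1, R=1 can read a replica the write missed.
print(always_overlap(3, r=1, w=1))
```

With N=3, W=2, R=2 every read set overlaps every write set, so a read always reaches at least one replica holding the latest write. With W=1, R=1 the sets can be disjoint, so stale reads are possible.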

Consistency

Tunable Consistency
– Specify N (Replication Factor) per data set
– Specify R, W per operation
– Quorum: floor(N/2) + 1
• R = W = Quorum
• Strong consistency
• Tolerate the loss of N – Quorum replicas
– R, W can also be 1 or N

Availability

Can tolerate the loss of:
– N – R replicas for reads
– N – W replicas for writes
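
The quorum size and the tolerated losses follow directly from N. The arithmetic can be sketched as below; `quorum` and `tolerated_losses` are hypothetical helper names used only for this example.

```python
def quorum(n):
    """Quorum size for n replicas: floor(n / 2) + 1."""
    return n // 2 + 1


def tolerated_losses(n, replicas_required):
    """Replicas that can be down while an operation that needs
    `replicas_required` responses still succeeds."""
    return n - replicas_required


for n in (1, 2, 3, 5):
    q = quorum(n)
    print(f"N={n}: quorum={q}, tolerates {tolerated_losses(n, q)} lost replica(s)")
```

For N=3 the quorum is 2, so quorum reads and writes survive the loss of one replica. For N=2 the quorum is also 2, so quorum operations tolerate no losses.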

CAP Theorem
During node or network failure:

Plot: Availability (up to 100%) against Consistency (up to 100%)
– 100% of both: not possible
– The rest of the plot: possible
– Cassandra is marked on the plot

Clustering

No single point of failure

Replication that works

Scales linearly
– 2x nodes = 2x performance
• For both writes and reads
– Up to hundreds of nodes

Operationally simple

Multi-Datacenter Replication

Data Model

Comes from Google BigTable

Goals
– Minimize disk seeks
– High throughput
– Low latency
– Durable

Data Model

Keyspace
– A collection of Column Families
– Controls replication settings

Column Family
– Kinda resembles a table

Column Families

Static
– Object data
– Similar to a table in a relational database

Dynamic
– Pre-calculated query results
– Materialized views

Static Column Families

Users
zznate   password: *   name: Nate
driftx   password: *   name: Brandon
thobbs   password: *   name: Tyler
jbellis  password: *   name: Jonathan   site: riptano.com

Dynamic Column Families

Rows
– Each row has a unique primary key
– Sorted list of (name, value) tuples
• Like a sorted map or dictionary
– The (name, value) tuple is called a “column”

Following
zznate   driftx:   thobbs:
driftx
thobbs   zznate:
jbellis  driftx:   mdennis:   pcmanus:   thobbs:   xedin:   zznate:


Column Timestamps
– Each column (tuple) has a timestamp
– In the case of a collision, the latest timestamp wins
– Client specifies timestamp with write
– Writes are idempotent
• Infinite retries allowed
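
A rough sketch of why timestamped writes are safe to retry: if the column with the latest timestamp always wins, replaying an old write changes nothing. The row is modeled here as a plain dict and `apply_write` is a hypothetical helper; tie-breaking on equal timestamps is simplified (the existing column is kept) and is not meant to mirror Cassandra's exact rule.

```python
def apply_write(row, name, value, timestamp):
    """Store (value, timestamp) under `name` unless the row already
    holds a column with an equal or newer timestamp."""
    current = row.get(name)
    if current is None or timestamp > current[1]:
        row[name] = (value, timestamp)
    return row


row = {}
apply_write(row, "age", 24, timestamp=1)
apply_write(row, "age", 34, timestamp=2)
# A delayed retry of the first write arrives late and is ignored.
apply_write(row, "age", 24, timestamp=1)
# Retrying the newest write is also harmless.
apply_write(row, "age", 34, timestamp=2)

print(row)
```

The row ends up holding age 34 with timestamp 2, however many times either write is replayed.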


Other Examples:
– Timeline of tweets by a user
– Timeline of tweets by all of the people a user is following
– List of comments sorted by score
– List of friends grouped by state

The Data API

Two choices
– RPC-based API
– CQL
• Cassandra Query Language

Inserting Data
INSERT INTO users (KEY, 'name', 'age')
VALUES ('thobbs', 'Tyler', 24);

Updating Data
Updates are the same as inserts:
INSERT INTO users (KEY, 'age')
VALUES ('thobbs', 34);

Or
UPDATE users SET 'age' = 34
WHERE KEY = 'thobbs';

Fetching Data
Whole row select:
SELECT * FROM users WHERE KEY = 'thobbs';

Fetching Data
Explicit column select:
SELECT 'name', 'age' FROM users
WHERE KEY = 'thobbs';

Fetching Data
Get a slice of columns
UPDATE letters SET 1='a', 2='b', 3='c', 4='d', 5='e'
WHERE KEY = 'key';

SELECT 1..3 FROM letters WHERE KEY = 'key';

Returns [(1, a), (2, b), (3, c)]

Fetching Data
SELECT FIRST 2 FROM letters WHERE KEY = 'key';

Returns [(1, a), (2, b)]

SELECT FIRST 2 REVERSED FROM letters
WHERE KEY = 'key';

Returns [(5, e), (4, d)]

Fetching Data
SELECT 3..'' FROM letters WHERE KEY = 'key';

Returns [(3, c), (4, d), (5, e)]

SELECT FIRST 2 REVERSED 4..'' FROM letters
WHERE KEY = 'key';

Returns [(4, d), (3, c)]
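
The slice queries above behave like range lookups on a sorted map. The sketch below emulates them in Python on a plain dict; `column_slice` is a hypothetical helper, and an empty bound (`''` in the queries) is represented by `None`. It illustrates the semantics only and is not how Cassandra executes a slice.

```python
import bisect

letters = {1: "a", 2: "b", 3: "c", 4: "d", 5: "e"}


def column_slice(row, start=None, finish=None, first=None, reverse=False):
    """Return (name, value) columns between start and finish, inclusive.

    With reverse=True the slice walks downward from start to finish.
    `first` limits how many columns are returned.
    """
    names = sorted(row)
    if reverse:
        hi = bisect.bisect_right(names, start) if start is not None else len(names)
        lo = bisect.bisect_left(names, finish) if finish is not None else 0
        selected = names[lo:hi][::-1]
    else:
        lo = bisect.bisect_left(names, start) if start is not None else 0
        hi = bisect.bisect_right(names, finish) if finish is not None else len(names)
        selected = names[lo:hi]
    if first is not None:
        selected = selected[:first]
    return [(name, row[name]) for name in selected]


print(column_slice(letters, 1, 3))                        # SELECT 1..3
print(column_slice(letters, first=2))                     # SELECT FIRST 2
print(column_slice(letters, first=2, reverse=True))       # SELECT FIRST 2 REVERSED
print(column_slice(letters, start=3))                     # SELECT 3..''
print(column_slice(letters, start=4, first=2, reverse=True))  # SELECT FIRST 2 REVERSED 4..''
```

Each call returns the same columns as the corresponding query result shown on the slides.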

Deleting Data
Delete a whole row:
DELETE FROM users WHERE KEY = 'thobbs';

Delete specific columns:
DELETE 'age' FROM users
WHERE KEY = 'thobbs';

Secondary Indexes
Built-in basic indexes
CREATE INDEX ageIndex ON users (age);

SELECT name FROM users
WHERE age = 24 AND state = 'TX';

Performance

Writes
– 10k – 30k per second per node
– Sub-millisecond latency

Reads
– 1k – 10k per second per node
– Depends on data set, caching
– Usually 0.1 to 10ms latency

Other Features

Distributed Counters
– Can support millions of high-volume counters

Excellent Multi-datacenter Support
– Disaster recovery
– Locality

Hadoop Integration
– Isolation of resources
– Hive and Pig drivers

Compression

What Cassandra Can't Do

Transactions
– Unless you use a distributed lock
– Atomicity, Isolation
– These aren't needed as often as you'd think

Limited support for ad-hoc queries
– Know what you want to do with the data

Not One-size-fits-all

Use alongside an RDBMS
– Use the RDBMS for highly transactional or highly relational data
• Usually a small set of data
– Let Cassandra scale to handle the rest

Language Support

Good:
– Java
– Python
– Ruby
– PHP
– C#

Coming Soon:
– Everything else, now that we have CQL

Questions?

Tyler Hobbs
@tylhobbs
tyler@datastax.com
