A highly scalable, eventually
consistent, distributed,
structured key-value store.
张天伦
Cassandra 1.2
Stable release: 1.2.0 / Jan. 2, 2013
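Family tree
● Ancestors: BigTable ('06) and Dynamo ('07); Cassandra followed in '09
● From BigTable: data model, tablet write / read, compaction, Bloom filter
● From Dynamo: cluster membership, eventual consistency, partition, fault tolerance
● Relatives: HBase and Hypertable (BigTable family); Voldemort and Riak (Dynamo family)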
Architecture Overview
● Messaging Layer
● Cluster Membership, Failure Detector
● Storage Layer
● Partitioner, Replicator
● Cassandra API, Tools
Agenda
● Data model
● Cluster membership
● Partition
● Replication
● Client request
● CQL
● Secondary index
Cassandra is a row-oriented database
Data model
● A keyspace is like a database in an RDBMS
● A column family is like a table
● Each row has a unique row key, like a primary key
● A row holds columns; each column has a name, a value, and a timestamp
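The nesting described above can be pictured as plain dictionaries. This is only a minimal in-memory sketch, not how Cassandra stores data; the column family name `users`, the helper names, and the sample values are made up for illustration.

```python
import time

# Hypothetical picture of the data model:
# keyspace -> column family -> row key -> {column name: (value, timestamp)}
keyspace = {"users": {}}  # one column family named "users" (made-up name)

def insert(column_family, row_key, name, value, timestamp=None):
    """Store one column; each column carries its own timestamp."""
    if timestamp is None:
        timestamp = int(time.time() * 1_000_000)  # microseconds
    row = keyspace[column_family].setdefault(row_key, {})
    row[name] = (value, timestamp)

def read_row(column_family, row_key):
    """Return the row's columns sorted by column name."""
    row = keyspace[column_family].get(row_key, {})
    return sorted(row.items())

insert("users", "johnny", "last_name", "doe", timestamp=1)
insert("users", "johnny", "age", 30, timestamp=2)
```

Reading the row back returns the columns ordered by name, each with its value and timestamp.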
Cluster membership
Cassandra is a Distributed Hash Table using
consistent hashing
First, we have an empty token ring with 2^64 positions, from -2^63 to 2^63 - 1
A token represents a position on the ring
We add two nodes (B and D); their tokens determine their positions on the ring
● Nodes mean machines here
● Tokens can be assigned manually or generated randomly
A node is responsible for the range of tokens between its predecessor's token (exclusive) and its own token (inclusive); on the slide the ring is split into B's range and D's range
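A minimal sketch of that ownership rule, assuming made-up tokens for B and D (the slide does not give their actual values). A binary search over the sorted tokens finds the first node whose token is greater than or equal to the one looked up, wrapping around at the end of the ring.

```python
from bisect import bisect_left

MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1

# Made-up token assignments for the two nodes on the slide.
ring = sorted([(-2**62, "B"), (2**62, "D")])
tokens = [t for t, _ in ring]

def owner(token):
    """Node whose range (predecessor token, own token] contains `token`."""
    i = bisect_left(tokens, token)
    # Past the largest token, wrap around to the first node on the ring.
    return ring[i % len(ring)][1]
```

With these tokens, everything in (-2^62, 2^62] belongs to D and the rest of the ring, including the wrap-around, belongs to B.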
D has a list of seed nodes that includes B, so D knows B's IP address and can send messages to B
When D hasn't received a reply from B for a while, it suspects that B is down
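The idea of suspecting a silent peer can be sketched with a fixed timeout. This is a deliberate simplification: Cassandra's actual failure detector is adaptive (an accrual detector) rather than a fixed cutoff, and the class name, method names, and 10-second timeout below are assumptions for illustration.

```python
class HeartbeatMonitor:
    """Suspects a peer when no reply arrived within `timeout` seconds."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_reply = {}  # peer name -> time of the last reply

    def record_reply(self, peer, now):
        self.last_reply[peer] = now

    def is_suspected(self, peer, now):
        last = self.last_reply.get(peer)
        # A peer that never replied is suspected as well.
        return last is None or now - last > self.timeout

monitor = HeartbeatMonitor(timeout=10.0)
monitor.record_reply("B", now=100.0)
```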
Then we add more nodes (A and C)
Parts of B's and D's ranges are taken over by A and C
Nodes A and C have D as their seed node, so they can talk to D
Node A gets B's and C's IP addresses from D; node C gets B's and A's IP addresses from D
Now nodes A, B, and C can talk to one another
The way A and C learn about other nodes is called gossip
● Gossip is a peer-to-peer communication
protocol for exchanging location and state
information between nodes
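The membership walkthrough above can be sketched as nodes merging their lists of known peers. This is a toy model under stated assumptions: real gossip exchanges versioned state and picks peers at random, whereas here the exchanges are listed by hand and only endpoint names are shared.

```python
# Each node starts out knowing only itself and its seed nodes.
known = {
    "B": {"B"},
    "D": {"D", "B"},   # D's seed list includes B
    "A": {"A", "D"},   # A's seed is D
    "C": {"C", "D"},   # C's seed is D
}

def gossip(a, b):
    """One exchange: both nodes end up with the union of what they knew."""
    merged = known[a] | known[b]
    known[a] = set(merged)
    known[b] = set(merged)

gossip("D", "B")  # D contacts its seed B
gossip("A", "D")  # A contacts its seed D and learns about B
gossip("C", "D")  # C contacts D and learns about A and B
gossip("A", "D")  # a later round tells A about C
```

After these four exchanges A, C, and D know every node, while B still has to hear about A and C in a later round.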
Partition
The row is the unit of partitioning
Each row key is also mapped to a token (a position on the ring): row key → token
The row is stored on the node that is responsible for the range containing its token
(ring diagram: row keys johnny, jim, suzy, and carol placed on the ring of nodes A, B, C, and D)
e.g. johnny's token falls in A's range, so the row is stored on A
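A small sketch of the path from row key to token to node. Two things here are assumptions: the hash is a stand-in built from MD5 (it only imitates the signed 64-bit token range and is not the hash any real partitioner uses), and the four node tokens are invented, so the node chosen for a given key will not match the slide.

```python
import hashlib
from bisect import bisect_left

def token_for(row_key):
    """Stand-in hash: first 8 bytes of MD5 read as a signed 64-bit integer."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

# Hypothetical, evenly spaced tokens for the four nodes.
ring = sorted([(-2**62, "A"), (0, "B"), (2**62, "C"), (2**63 - 1, "D")])
tokens = [t for t, _ in ring]

def node_for(row_key):
    """Hash the key, then pick the first node whose token is >= the key's token."""
    i = bisect_left(tokens, token_for(row_key))
    return ring[i % len(ring)][1]
```

The same key always maps to the same node, which is what lets any node locate a row without a central directory.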
The partitioner assigns tokens
partitioner | function | range
Murmur3Partitioner | MurmurHash function | [-2^63, 2^63 - 1]
RandomPartitioner | MD5 hash value | [0, 2^127 - 1]
ByteOrderedPartitioner | Orders rows lexically by key bytes | Platform's default charset (e.g. 32 bit for utf8)
One cluster, one partitioner!
Scans are different for them
● Murmur3Partitioner / RandomPartitioner: scan by token
● ByteOrderedPartitioner: scan by row key order (carol, jim, johnny, suzy)
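The difference between the two scan orders can be illustrated by sorting the same keys two ways. The hash below is an MD5-based stand-in for a hashing partitioner, not the real Murmur3 or RandomPartitioner token function, so the token order it produces is only illustrative.

```python
import hashlib

keys = ["suzy", "johnny", "carol", "jim"]

def hashed_token(key):
    """Stand-in for a hashing partitioner's token function."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

# Hashing partitioners: a scan walks the ring, so rows come back in token order.
scan_by_token = sorted(keys, key=hashed_token)

# ByteOrderedPartitioner: rows come back in the order of their key bytes.
scan_by_key = sorted(keys, key=lambda k: k.encode("utf-8"))
```

Only the byte-ordered scan returns keys in an order that is meaningful to the application, which is why range scans over row keys need ByteOrderedPartitioner.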
Drawbacks of ByteOrderedPartitioner
✗ Sequential writes can cause hot spots
✗ More administrative overhead to load balance
the cluster
✗ Uneven load balancing for multiple column
families
Replication
Data can be lost when nodes fail, so we need a replication strategy
The first replica is determined by the partitioner, and additional replicas are placed on the next nodes clockwise in the ring (SimpleStrategy)
Suppose we store 3 replicas of johnny
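A minimal sketch of the SimpleStrategy placement rule described above, assuming invented tokens for nodes A to D. The function name and ring layout are illustrative only.

```python
from bisect import bisect_left

# Hypothetical tokens for the four nodes on the slide.
ring = sorted([(-2**62, "A"), (0, "B"), (2**62, "C"), (2**63 - 1, "D")])
tokens = [t for t, _ in ring]

def simple_strategy(token, replication_factor):
    """First replica: owner of the token; the rest: next nodes clockwise."""
    first = bisect_left(tokens, token) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(first + i) % len(ring)][1] for i in range(count)]
```

For a token in A's range and 3 replicas this yields A, B, and C, matching the slide's example.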
What if nodes A, B, and C are on the same rack?
A rack failure would mean data loss
Cassandra can replicate data across racks and data centers
(diagram: two rings, nodes A to D and nodes E to H, labeled West data center and East data center)
Suppose A and B are on rack1 and C and D are on rack2
Suppose E, F, and G are on rack1 and H is on rack2
This is called NetworkTopologyStrategy
● Use for multiple racks in a data center and
multiple data centers
● Specify how many replicas you want in each
data center
● Places replicas in the same data center by
walking down the ring clockwise until reaching
the first node in another rack
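The rack-aware walk within one data center might look like the sketch below. It is a simplified reading of the bullet above, not the actual NetworkTopologyStrategy code: the fallback to already-used racks when distinct racks run out is an assumption, and the function name is made up. The rack layout follows the slide's example for nodes A to D.

```python
# Nodes of one data center in ring (clockwise) order:
# A and B on rack1, C and D on rack2.
ring = [("A", "rack1"), ("B", "rack1"), ("C", "rack2"), ("D", "rack2")]

def place_replicas(start, replicas):
    """Walk clockwise from index `start`, preferring racks not used yet."""
    order = [ring[(start + i) % len(ring)] for i in range(len(ring))]
    chosen, used_racks, skipped = [], set(), []
    for node, rack in order:
        if len(chosen) == replicas:
            break
        if rack in used_racks:
            skipped.append(node)      # same rack as an earlier replica
        else:
            chosen.append(node)
            used_racks.add(rack)
    # Not enough distinct racks: fill up with the skipped nodes in ring order.
    chosen.extend(skipped[: replicas - len(chosen)])
    return chosen
```

With two replicas starting at A, the walk skips B (same rack) and picks C, so a single rack failure cannot take out both copies.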
Client request
A write or read request can go to any node, which then serves as the coordinator
A write request where D serves as the coordinator and replicas are stored on A, B, and C
(diagram: the client sends insert 'johnny' to coordinator D, which forwards it to A, B, and C)
Using the partitioner and the replication strategy, the coordinator determines which nodes receive the request
When does a coordinator return an acknowledgement to the client?
● When the write has succeeded on as many replicas as the consistency level requires
✔ Consistency is the synchronization of data on
replicas in a cluster
✔ Consistency level is a client setting that defines a
successful write or read by the number of cluster
replicas that acknowledge the write or respond to
the read request, respectively
insert 'johnny' with consistency level = one
(diagram: the coordinator sends the write to the three replicas; two messages are lost, one replica returns an ACK, and the coordinator then ACKs the client)
insert 'johnny' with consistency level = quorum
Quorum means majority: (replicas / 2) + 1, with the division rounded down
(diagram: one message is lost; two of the three replicas return an ACK, which is a quorum, and the coordinator then ACKs the client)
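The acknowledgement rule can be written down in a few lines. This sketch covers only the three levels named on the slides (one, quorum, all); the function names are made up.

```python
def quorum(replicas):
    """Majority of the replicas: (replicas / 2) + 1 with integer division."""
    return replicas // 2 + 1

def required_acks(level, replicas):
    """Replica acknowledgements needed before the client gets its ACK."""
    return {"one": 1, "quorum": quorum(replicas), "all": replicas}[level]

def write_succeeds(level, replicas, acks_received):
    return acks_received >= required_acks(level, replicas)
```

With 3 replicas a quorum is 2, so the write on the quorum slide succeeds even though one message is lost.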
get 'johnny' with consistency level = quorum
(diagram: the replicas hold johnny at versions v2, v1, and v2)
The coordinator returns the most recent data, as determined by timestamp
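Picking the most recent reply by timestamp can be sketched as follows. The replica names and timestamps are invented; only the v2, v1, v2 pattern comes from the slide.

```python
# Replies from the replicas contacted for the read: (value, timestamp).
# The timestamps are made up for illustration.
replies = {"A": ("v2", 200), "B": ("v1", 100), "C": ("v2", 200)}

def resolve(replies):
    """Return the value carrying the highest timestamp."""
    value, _timestamp = max(replies.values(), key=lambda reply: reply[1])
    return value
```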
What if I want strong consistency?
● Write CL + Read CL > Replicas
● e.g. write one, read all
● e.g. write all, read one
● e.g. write quorum, read quorum
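Why the inequality gives strong consistency: if the write set and the read set together are larger than the number of replicas, they must share at least one replica, so every read touches a replica that saw the latest write. The brute-force check below is a sketch of that argument for small replica counts; the function names are made up.

```python
from itertools import combinations

def is_strongly_consistent(write_cl, read_cl, replicas):
    """The rule from the slide: Write CL + Read CL > Replicas."""
    return write_cl + read_cl > replicas

def always_overlap(write_cl, read_cl, replicas):
    """Check every possible write set against every possible read set."""
    nodes = range(replicas)
    return all(
        set(written) & set(read)
        for written in combinations(nodes, write_cl)
        for read in combinations(nodes, read_cl)
    )
```

For 3 replicas, the rule and the exhaustive overlap check agree on all nine combinations of write and read levels.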
So Cassandra's consistency model is tunable
A write's journey
Each column family has a memtable
(diagram labels: memtable, commit log)
Flush after several inserts
What are they?
● Memtable
an in-memory sorted map from row key to
columns
● SSTable
an immutable data file to which Cassandra
writes memtables periodically
● Commit log
a redo log to which Cassandra appends data
for recovery in the event of a hardware failure
More updates and flush
The resulting SSTables belong to the same column family
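The write path can be sketched as a toy store for one column family. This is an in-memory illustration under stated assumptions: the flush trigger (a row count) and clearing the commit log on flush are simplifications, and the class and attribute names are made up.

```python
class ColumnFamilyStore:
    """Toy write path for a single column family."""

    def __init__(self, flush_threshold):
        self.commit_log = []   # append-only redo log
        self.memtable = {}     # row key -> {column name: value}
        self.sstables = []     # immutable snapshots of flushed memtables
        self.flush_threshold = flush_threshold

    def write(self, row_key, column, value):
        self.commit_log.append((row_key, column, value))  # logged first
        self.memtable.setdefault(row_key, {})[column] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Freeze the memtable, sorted by row key, as a new SSTable.
        snapshot = tuple(
            (key, tuple(sorted(cols.items())))
            for key, cols in sorted(self.memtable.items())
        )
        self.sstables.append(snapshot)
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs replaying

store = ColumnFamilyStore(flush_threshold=2)
store.write("johnny", "age", 30)
store.write("jim", "age", 25)     # second row triggers a flush
store.write("suzy", "age", 28)    # lands in a fresh memtable
```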
A read's journey
A read checks the memtable and the SSTables of the column family and merges the results by timestamp
SSTables are immutable, so how are deletes handled?
● A tombstone is written to indicate a deleted
column
● Columns marked with a tombstone exist for the
configured gc_grace_seconds, after which
compaction permanently deletes the column
Compaction
● In the background, Cassandra periodically
merges SSTables together into larger
SSTables
● Compaction merges row fragments, removes
expired tombstones, and rebuilds indexes.
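Merging SSTables and purging expired tombstones can be sketched as below. This is a toy model: SSTables are plain dictionaries keyed by (row key, column name), timestamps are in seconds, and index rebuilding is left out. The `TOMBSTONE` marker and function name are made up.

```python
TOMBSTONE = object()  # marker written instead of a value on delete

def compact(sstables, now, gc_grace_seconds):
    """Merge SSTables: keep the newest version of each column and drop
    tombstones older than gc_grace_seconds."""
    merged = {}
    for table in sstables:
        for key, (value, timestamp) in table.items():
            if key not in merged or timestamp > merged[key][1]:
                merged[key] = (value, timestamp)
    return {
        key: (value, timestamp)
        for key, (value, timestamp) in merged.items()
        if not (value is TOMBSTONE and now - timestamp > gc_grace_seconds)
    }

old = {("johnny", "age"): (30, 10), ("jim", "age"): (25, 12)}
new = {("johnny", "age"): (31, 20), ("jim", "age"): (TOMBSTONE, 30)}
```

A tombstone younger than the grace period survives compaction, so the deletion can still reach replicas that missed it; once it expires, the column disappears for good.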
CQL
● Cassandra Query Language (CQL) is a SQL-like
language for querying Cassandra. We refer to CQL3 here.
● CQL doesn't support joins; Cassandra
encourages denormalization. Joins require expensive
random reads, whose results need to be merged across the network.
CQL3 structure
● Client: cqlsh, Java / .NET driver
● Transport: Thrift RPC, CQL binary protocol
● Server: Query Processor, then the internal write / read API (local path, remote path)
CQL3 queries (table means column family here)
    CREATE TABLE profiles (
        id text PRIMARY KEY,
        first_name text,
        last_name text,
        age int
    );
    INSERT INTO profiles (id, first_name, last_name, age)
    VALUES ('11485603', 'tianlun', 'zhang', 23);
    SELECT * FROM profiles;
Result:
id | first_name | last_name | age
11485603 | tianlun | zhang | 23
CQL3 hides internal storage from users
What the user sees:
id | first_name | last_name | age
11485603 | tianlun | zhang | 23
Internal storage (row key, then column name: column value):
11485603 → age: 23, first_name: tianlun, last_name: zhang
Columns are sorted by column name
Compound primary key in CQL3
    CREATE TABLE comments (
        article_id uuid,
        posted_at timestamp,
        author text,
        content text,
        PRIMARY KEY (article_id, posted_at)
    );
● The first component (article_id) is the row key
● The remaining component (posted_at) ensures that the columns in a row are stored in ascending order on disk
Columns are sorted first by posted_at and then by column name
article_id | posted_at | author | content
550e8400-.. | 1970-01-17 00:08:19+0900 | yukim | blah, blah, blah
550e8400-.. | 1970-01-17 05:08:19+0900 | yukim | well, well, well
Since the columns of a row are sorted by time, we can efficiently get the comments on an article posted after a certain time
    SELECT * FROM comments WHERE
        article_id = '550e8400-..' AND
        posted_at >= '1970-01-17 03:08:19+0900';
Result:
article_id | posted_at | author | content
550e8400-.. | 1970-01-17 05:08:19+0900 | yukim | well, well, well
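Why this range query is cheap can be sketched with a sorted list. The model below assumes one internal row per article_id whose column names are (posted_at, CQL column name) pairs kept in sorted order; the shortened timestamp strings and the function name are made up, and the real on-disk encoding is more involved.

```python
from bisect import bisect_left

# One internal row for a single article_id. Column names are
# (posted_at, CQL column) pairs, kept sorted as on disk.
row = sorted([
    (("1970-01-17 00:08", "author"), "yukim"),
    (("1970-01-17 00:08", "content"), "blah, blah, blah"),
    (("1970-01-17 05:08", "author"), "yukim"),
    (("1970-01-17 05:08", "content"), "well, well, well"),
])
names = [name for name, _ in row]

def comments_since(posted_at):
    """Slice the sorted row from the first column with time >= posted_at."""
    start = bisect_left(names, (posted_at, ""))
    result = {}
    for (time, column), value in row[start:]:
        result.setdefault(time, {})[column] = value
    return result
```

The query becomes one binary search followed by a contiguous read, rather than a scan over every comment.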
How about querying on a value?
    SELECT * FROM comments WHERE author = 'yukim';
    Bad Request: No indexed columns present in by-columns clause with Equal operator
A secondary index enables us to query on a value
Secondary index
● An index on column values (the indexed column should not be the primary
key or part of a compound primary key)
● Cassandra implements secondary indexes as a
hidden column family (invisible to the client),
separate from the column family that contains
the values being indexed
    CREATE INDEX c_author ON comments (author);
Index column family (row key = indexed column value; column names = base row key + column name):
yukim → [550e8400-.., 1350499616, author], [550e8400-.., 1368499616, author]
Base CF and index CF are flushed to disk at the same time
    SELECT * FROM comments WHERE author = 'yukim';
● The index column family is stored on the same
node as the base column family it indexes
● No single node holds the column value
information for the whole cluster, so the query
still needs to be sent to all nodes
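The node-local nature of the index can be sketched as follows. Node names, row keys, and contents are invented; the point is only that each node indexes its own rows, so the query has to fan out to every node.

```python
# Each node stores some base rows plus a local index over only those rows.
nodes = {
    "A": {"rows": {"c1": {"author": "yukim"}, "c2": {"author": "bob"}}},
    "B": {"rows": {"c3": {"author": "yukim"}}},
    "C": {"rows": {"c4": {"author": "alice"}}},
}

def build_local_index(node, column):
    """Map each value of `column` to the row keys stored on this node."""
    index = {}
    for row_key, columns in node["rows"].items():
        index.setdefault(columns[column], []).append(row_key)
    node["index"] = index

for node in nodes.values():
    build_local_index(node, "author")

def query(column_value):
    """Ask every node, since any of them may hold matching rows."""
    hits = []
    for name in sorted(nodes):
        hits.extend(nodes[name]["index"].get(column_value, []))
    return hits
```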
Using multiple secondary indexes
● When a query filters on two indexed columns, e.g. users_fname = 'bob'
and a last name of 'smith', and 'bob' is less frequent than 'smith',
Cassandra processes users_fname = 'bob' first for efficiency
    DELETE FROM comments WHERE author = 'yukim';
● This is not allowed
● Deleting an indexed column won't update the index
Secondary index updates
● Cassandra appends data to the commit log,
updates the memtable, and updates the
secondary index
● If a read sees a stale index entry before
compaction purges it, the reader thread
invalidates it
Secondary index overhead
● Creating an index (CREATE INDEX): the index is built on existing data
automatically in the background, without blocking reads or writes
● Updating an index (INSERT): blocks reads and writes at the row
level
There are more...
● Virtual nodes
● Atomic batches
● Request tracing
● Expiring / counter columns
● CQL collections
● Composite partition keys
Cassandra links
● Cassandra official website
http://cassandra.apache.org/
● Apache Cassandra 1.2 documentation
http://www.datastax.com/docs/1.2/index
● Cassandra trunk
http://git-wip-us.apache.org/repos/asf/cassandra.git
● Configuration file
conf/cassandra.yaml
Thank you !
