Cassandra

Prepared by,
Pooja G.V
8th sem, CSE
GECH.

• Cassandra is a distributed database from Apache
that is highly scalable and designed to manage
very large amounts of structured data.
• It provides high availability with no single point of
failure.

• Open-source database
management system
(DBMS)
• Several key features of
Cassandra differentiate it
from other similar systems

History of Cassandra
● Cassandra was created to power
the Facebook Inbox Search
● Facebook open-sourced Cassandra in 2008, and it
became an Apache Incubator project in 2009
● In 2010, Cassandra graduated to a top-level
project; regular updates and releases followed

Motivation and Function
● Designed to handle large amounts of data across
multiple servers
● There is a lot of unorganized data out there
● Easy to implement and deploy
● Mimics traditional relational database systems, with
support for triggers and lightweight transactions
● Raw, simple data structures

General Design Features
Emphasis on performance over analysis
● Still has support for analysis tools such as Hadoop
Organization
● Rows are organized into tables
● First component of a table’s primary key is the partition key
● Rows are clustered by the remaining columns of the key
● Columns may be indexed separately from the primary key
● Tables may be created, dropped, altered at runtime without blocking
queries
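The partition key and clustering rules above can be illustrated with a toy in-memory model. This is only a sketch, not Cassandra's storage engine: the `Table` class, its methods, and the sample keys are hypothetical names chosen for illustration.

```python
from bisect import insort
from collections import defaultdict


class Table:
    """Toy model: partition key -> rows kept sorted by clustering key."""

    def __init__(self):
        self.partitions = defaultdict(list)

    def insert(self, partition_key, clustering_key, value):
        rows = self.partitions[partition_key]
        # Upsert: an insert with an existing clustering key replaces the old row.
        rows[:] = [r for r in rows if r[0] != clustering_key]
        insort(rows, (clustering_key, value))

    def select(self, partition_key, start=None, end=None):
        # A query touches one partition and returns rows in clustering order.
        rows = self.partitions.get(partition_key, [])
        return [(c, v) for c, v in rows
                if (start is None or c >= start) and (end is None or c <= end)]


tweets = Table()
tweets.insert("john@example.com", 30, "third")
tweets.insert("john@example.com", 10, "first")
tweets.insert("john@example.com", 20, "second")
tweets.insert("jane@example.com", 15, "hello")

john_rows = tweets.select("john@example.com")
john_range = tweets.select("john@example.com", start=15)
```

Rows come back ordered by the clustering key regardless of insertion order, which is why range scans inside one partition are cheap in this model.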
Language
● CQL (Cassandra Query Language) introduced, similar to SQL (flattened
learning curve)

Peer to Peer Cluster
• Decentralized design
• No single point of failure
• No bottlenecking

Fault Tolerance/Durability
Failures happen all the time with multiple nodes
● Hardware Failure
● Bugs
● Operator error
● Power Outage, etc.
Solution: Buy cheap, redundant hardware,
replicate, maintain consistency

Fault Tolerance/Durability
• Replication
• Distribution of data to multiple data centers
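Replica placement can be sketched as a walk around a hash ring. The sketch below assumes a single data center and a SimpleStrategy-like rule (take the next nodes clockwise), and it uses MD5 as a stand-in hash; Cassandra's default partitioner is Murmur3, and virtual nodes are left out. The `Ring` class and node names are hypothetical.

```python
import hashlib
from bisect import bisect_left


def token(key: str) -> int:
    # Stand-in hash for illustration; not Cassandra's Murmur3 partitioner.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy token ring with one token per node (no virtual nodes)."""

    def __init__(self, nodes):
        self.ring = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.ring]

    def replicas(self, partition_key, replication_factor):
        # First node whose token is >= the key's token, wrapping at the end.
        start = bisect_left(self.tokens, token(partition_key)) % len(self.ring)
        count = min(replication_factor, len(self.ring))
        # The remaining replicas are the next nodes clockwise on the ring.
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"])
owners = ring.replicas("john@example.com", 3)
```

Because every node can compute the same owners from the key alone, no central coordinator is needed to locate data.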

Performance
• Core architectural designs allow Cassandra to
outperform its competitors
• Very good read and write throughputs

Scalability
Read and write throughput increase linearly as more machines are added
“In terms of scalability, there is a clear winner throughout our experiments.
Cassandra achieves the highest throughput for the maximum number of
nodes…” - University of Toronto

Comparisons
                     Apache Cassandra        Google Big Table                Amazon DynamoDB
Storage Type         Column                  Column                          Key-Value
Best Use             Write often, read less  Designed for large scalability  Large database solution
Concurrency Control  MVCC                    Locks                           ACID
Characteristics      High Availability,      Consistency,                    Consistency,
                     Partition Tolerance,    High Availability,              High Availability
                     Persistence             Partition Tolerance,
                                             Persistence

Netflix
• online DVD and Blu-ray rental
and video streaming service
• Nielsen study showed
that 38% of Americans
use or subscribe to
Netflix

Netflix: Why Cassandra
● Using a central SQL database negatively
impacted scalability and availability
● International expansion required a
Multi-Datacenter solution
● Need for configurable Replication, Consistency,
and Resiliency in the face of failure
● Cassandra on AWS offered high levels of
scalability and availability
- Jason Brown, Senior Software Engineer at Netflix

Hulu
• a website and a
subscription service
offering on-demand
streaming video media
• ~30 million unique
viewers per month

Hulu: Why Cassandra
• need for Availability
• need for Scalability
• Good Performance
• Nearly Linear Scalability
• Geo-Replication
• Minimal Maintenance Requirements
- Andres Rangel, Senior Software Engineer at Hulu

Reasons for Choosing Cassandra
• Value availability over
consistency
• Require high write-throughput
• High scalability required
• No single point of failure
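The availability versus consistency trade-off is tunable per request through the number of replicas that must acknowledge a read or a write. A minimal sketch of the usual overlap rule (reads are guaranteed to see the latest write when read replicas + write replicas > replication factor); the function names are hypothetical.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def overlaps(read_replicas: int, write_replicas: int, replication_factor: int) -> bool:
    """True when every read set must share at least one replica with every write set."""
    return read_replicas + write_replicas > replication_factor


rf = 3
q = quorum(rf)
strong = overlaps(q, q, rf)   # quorum reads + quorum writes
fast = overlaps(1, 1, rf)     # one replica each: more available, may read stale data
tolerated_failures = rf - q   # replicas that can be down while quorum still succeeds
```

With a replication factor of 3, quorum operations keep working with one replica down, while single-replica operations keep working with two down at the cost of possibly stale reads.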

Key-Value Model
• Cassandra is a column-oriented
NoSQL system
• Column families: sets of
key-value pairs
• A row is a collection of columns
labeled with a name

Cassandra Row
• the value of a row is itself a
sequence of key-value pairs
• such nested key-value pairs are
columns
• key = column name
• a row must contain at least 1
column

Column names storing values
• key: User ID
• column names store
tweet ID values
• values of all column
names are set to “-”
(empty byte array) as
they are not used
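A small sketch of this pattern, where the column name itself carries the data and the value is an empty placeholder. The `WideRow` class, `record_tweet` helper, and the IDs are hypothetical and only model the idea in memory.

```python
EMPTY = b""  # placeholder value; only the column name carries information


class WideRow:
    """Toy wide row: column names hold tweet IDs, values stay empty."""

    def __init__(self):
        self.columns = {}

    def add(self, column_name):
        self.columns[column_name] = EMPTY

    def names(self):
        # Columns are kept in sorted order by name.
        return sorted(self.columns)


timeline = {}  # row key (user ID) -> WideRow


def record_tweet(user_id, tweet_id):
    timeline.setdefault(user_id, WideRow()).add(tweet_id)


record_tweet("user-42", 1003)
record_tweet("user-42", 1001)
record_tweet("user-42", 1003)  # writing the same column again changes nothing

user_tweets = timeline["user-42"].names()
```

Since column names are unique and sorted within a row, the row doubles as an ordered set of tweet IDs per user.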

Key Space
• A Key Space is a group of
column families. It is only a
logical grouping of column
families and provides an
isolated scope for names

Comparing Cassandra (C*) and RDBMS
• with RDBMS, a normalized data model is created
without considering the exact queries
• with C*, the data model is designed for specific
queries
• C*: NO joins, relationships, or foreign keys
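Query-first modeling usually means writing the same data into one table per query instead of joining at read time. A minimal in-memory sketch with two hypothetical "tables" (`tweets_by_user`, `tweets_by_day`) and made-up sample data:

```python
from collections import defaultdict

# One structure per query, each keyed by what that query filters on.
tweets_by_user = defaultdict(list)  # answers: "all tweets by this user"
tweets_by_day = defaultdict(list)   # answers: "all tweets posted on this day"


def post_tweet(email, day, text):
    # Denormalized write: the tweet is stored once per query pattern.
    tweets_by_user[email].append((day, text))
    tweets_by_day[day].append((email, text))


post_tweet("john@example.com", "2015-03-01", "hello")
post_tweet("jane@example.com", "2015-03-01", "hi")
post_tweet("john@example.com", "2015-03-02", "again")

by_john = tweets_by_user["john@example.com"]
on_march_1 = tweets_by_day["2015-03-01"]
```

Each read is a single lookup by key; the cost is extra writes and duplicated storage, which fits a system optimized for write throughput.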

Cassandra Query Language - CQL
• creating a keyspace - namespace of tables
CREATE KEYSPACE demo
WITH replication = {'class': 'SimpleStrategy',
'replication_factor': 3};
• to use namespace:
USE demo;

• creating tables:
CREATE TABLE users(
  email varchar,
  bio varchar,
  birthday timestamp,
  active boolean,
  PRIMARY KEY (email));

CREATE TABLE tweets(
  email varchar,
  time_posted timestamp,
  tweet varchar,
  PRIMARY KEY (email, time_posted));

• inserting data
INSERT INTO users (email, bio, birthday, active)
VALUES ('john.doe@bti360.com', 'BT360 Teammate',
516513600000, true);

• querying tables
• SELECT reads one or more records from a
Cassandra column family and returns a result set of rows
SELECT * FROM users;
• filtering on a non-key column such as active needs a
secondary index or ALLOW FILTERING
SELECT email FROM users WHERE active = true ALLOW FILTERING;

Cassandra Advantages
• well suited to time-series data
• high performance
• decentralization
• nearly linear scalability
• replication support
• no single point of failure
• MapReduce support

Cassandra Weaknesses
● no referential integrity
● querying options for retrieving data are limited
● sorting data is a design decision
● no general multi-row ACID transactions (lightweight transactions
cover only single-partition compare-and-set)
● first think about queries, then about data model

Cassandra: Points to Consider
● Cassandra is designed as a distributed database
management system
● Cassandra write performance is generally excellent, but read
performance depends on write patterns
● having a high-level understanding of some internals is a plus
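One such internal explains why reads depend on write patterns: writes land in an in-memory memtable that is periodically flushed to immutable SSTables, so a read may have to check several of them. The sketch below is a simplified model under that assumption; it omits the commit log, bloom filters, and compaction, and the `LsmStore` class is a hypothetical name.

```python
class LsmStore:
    """Toy log-structured store: one memtable plus a list of flushed tables."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []  # flushed tables, oldest first, never modified
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        # A write only touches memory; flushing is a sequential dump.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.sstables.append(dict(self.memtable))
            self.memtable = {}

    def read(self, key):
        """Return (value, number of structures checked)."""
        checked = 1
        if key in self.memtable:
            return self.memtable[key], checked
        # Newest table first, so the most recent value for a key wins.
        for table in reversed(self.sstables):
            checked += 1
            if key in table:
                return table[key], checked
        return None, checked


store = LsmStore(memtable_limit=2)
for i, key in enumerate(["a", "b", "c", "d", "a", "e"]):
    store.write(key, i)

recent = store.read("a")  # rewritten recently, found in the newest table
old = store.read("b")     # written once long ago, found in the oldest table
```

Keys that are spread across many flushes cost more lookups per read, which is the cost that compaction reduces in a real deployment.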
