Prepared by,
Pooja G.V
8th sem, CSE
GECH.
Cassandra
• Cassandra is a distributed database from Apache
that is highly scalable and designed to manage
very large amounts of structured data.
• It provides high availability with no single point of
failure.
• Open-source database
management system
(DBMS)
• Several key features of
Cassandra differentiate it
from other similar systems
History of Cassandra
● Cassandra was created to power
the Facebook Inbox Search
● Facebook open-sourced Cassandra in 2008, and it
became an Apache Incubator project
● In 2010, Cassandra graduated to a top-level
project; regular updates and releases followed
Motivation and Function
● Designed to handle large amounts of data across
multiple servers
● There is a lot of unorganized data out there
● Easy to implement and deploy
● Mimics traditional relational database systems, but
with triggers and lightweight transactions
● Raw, simple data structures
General Design Features
Emphasis on performance over analysis
● Still has support for analysis tools such as Hadoop
Organization
● Rows are organized into tables
● First component of a table’s primary key is the partition key
● Rows are clustered by the remaining columns of the key
● Columns may be indexed separately from the primary key
● Tables may be created, dropped, altered at runtime without blocking
queries
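The organization rules above can be illustrated with a minimal Python sketch. It is not Cassandra code: it only mimics how rows sharing a partition key are kept together and ordered by a clustering column. The function name `build_table` and the sample rows are assumptions made for this illustration.

```python
from collections import defaultdict

def build_table(rows, partition_key, clustering_key):
    """Group rows by partition key, then sort each partition by clustering key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)
    for part in partitions.values():
        part.sort(key=lambda r: r[clustering_key])
    return dict(partitions)

# Hypothetical sample rows: email acts as the partition key,
# time_posted as the clustering column.
rows = [
    {"email": "a@example.com", "time_posted": 3, "tweet": "third"},
    {"email": "b@example.com", "time_posted": 1, "tweet": "hello"},
    {"email": "a@example.com", "time_posted": 1, "tweet": "first"},
]
table = build_table(rows, "email", "time_posted")
ordered = [r["tweet"] for r in table["a@example.com"]]
```

Reading one partition returns its rows already in clustering order, which is why a query restricted to a single partition key can be served without a cluster-wide sort.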
Language
● CQL (Cassandra Query Language) introduced, similar to SQL (flattened
learning curve)
Peer to Peer Cluster
• Decentralized design
• No single point of failure
• No bottlenecking
Fault Tolerance/Durability
Failures happen all the time with multiple nodes
● Hardware Failure
● Bugs
● Operator error
● Power Outage, etc.
Solution: Buy cheap, redundant hardware,
replicate, maintain consistency
Fault Tolerance/Durability
• Replication
• Distribution of data to multiple data centers
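As a rough illustration of replication on a peer-to-peer ring, the sketch below places each key on a hash ring and copies it to the next nodes clockwise, in the spirit of SimpleStrategy. It is a simplification under stated assumptions: one token per node, SHA-256 as a stand-in hash, and hypothetical node names; the `Ring` class is not part of any Cassandra API.

```python
import bisect
import hashlib

def token(value: str) -> int:
    """Map a string to a ring position (SHA-256 chosen only for illustration)."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # One token per node keeps the sketch short.
        self.entries = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, partition_key: str, replication_factor: int):
        """Return the owning node plus the next nodes clockwise on the ring."""
        count = len(self.entries)
        rf = min(replication_factor, count)
        # The first node whose token is >= the key's token owns the key.
        start = bisect.bisect_left(self.tokens, token(partition_key)) % count
        return [self.entries[(start + i) % count][1] for i in range(rf)]

ring = Ring(["node-a", "node-b", "node-c", "node-d"])
owners = ring.replicas("john.doe@bti360.com", 3)
```

Because every node can compute the same placement from the key alone, no central coordinator is needed to locate data, and losing one of the three owners still leaves two copies.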
Performance
• Core architectural design choices allow Cassandra to
outperform many of its competitors
• Very good read and write throughputs
Scalability
Read and write throughput increase linearly as more machines are added
“In terms of scalability, there is a clear winner throughout our experiments.
Cassandra achieves the highest throughput for the maximum number of
nodes…” - University of Toronto
Comparisons
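Storage Type
• Apache Cassandra: Column
• Google Bigtable: Column
• Amazon DynamoDB: Key-Value
Best Use
• Apache Cassandra: Write often, read less
• Google Bigtable: Designed for large scalability
• Amazon DynamoDB: Large database solution
Concurrency Control
• Apache Cassandra: MVCC
• Google Bigtable: Locks
• Amazon DynamoDB: ACID
Characteristics
• Apache Cassandra: High Availability, Partition Tolerance, Persistence
• Google Bigtable: Consistency, High Availability, Partition Tolerance, Persistence
• Amazon DynamoDB: Consistency, High Availability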
Cassandra Use Cases
Netflix
• online DVD and Blu-Ray
movie retailer
• Nielsen study showed
that 38% of Americans
use or subscribe to
Netflix
Netflix: Why Cassandra
● Using a central SQL database negatively
impacted scalability and availability
● International Expansion required Multi-
Datacenter solution
● Need for configurable Replication, Consistency,
and Resiliency in the face of failure
● Cassandra on AWS offered high levels of
scalability and availability
Jason Brown,
Senior Software
Engineer at Netflix
Hulu
• a website and a
subscription service
offering on-demand
streaming video media
• ~30 million unique
viewers per month
Hulu: Why Cassandra
• need for Availability
• need for Scalability
• Good Performance
• Nearly Linear Scalability
• Geo-Replication
• Minimal Maintenance Requirements
Andres Rangel,
Senior Software
Engineer at Hulu
Reasons for Choosing Cassandra
• Value availability over
consistency
• Require high write-throughput
• High scalability required
• No single point of failure
CAP Theorem
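The trade-off between availability and consistency can be made concrete with a little arithmetic. Cassandra lets a client choose how many replicas must answer each read and write; the sketch below assumes the usual overlap rule (a read sees the latest write when read replicas plus write replicas exceed the replication factor). The function names are hypothetical and chosen for this example.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1

def reads_see_latest_write(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set."""
    return read_replicas + write_replicas > replication_factor

rf = 3
# Quorum reads and quorum writes: 2 + 2 > 3, so the sets always overlap.
strong = reads_see_latest_write(quorum(rf), quorum(rf), rf)
# One replica for reads and writes: 1 + 1 > 3 is false, so reads may be stale.
fast = reads_see_latest_write(1, 1, rf)
```

Choosing fewer replicas per request favors availability and latency; choosing quorums favors consistency at the cost of needing a majority of replicas to be reachable.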
Cassandra’s Data Model
Key-Value Model
• Cassandra is a column-oriented
NoSQL system
• Column families: sets of key-value pairs
• A row is a collection of columns
labeled with a name
Cassandra Row
• the value of a row is itself a
sequence of key-value pairs
• such nested key-value pairs are
columns
• key = column name
• a row must contain at least 1
column
Example of Columns
Column names storing values
• key: User ID
• column names store
tweet ID values
• values of all column
names are set to “-”
(empty byte array) as
they are not used
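A small Python sketch may help picture this layout. It models a row as a mapping from column names to values, where the column names themselves carry the data (tweet IDs) and the values stay empty. The names `timeline`, `add_tweet`, and `tweet_ids`, and the sample IDs, are assumptions for illustration only.

```python
timeline = {}  # row key (user ID) -> {column name (tweet ID): value}

def add_tweet(user_id: str, tweet_id: int) -> None:
    """Record a tweet ID as a column name; the value is an empty byte array."""
    timeline.setdefault(user_id, {})[tweet_id] = b""

def tweet_ids(user_id: str):
    """Return tweet IDs in column-name order, as a sorted wide row would."""
    return sorted(timeline.get(user_id, {}))

add_tweet("user-42", 1003)
add_tweet("user-42", 1001)
add_tweet("user-42", 1003)  # writing the same column again just overwrites it
```

Storing the IDs as column names means the row is kept sorted by tweet ID and duplicates collapse naturally, which is the point of the pattern.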
Key Space
• A keyspace is a group of
column families. It
is only a logical grouping of
column families and
provides an isolated scope
for names
Comparing Cassandra (C*) and RDBMS
• with RDBMS, a normalized data model is created
without considering the exact queries
• with C*, the data model is designed for specific
queries
• C*: NO joins, relationships, or foreign keys
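One way to picture query-driven modeling is to keep one structure per query and write every record to each of them, trading duplication for the lack of joins. The sketch below is plain Python, not CQL; the two dictionaries stand in for two hypothetical tables, and the sample emails and dates are made up.

```python
from collections import defaultdict

tweets_by_user = defaultdict(list)  # answers "all tweets by this user"
tweets_by_day = defaultdict(list)   # answers "all tweets posted on this day"

def post_tweet(email: str, day: str, text: str) -> None:
    """Write the same tweet to both query tables instead of joining later."""
    tweets_by_user[email].append((day, text))
    tweets_by_day[day].append((email, text))

post_tweet("a@example.com", "2014-05-01", "hello")
post_tweet("b@example.com", "2014-05-01", "hi")
post_tweet("a@example.com", "2014-05-02", "again")
```

Each query then reads exactly one key from one structure, which mirrors reading a single partition rather than gathering data from across the cluster.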
Cassandra Query Language - CQL
• creating a keyspace - namespace of tables
CREATE KEYSPACE demo
WITH replication = {'class': 'SimpleStrategy',
'replication_factor': 3};
• to use namespace:
USE demo;
Cassandra Query Language - CQL
• creating tables:
CREATE TABLE users (
  email varchar,
  bio varchar,
  birthday timestamp,
  active boolean,
  PRIMARY KEY (email));

CREATE TABLE tweets (
  email varchar,
  time_posted timestamp,
  tweet varchar,
  PRIMARY KEY (email, time_posted));
Cassandra Query Language - CQL
• inserting data
INSERT INTO users (email, bio, birthday, active)
VALUES ('john.doe@bti360.com', 'BT360 Teammate',
516513600000, true);
Cassandra Query Language - CQL
• querying tables
• SELECT expression reads one or more records from
Cassandra column family and returns a result-set of rows
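SELECT * FROM users;
-- active is not part of the primary key, so this filter needs
-- a secondary index on active or ALLOW FILTERING
SELECT email FROM users WHERE active = true ALLOW FILTERING;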
Cassandra: Conclusion
Cassandra Advantages
• well suited to time-series data
• high performance
• decentralization
• nearly linear scalability
• replication support
• no single point of failure
• MapReduce support
Cassandra Weaknesses
● no referential integrity
● querying options for retrieving data are limited
● sorting data is a design decision
● no general multi-partition ACID transactions (atomicity is limited to a single partition)
● first think about queries, then about data model
Cassandra: Points to Consider
● Cassandra is designed as a distributed database
management system
● Cassandra write performance is typically excellent, but read
performance depends on write patterns
● having a high-level understanding of some internals is a plus


Editor's Notes

  • #6: It combines Amazon Dynamo’s fully distributed design with Google Bigtable’s column-oriented data model.
  • #7: designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure
  • #16: offering streaming movies through video game consoles, Apple TV, TiVo and more http://www.usatoday.com/story/life/tv/2013/09/18/netflix-hulu-amazon-nielsen-viewership-data/2831535/
  • #17: Central Oracle database -> everything in one place, convenient until it fails. Schema changes required downtime. Cassandra stores 3 local copies, 1 per zone. Synchronous access, durable. Replicates at destination. Global coverage: business agility. Local access: better latency, fault isolation.
  • #19: Geo-Replication – distribution of data across multiple regions
  • #29: SimpleStrategy is one of two replication strategies; the other is NetworkTopologyStrategy. SimpleStrategy is used for single-data-center clusters. It is the default. NetworkTopologyStrategy is used for multi-datacenter deployments.
  • #30: Table – when schema is static Column Family – when schema is dynamic
  • #35: No join or subquery support, and limited support for aggregation. This is by design, to force you to denormalize into partitions that can be efficiently queried from a single replica, instead of having to gather data from across the entire cluster. Ordering is done per-partition, and is specified at table creation time. Again, this is to enforce good application design; sorting thousands or millions of rows can be fast in development, but sorting billions in production is a bad idea.
  • #36: It’s important to analyze how you are going to query your data. Spending time to design your schema around your query pattern can save a lot of hassle debugging performance issues while also ensuring that you can scale easily. Additionally, having a high-level understanding of some of the internals, such as how deletions are implemented, how secondary indices operate, and when to use the row cache, can go a long way in designing a strong application built atop Cassandra.