DATABASE SHARDING AND
CASSANDRA ARCHITECTURE
SOUPIK CHOWDHURY
THE PROBLEM
● Consider a very popular shopping cart application (e.g. Amazon).
● Each user transaction needs to be stored in a database.
● Millions of people all around the world buy items every day.
● Eventually, the size of the database exceeds the maximum memory limit of a
single server.
SOLUTIONS
● The first solution is to add more memory or buy bigger machines.
○ This is called vertical scaling.
○ Hardware constraints put an upper limit on this approach.
● The second solution is vertical partitioning, i.e. moving some columns to a
different table.
○ This is not the same as normalization. Vertical partitioning can be applied even to a
normalized database.
○ This reduces the size of each row. However, there is again an upper bound.
● The third solution is to partition the table horizontally and store each partition on
a different server.
○ This is also called database sharding.
DATABASE SHARDING
● Sharding is the process of breaking up a large table into smaller chunks called shards.
● It partitions the table horizontally.
● Each shard has the same schema but entirely different rows.
● The partitioning can be based on a key attribute, a non-key attribute, or a set of
attributes.
● For example, we can partition the data based on the geographical location of the
users.
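The two partitioning choices above (by a non-key attribute such as region, or by the key itself) can be sketched as follows. This is a minimal illustration, not a production router: the shard names, the `region` and `user_id` fields, and the hash-and-remainder rule are all assumptions made for the example.

```python
import hashlib

# Hypothetical shard names; a real deployment would map these to servers.
REGION_SHARDS = {
    "asia": "shard-asia",
    "europe": "shard-europe",
    "americas": "shard-americas",
}
DEFAULT_SHARD = "shard-default"


def shard_by_region(row: dict) -> str:
    """Pick a shard from a non-key attribute (the user's region)."""
    return REGION_SHARDS.get(row["region"], DEFAULT_SHARD)


def shard_by_key(user_id: str, num_shards: int) -> int:
    """Pick a shard index from the key by hashing it and taking the remainder."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Every shard holds rows with the same schema, but a disjoint set of rows.
rows = [
    {"user_id": "u1", "region": "asia", "item": "book"},
    {"user_id": "u2", "region": "europe", "item": "pen"},
    {"user_id": "u3", "region": "asia", "item": "lamp"},
    {"user_id": "u4", "region": "africa", "item": "desk"},
]
shards = {}
for row in rows:
    shards.setdefault(shard_by_region(row), []).append(row)

for name, shard_rows in sorted(shards.items()):
    print(name, [r["user_id"] for r in shard_rows])
```

Routing by region keeps a user's data close to the user, while routing by a hash of the key tends to spread rows more evenly; which one fits depends on the queries the application runs.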
ADVANTAGES OF SHARDING
● Facilitates horizontal scaling of databases. The size of a table is no longer bounded
by the capacity of a single server.
● Increases request efficiency. Each request is served by a smaller subset of the
database, so it is typically faster.
● For read- and write-heavy databases, efficiency improves further if sharding is
based on a factor such as geographical location and each shard is stored on
servers near that location.
DISADVANTAGES OF SHARDING
● Implementing sharding manually is not easy, and a flawed implementation might not
work at all.
● Identifying the property or properties to shard on is the key decision and largely
determines performance. Choosing the wrong set of properties can cost more than it
gains (for example, if we need to join across shards to serve a request after
sharding).
● It might be difficult to return to the unsharded version.
● Not every database supports sharding natively; for those, we need to implement it
manually. This can add unnecessary complexity when the application is
small.
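The cost of a poorly chosen shard property can be made concrete with a small sketch. The data and the `total_spent` helper below are hypothetical: orders are assumed to be sharded by region, so a query about a single user has to visit every shard and combine the partial results.

```python
# Hypothetical order data, sharded by the region in which the order was placed.
shards = {
    "shard-asia": [
        {"user_id": "u1", "amount": 30},
        {"user_id": "u2", "amount": 10},
    ],
    "shard-europe": [{"user_id": "u1", "amount": 20}],
    "shard-americas": [{"user_id": "u3", "amount": 5}],
}


def total_spent(user_id: str):
    """Return (total amount, number of shards queried) for one user.

    The shard property (region) says nothing about which shard holds a
    user's orders, so every shard must be queried (scatter-gather).
    """
    total = 0
    shards_queried = 0
    for rows in shards.values():
        shards_queried += 1
        total += sum(r["amount"] for r in rows if r["user_id"] == user_id)
    return total, shards_queried


print(total_spent("u1"))
```

Had the same data been sharded by `user_id`, this query would touch a single shard; the trade-off is that a per-region report would then need the scatter-gather instead.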
INTRODUCTION TO CASSANDRA
● Cassandra is a distributed NoSQL database management system, originally developed at
Facebook and now maintained by the Apache Software Foundation.
● It is a peer-to-peer database that runs on a cluster of homogeneous
nodes.
● It handles large volumes of data while providing high availability.
● It provides high read and write throughput.
SOME PROPERTIES OF CASSANDRA
● Distributed - Every node in Cassandra has the same role. Data is distributed across
the cluster, there is no master, and each node can service any request, so there is
no single point of failure.
● Replication - Replication strategies are configurable. Replication avoids a single
point of failure: if a node goes out of service, its requests are served by another node.
● Eventual Consistency - Since data is replicated across nodes, replicas must be kept
synchronized. Cassandra follows the eventual consistency model, which states
that, provided there are no new updates, all replicas will
eventually return the last updated value.
● Consistent Hashing - Used to distribute data across the nodes (covered in the
following slides).
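The eventual consistency property can be illustrated with a toy model. This is only a sketch under stated assumptions, not Cassandra's actual synchronization mechanism: each replica is assumed to hold a `(timestamp, value)` pair, and a hypothetical `converge` step copies the newest pair to every replica.

```python
def converge(replicas: dict) -> dict:
    """Copy the (timestamp, value) pair with the highest timestamp to every replica."""
    newest = max(replicas.values())  # tuples compare by timestamp first
    return {name: newest for name in replicas}


# node-b has received an update that the other replicas have not seen yet.
replicas = {
    "node-a": (1, "cart=[book]"),
    "node-b": (2, "cart=[book, pen]"),
    "node-c": (1, "cart=[book]"),
}

# Before synchronization, a read may return a stale value from node-a or node-c.
synced = converge(replicas)

# With no new updates, every replica now returns the last updated value.
print(synced)
```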
CONSISTENT HASHING
● A hash function maps one piece of data, typically describing some kind of
object and often of arbitrary size, to another piece of data, typically an
integer, known as the hash code or simply the hash.
● Hash functions are widely used. Examples include cryptographic hashing of
sensitive information on websites, hashing data into short fixed-size
digests, and hashing strings for substring matching.
● Distributed Hashing - A hash table might grow very large, so we can
partition it into several parts and host each part on a different server.
CONSISTENT HASHING (CONTD…)
● Rehashing problem - If one of the servers crashes, its keys need to be
redistributed across the remaining servers. In normal distributed hashing (for
example, server = hash(key) mod N), adding or removing even a single server causes
most keys to be rehashed into different buckets. Thus, rehashing is expensive.
● Consistent hashing - Solves the rehashing problem, i.e. minimizes the rehashing when
servers are added or removed. Consistent hashing is a distributed hashing scheme
that operates independently of the number of servers or objects in a distributed
hash table by assigning them a position on an abstract circle, or hash ring. This
allows servers and objects to scale without affecting the overall system.
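The rehashing problem can be measured with a short experiment. The sketch below assumes the simplest distributed hashing rule, server = hash(key) mod N, and counts how many of 10,000 made-up keys change server when a cluster grows from 4 to 5 servers. The key names and the choice of SHA-256 are assumptions for the example.

```python
import hashlib


def stable_hash(key: str) -> int:
    """Deterministic hash of a key (Python's built-in hash() varies between runs)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def server_for(key: str, num_servers: int) -> int:
    """Naive distributed hashing: bucket = hash(key) mod N."""
    return stable_hash(key) % num_servers


keys = [f"user-{i}" for i in range(10_000)]

# Count the keys whose bucket changes when one server is added.
moved = sum(1 for k in keys if server_for(k, 4) != server_for(k, 5))
moved_fraction = moved / len(keys)
print(f"{moved_fraction:.0%} of keys changed server when going from 4 to 5 servers")
```

A key keeps its server only when its hash has the same remainder modulo 4 and modulo 5, which holds for 4 out of every 20 hash values, so roughly 80% of the keys are expected to move.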
CONSISTENT HASHING (IMPLEMENTATION)
● Map all the keys onto the circle using a suitable hash function (one that maps to an
angle in [0, 2π) radians).
● Map all the servers onto the circle as well, using the same hash function (or a
different one if needed).
● Assign each key to the server that is nearest to it in the clockwise direction.
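The three steps above can be sketched as a small hash ring. This is a minimal illustration under stated assumptions: positions are integers in [0, 2^32) standing in for angles in [0, 2π), SHA-256 is an arbitrary choice of hash function, and `HashRing`, `ring_position`, and the server names are hypothetical.

```python
import bisect
import hashlib


def ring_position(name: str) -> int:
    """Hash a key or server name to a point on the ring, in [0, 2**32)."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class HashRing:
    def __init__(self, servers):
        # Servers sorted by ring position; sorted order is the clockwise order.
        self._points = sorted((ring_position(s), s) for s in servers)
        self._positions = [pos for pos, _ in self._points]

    def lookup(self, key: str) -> str:
        """Return the first server at or after the key's position, wrapping around."""
        if not self._points:
            raise LookupError("ring has no servers")
        idx = bisect.bisect_left(self._positions, ring_position(key))
        return self._points[idx % len(self._points)][1]


keys = [f"user-{i}" for i in range(10_000)]
before = HashRing(["server-a", "server-b", "server-c", "server-d"])
after = HashRing(["server-a", "server-b", "server-c", "server-d", "server-e"])

# Only keys that now fall to the new server change owner; all others stay put.
moved = [k for k in keys if before.lookup(k) != after.lookup(k)]
print(f"{len(moved)} of {len(keys)} keys moved, all of them to server-e")
```

With only one position per server the arcs can be very uneven, so real systems often place each server at several positions on the ring; that refinement is left out here to keep the sketch short.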
CONSISTENT HASHING IN CASSANDRA
● Consistent hashing is used in Cassandra for sharding and also for load balancing.
● Let the database be sharded on a key. We use a hash function h, where h(key)
maps to a position on the circle.
● Suppose we have k servers. Map these servers onto the circle as well. Each key
is assigned to the nearest server in the clockwise direction.
● Now suppose a request comes with key = key1. We compute h(key1) and
serve the request using the server nearest to it in the clockwise direction.
REPLICATION IN CASSANDRA
● The database admin fixes a number K, called the replication factor.
● Data for a given key is stored not only on the nearest server in the clockwise
direction but on the nearest K servers.
● If a server goes down, all requests meant for that server go to the next server
in the clockwise direction.
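Replica placement and failover on the ring can be sketched as follows. This is a simplified model of the scheme described above, not Cassandra's implementation: `ring_position`, `replicas_for`, `serve`, and the server names are hypothetical, and the ring uses integer positions in [0, 2^32).

```python
import bisect
import hashlib


def ring_position(name: str) -> int:
    """Hash a key or server name to a point on the ring, in [0, 2**32)."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def replicas_for(key: str, servers, replication_factor: int):
    """Return the K nearest servers clockwise from the key's position."""
    points = sorted((ring_position(s), s) for s in servers)
    positions = [pos for pos, _ in points]
    start = bisect.bisect_left(positions, ring_position(key))
    count = min(replication_factor, len(points))
    return [points[(start + i) % len(points)][1] for i in range(count)]


def serve(key: str, servers, replication_factor: int, down=frozenset()):
    """Serve from the first replica, clockwise, that is not down."""
    for server in replicas_for(key, servers, replication_factor):
        if server not in down:
            return server
    raise LookupError("all replicas for this key are down")


servers = ["server-a", "server-b", "server-c", "server-d", "server-e"]
replicas = replicas_for("key1", servers, replication_factor=3)
print("replicas for key1:", replicas)

# If the first replica is down, the request goes to the next server clockwise.
print("served by:", serve("key1", servers, 3, down={replicas[0]}))
```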
REFERENCES
● Gaurav Sen’s lecture videos : Cassandra , Consistent Hashing
● Wikipedia : Cassandra , Sharding
● Stackoverflow : Sharding
THANK YOU
