Cassandra Essentials
Tutorial Series

Understanding Data Partitioning and Replication in Apache Cassandra
Agenda
› Overview of partitioning
› Setting up data partitioning
› Overview of replication
› Replication strategies (e.g. single, multi-data center)
› Replication mechanics
› Where to get Cassandra




                   www.datastax.com
Overview of Data Partitioning in Cassandra
Cassandra is a distributed database management
system that easily and transparently partitions your data
across all participating nodes in a database cluster. Each
node is responsible for part of the overall database.



[Figure: data is inserted and assigned a row key in a column family; the inserted row is placed on a node based on its column family row key.]




Overview of Data Partitioning in Cassandra
There are two basic data partitioning strategies:

1.  Random partitioning – this is the default and
    recommended strategy. Partitions data as evenly
    as possible across all nodes using an MD5 hash of
    every column family row key
2.  Ordered partitioning – stores column family row keys
    in sorted order across the nodes in a database
    cluster
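
As a rough illustration of random partitioning, the sketch below hashes a row key with MD5 and looks up the owning node on a token ring. It is a simplification under stated assumptions, not Cassandra's actual code: the four-node ring, the node names, the evenly spaced tokens, and the helper names (`md5_token`, `node_for_token`, `node_for_key`) are all hypothetical. Newer Cassandra releases default to a Murmur3-based partitioner rather than MD5.

```python
import bisect
import hashlib

# Hypothetical 4-node ring with evenly spaced tokens (assumption for illustration).
# Each node owns the range of tokens that ends at its own token.
RING_SIZE = 2 ** 127
NODE_TOKENS = [(i * RING_SIZE // 4, f"node{i + 1}") for i in range(4)]


def md5_token(row_key: str) -> int:
    """Map a row key to a ring position using its MD5 hash (simplified)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def node_for_token(token: int) -> str:
    """Return the first node whose token is >= the given token, wrapping around."""
    tokens = [t for t, _ in NODE_TOKENS]
    index = bisect.bisect_left(tokens, token) % len(NODE_TOKENS)
    return NODE_TOKENS[index][1]


def node_for_key(row_key: str) -> str:
    """Find the node responsible for a row key."""
    return node_for_token(md5_token(row_key))


if __name__ == "__main__":
    for key in ["alice", "bob", "carol", "dave"]:
        print(key, "->", node_for_key(key))
```

Because the hash scatters keys that are close in sort order, rows tend to spread evenly across nodes. An ordered partitioner would instead use the key itself as the ring position, which keeps sorted order but can concentrate load on a few nodes.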




Setting up Data Partitioning in Cassandra
The data partitioning strategy is set with the partitioner
option in the Cassandra configuration file (cassandra.yaml).
No other work, such as manual sharding, is needed to
partition data in Cassandra.

Note that once a cluster is initialized with a partitioner
option, it cannot be changed without reloading all of
the data in the cluster.




Overview of Replication in Cassandra
To ensure fault tolerance and no single point of failure,
you can replicate one or more copies of every row in a
column family across participating nodes in a
database cluster.


[Figure: data is inserted and assigned a row key in a column family (original row); a copy of the row is replicated across various nodes in the cluster based on the assigned replication factor.]


Overview of Replication in Cassandra
Replication is controlled by what is called the
replication factor. A replication factor of 1 means there
is only one copy of a row in a cluster. A replication
factor of 2 means there are two copies of a row stored
in a cluster.

Replication is controlled at the keyspace level in
Cassandra.

[Figure: original row and copy of row.]




Replication Strategies
There are different replication strategies:

Simple Strategy: places the original row on a node
determined by the partitioner. Additional replica rows
are placed on the next nodes clockwise in the ring
without considering rack or data center location.
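
The placement rule for the simple strategy can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical four-node ring listed in clockwise order and a primary node already chosen by the partitioner; `RING` and `simple_strategy_replicas` are names invented for this example.

```python
# Hypothetical ring, listed in clockwise order (assumption for illustration).
RING = ["node1", "node2", "node3", "node4"]


def simple_strategy_replicas(ring, primary_index, replication_factor):
    """Return the primary node plus the next nodes clockwise on the ring.

    Rack and data center are ignored. The count is capped at the ring size
    in this sketch, since each node holds at most one copy of a row.
    """
    count = min(replication_factor, len(ring))
    return [ring[(primary_index + i) % len(ring)] for i in range(count)]


if __name__ == "__main__":
    # Replication factor 2 with the primary on node4 wraps around to node1.
    print(simple_strategy_replicas(RING, 3, 2))
```

With a replication factor of 1 the function returns only the primary node, matching the "one copy of a row" case described earlier.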



[Figure: original row and copy of row.]




Replication Strategies
Network Topology Strategy: allows for replication
between different racks in a data center and/or
between multiple data centers. This strategy provides
more control over where replica rows are placed.




Replication Strategies
Network Topology Strategy: The original row is placed
according to the partitioner. Additional replica rows in
the same data center are then placed by walking the
ring clockwise until a node in a different rack from the
previous replica is found. If there is no such node,
additional replicas will be placed in the same rack.
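
The rack-aware walk described above might look roughly like the following within a single data center. This sketch follows the wording of the slide, not Cassandra's actual implementation, which may differ in detail. The ring layout, node names, rack names, and the function `rack_aware_replicas` are assumptions made for illustration.

```python
def rack_aware_replicas(ring, primary_index, replication_factor):
    """Pick replicas for one data center by walking the ring clockwise.

    ring is a list of (node, rack) pairs in clockwise order. After the
    primary, each further replica goes to the next unused node in a rack
    different from the previous replica's rack. If no such node exists,
    the next unused node clockwise is used, even if it is in the same rack.
    """
    n = len(ring)
    replicas = [ring[primary_index]]
    position = primary_index
    while len(replicas) < min(replication_factor, n):
        previous_rack = replicas[-1][1]
        chosen = None
        fallback = None
        # Walk clockwise from the previous replica, at most one full turn.
        for step in range(1, n):
            idx = (position + step) % n
            candidate = ring[idx]
            if candidate in replicas:
                continue
            if fallback is None:
                fallback = idx  # first unused node, used if no other rack is found
            if candidate[1] != previous_rack:
                chosen = idx
                break
        if chosen is None:
            chosen = fallback
        replicas.append(ring[chosen])
        position = chosen
    return [node for node, _ in replicas]


if __name__ == "__main__":
    ring = [("n1", "rack1"), ("n2", "rack1"), ("n3", "rack2"), ("n4", "rack2")]
    print(rack_aware_replicas(ring, 0, 3))
```

In the example ring, the second replica skips n2 because it shares a rack with the primary n1, so the copies alternate between racks where possible.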




Replication Strategies
Network Topology Strategy: To replicate data across
one or more data centers, a replica group is defined
and mapped to each logical or physical data center.
This definition is specified when a keyspace is created
in Cassandra.




Replication Strategies
Below is a CQL example of creating a keyspace that
uses the Network Topology replication strategy and has
three data replicas:

CREATE KEYSPACE mykeyspace WITH
strategy_class = 'NetworkTopologyStrategy' AND
strategy_options:DC1 = 3;


In the statement above, DC1 is the replica group and 3 is the number of replicas.

[Figure: original row with a first and second copy of the row.]




Replication Mechanics
Cassandra uses a snitch to define how nodes are
grouped together within the overall network topology
(such as rack and data center groupings). The snitch is
defined in the cassandra.yaml file.




Replication Mechanics
The basic snitches include:

1.  Simple Snitch – the default; used with the simple
    replication strategy.
2.  Rack Inferring Snitch – infers the topology of the network by
    analyzing the node IP addresses. This snitch assumes that
    the second octet identifies the data center where a node
    is located, and the third octet identifies the rack.
3.  Property File Snitch – determines the location of nodes by
    referring to a user-defined description of the network
    details located in the property file
    cassandra-topology.properties.
4.  EC2 Snitch – for deployments on Amazon EC2 only.
    Instead of using the IP address to infer node location, this
    snitch uses the AWS API to request the region and availability zone.
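
The octet convention used by the Rack Inferring Snitch can be illustrated with a short sketch. It only mirrors the rule stated in the list above; the `Location` type and `infer_location` function are hypothetical names, and the IP address is made up.

```python
from typing import NamedTuple


class Location(NamedTuple):
    data_center: str
    rack: str


def infer_location(ip_address: str) -> Location:
    """Infer data center and rack from an IPv4 address.

    Follows the convention described for the Rack Inferring Snitch:
    the second octet names the data center, the third octet the rack.
    """
    octets = ip_address.split(".")
    if len(octets) != 4:
        raise ValueError(f"expected an IPv4 address, got {ip_address!r}")
    return Location(data_center=octets[1], rack=octets[2])


if __name__ == "__main__":
    # Under this convention, 10.20.30.40 sits in data center "20", rack "30".
    print(infer_location("10.20.30.40"))
```

This only works when IP addresses are assigned to match the physical layout. Otherwise an explicit mapping, as with the Property File Snitch, is the safer choice.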


Reading and Writing to Cassandra Nodes
Cassandra has a read/write-anywhere architecture: any
user can connect to any node in any data center and
read or write the data they need, and all writes are
partitioned and replicated automatically throughout
the cluster.




Where to get Cassandra?
› Go to www.datastax.com
› DataStax makes free smart start installers
   available for Cassandra that include:
   › The most up-to-date production-quality Cassandra version
   › A version of DataStax OpsCenter, a visual,
      browser-based tool for managing and monitoring Cassandra
   › Drivers and connectors for popular development
      languages
   › Sample database and application
   › Automatic configuration assistance for ensuring
      optimal performance and setup for either
      stand-alone or cluster implementations
   › Getting Started Guide

Where Can I Learn More?




www.datastax.com

› Free Online Documentation
› Technical White Papers
› Technical Articles
› Tutorials
› User Forums
› User/Customer Case Studies
› FAQs
› Videos
› Blogs
› Software downloads



Cassandra Essentials
Tutorial Series
Understanding Data Partitioning and Replication in Apache Cassandra
Thanks!
