Apache Cassandra

Getting started with Apache
Cassandra and Python
By
Adnan Siddiqi
(http://adnansiddiqi.me)

What is Apache Cassandra?
• According to Wikipedia:
Apache Cassandra is a free and open-source,
distributed, wide column store, NoSQL database
management system designed to handle large
amounts of data across many commodity servers,
providing high availability with no single point of
failure. Cassandra offers robust support for
clusters spanning multiple datacenters,[1] with
asynchronous masterless replication allowing low
latency operations for all clients.

History
• Developed by two Facebook engineers to power
the search feature of the Facebook Inbox.
• Released as an open-source project in 2008.
• Handed over to the Apache Software Foundation.

Companies using Cassandra
• Apple
• Netflix
• eBay
• Weather Channel

Architecture(Contd…)
• Node:- The basic component of a cluster, a
machine where the data is stored.
• Datacenter:- A collection of related nodes. It
can be a physical datacenter or virtual.
• Cluster:- A cluster contains one or more
datacenters and can span multiple locations.
• Commit Log:- Every write operation is first
stored in the commit log. It is used for crash
recovery.

Architecture(Contd…)
• Memtable:- After data is written to the
commit log, it is stored in the Memtable
(memory table), where it remains until the
Memtable reaches its size threshold.
• SSTable:- A Sorted String Table (SSTable) is a
disk file that stores the data flushed from the
Memtable once it reaches the threshold.
SSTables are written to disk sequentially and
maintained for each database table.

Write Operations(Contd…)
• Each write request is first recorded in the
commit log to make sure that the data is saved.
• The data is then written to the Memtable, which
holds it until the Memtable reaches its threshold.
• The data is flushed to an SSTable once the
Memtable reaches its threshold.
• The node that accepts the client request is
called the coordinator.
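The write path above can be sketched as a toy model in Python. This is an illustration only, not Cassandra's storage engine: the class name `ToyNode`, the `memtable_threshold` parameter, and counting the threshold in keys rather than bytes are all assumptions made for the example.

```python
class ToyNode:
    """Toy model of one node's write path (hypothetical, for illustration)."""

    def __init__(self, memtable_threshold=3):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # in-memory writes waiting to be flushed
        self.sstables = []     # flushed snapshots, each sorted by key
        self.memtable_threshold = memtable_threshold  # assumed: max number of keys

    def write(self, key, value):
        # 1. Record the write in the commit log first, for crash recovery.
        self.commit_log.append((key, value))
        # 2. Apply the write to the memtable.
        self.memtable[key] = value
        # 3. Flush to an SSTable once the memtable reaches its threshold.
        if len(self.memtable) >= self.memtable_threshold:
            self.flush()

    def flush(self):
        # An SSTable keeps its rows sorted by key.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}


node = ToyNode(memtable_threshold=2)
node.write("karachi", "PK")
node.write("austin", "US")   # second key reaches the threshold and triggers a flush
node.write("berlin", "DE")
print(node.sstables)  # [[('austin', 'US'), ('karachi', 'PK')]]
print(node.memtable)  # {'berlin': 'DE'}
```

The sketch leaves out commit log cleanup after a flush, compaction, and replication to other nodes.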

Read Operations
• Direct Request:- The coordinator node sends
the read request to one of the replicas.
• Digest:- The coordinator contacts the replicas
specified by the consistency level. The
contacted nodes respond with a digest (hash)
of the requested data. The digests are compared
to make sure that the most up-to-date data
is sent back.
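A rough sketch of the digest comparison, in Python. The replica layout (a dict mapping a key to a `(timestamp, value)` pair), the function names, and the use of MD5 as the hash are assumptions for the example, not the driver's or the server's API.

```python
import hashlib


def digest(value):
    """Return a short fingerprint of a value (MD5 used only as an example hash)."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()


def coordinator_read(replicas, key):
    """Toy read: full data from the first replica, digests from the others.

    `replicas` is a hypothetical list of dicts mapping key -> (timestamp, value).
    """
    _, value = replicas[0][key]                              # direct request
    data_digest = digest(value)
    other_digests = [digest(r[key][1]) for r in replicas[1:]]  # digest requests
    if all(d == data_digest for d in other_digests):
        return value                                         # replicas agree
    # Digests differ: compare full rows and return the most recent write.
    return max(r[key] for r in replicas)[1]


replica_a = {"user1": (1, "old@example.com")}
replica_b = {"user1": (2, "new@example.com")}
print(coordinator_read([replica_a, replica_b], "user1"))  # new@example.com
```

Sending digests instead of full rows keeps network traffic low when the replicas agree, which is the common case.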

Replication Strategies
• Simple Strategy
• Network Topology Strategy

Simple Strategy
• It is used when you have only one data center.
It places the first replica on the node selected
by the partitioner. A partitioner determines
how data is distributed across the nodes in the
cluster (including replicas). After that,
remaining replicas are placed in a clockwise
direction in the Node ring.
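The clockwise placement can be sketched in a few lines of Python. The ring below, with four nodes and tokens 0 to 99, is made up for the example, and the key's token is assumed to have already been computed by the partitioner.

```python
import bisect


def simple_strategy_replicas(ring, key_token, replication_factor):
    """Pick replica nodes by walking a toy token ring clockwise.

    `ring` is a hypothetical list of (token, node_name) pairs.
    """
    ring = sorted(ring)
    tokens = [token for token, _ in ring]
    # First replica: the first node whose token is >= the key's token,
    # wrapping around to the start of the ring if there is none.
    start = bisect.bisect_left(tokens, key_token) % len(ring)
    count = min(replication_factor, len(ring))
    # Remaining replicas: the next nodes in clockwise order.
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


ring = [(0, "node1"), (25, "node2"), (50, "node3"), (75, "node4")]
print(simple_strategy_replicas(ring, key_token=30, replication_factor=2))
# ['node3', 'node4']
print(simple_strategy_replicas(ring, key_token=80, replication_factor=3))
# ['node1', 'node2', 'node3']
```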

Network Topology Strategy
• Used for deployments across multiple datacenters.
• Within each datacenter, this strategy places
replicas by traversing the ring clockwise
until reaching the first node in another rack.
• This strategy is highly recommended for
scalability and future expansion.
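As a simplified sketch of the rack-aware walk inside one datacenter, the Python below prefers nodes on racks that do not hold a replica yet. The ring layout and function name are hypothetical, and the fallback used when there are fewer racks than replicas is an assumption of this example rather than a statement of Cassandra's exact behaviour.

```python
def rack_aware_replicas(ring, start_index, replication_factor):
    """Choose replicas for one datacenter, spreading them across racks.

    `ring` is a hypothetical list of (node_name, rack) pairs in clockwise order.
    """
    chosen, seen_racks, skipped = [], set(), []
    size = len(ring)
    for step in range(size):
        node, rack = ring[(start_index + step) % size]
        if rack not in seen_racks:
            chosen.append(node)       # first node found on a new rack
            seen_racks.add(rack)
        else:
            skipped.append(node)      # rack already holds a replica
        if len(chosen) == replication_factor:
            return chosen
    # Fewer racks than replicas: fill up with skipped nodes, in ring order.
    return (chosen + skipped)[:replication_factor]


ring = [("n1", "rack1"), ("n2", "rack1"), ("n3", "rack2"), ("n4", "rack2")]
print(rack_aware_replicas(ring, start_index=0, replication_factor=2))
# ['n1', 'n3']
```

Spreading replicas across racks means that losing a single rack does not take out every copy of a partition.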

Network Topology Strategy(Contd…)

Installation and Setup
• Dockerized Version.
• docker pull cassandra
• Make sure to set the Docker memory to at least
4GB to avoid exit code 137, which usually means
the container was killed for running out of memory.

Installation and Setup(Contd…)
• docker exec -it cas1 nodetool status

Cassandra Data Modeling
• Keyspace:- It is the container for a collection
of column families. You can think of it as a
Database in the RDBMS world.
• Column Family:- A column family is a
container for an ordered collection of rows.
Each row, in turn, is an ordered collection of
columns. Think of it as a Table in the RDBMS
world.

Cassandra Data Modeling(Contd…)

Creating KeySpace
• Creating a keyspace named CityInfo.
• create keyspace CityInfo with replication =
{'class': 'SimpleStrategy', 'replication_factor': 2};
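For use from Python, the same statement can be assembled by a small helper. The function name `create_keyspace_cql` is hypothetical, and `IF NOT EXISTS` is added here only so the statement can be run more than once.

```python
def create_keyspace_cql(name, replication_factor):
    """Build a CREATE KEYSPACE statement that uses SimpleStrategy."""
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {name} "
        "WITH replication = "
        f"{{'class': 'SimpleStrategy', 'replication_factor': {replication_factor}}}"
    )


statement = create_keyspace_cql("CityInfo", 2)
print(statement)
# With a connected cassandra-driver session, the statement could then be
# run with: session.execute(statement)
```

Cassandra folds unquoted identifiers to lower case, so the keyspace created here is addressed as cityinfo.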

Data Modeling Goals
• Spread data evenly across the cluster.
• Minimize the number of partitions read.

Cassandra and Python
• pip install cassandra-driver

Reading Data
from cassandra.cluster import Cluster

if __name__ == "__main__":
    # Connect to the local node on the default CQL port.
    cluster = Cluster(['0.0.0.0'], port=9042)
    # Passing the keyspace to connect() makes it the session default.
    session = cluster.connect('cityinfo', wait_for_all_pools=True)
    session.execute('USE cityinfo')
    rows = session.execute('SELECT * FROM users')
    for row in rows:
        print(row.age, row.name, row.username)
